Combining Knowledge and Data Mining to Understand Sentiment

WHITE PAPER

Combining Knowledge and Data Mining
to Understand Sentiment – A Practical
Assessment of Approaches

Table of Contents

Abstract
Introduction
The Elements of Sentiment Analysis
  What Is Sentiment Analysis?
  When Is It Relevant?
  Elements of Sentiment Analysis
Sentiment Analysis Methods
  The Data
  Data Mining Approach
    Benefits of the data mining approach
    Drawback of the data mining approach
  Natural Language Processing Approach
    Step one: taxonomy identification
    Step two: defining objects and attributes
    Step three: defining polarity
    Benefits of the NLP approach
    Drawback of the NLP approach
The Best of Both Worlds
  Data Mining of the Text for the Rule Builder
  Hybrid Approaches
    Polarity scores as additional features
    Stacked models
Results
  Attribute-Level Results
  Overall Results
Other Applications
  Importing Models
  Creating Training Data
  Other Capabilities of SAS® Enterprise Miner™
Conclusions
References


Russell Albright is a Research Statistician Developer at SAS and has been
working on SAS® Text Miner algorithms since its initial release more than 10
years ago. He holds a master’s and a doctorate in applied math from Clemson
University. Albright has expertise in numerical matrix methods and Bayesian
networks, and he has experience applying text mining to many Web-based
sources, including Twitter, Yahoo and PubMed.

Praveen Lakkaraju is a Software Developer at SAS and is a member of the
SAS Text Analytics research and development team. His areas of experience
include sentiment analysis, information retrieval and content categorization.
He was instrumental in the launch of the SAS Social Media Analytics solution,
and is still actively involved in its development. Lakkaraju holds a master’s in
computer science from the University of Kansas, where he specialized in the
field of natural language processing.


Abstract

An important application of text analytics is to automatically characterize the
sentiment of documents in a variety of domains, whether it is positive, negative
or neither. In this paper we explore the benefits of combining domain-specific
linguistic rules with data mining methods to improve both the effectiveness of
your models and the efficiency of the model builder.

Introduction

Our world has changed drastically in the last 10 years. An individual’s opinions
are no longer shared only with his or her immediate family and friends, but
instead are capable of influencing the decisions of thousands or even millions of
people the individual has never even met. The Internet has given the individual a
platform to broadcast grievances and recommendations that can reach across
the world. And the existence of social networks gives these opinions the potential
to snowball into a viral frenzy that can make your company’s products or services
a worldwide boon or a global catastrophe in just a matter of days.

The savvy marketer monitors and evaluates relevant Web content continually to
understand consumer sentiment toward products or services from his company
– and toward his competitors. This attention to Web content allows the company
to respond quickly to customer opinion.

The sheer volume of references related to your company’s products or services
makes automating this task essential. Sources such as blogs, product reviews,
forums and news articles can all be monitored, scored for relevance against your
topics of interest, and then classified according to sentiment.

The Elements of Sentiment Analysis

What Is Sentiment Analysis?

Sentiment analysis is an automatic method that provides feedback to you
regarding the opinions and attitudes of your customers. The analysis is based
on customers’ electronic written commentaries regarding your products and
services and those of your competitors. The feedback can be provided at a
very high level with drill-down so that you can explore how opinions differ within
groups, subgroups and even at the individual level.

More precisely, sentiment analysis is the process of classifying or rating the opinions
or sentiment expressed in a document. The rating may assign the sentiment into
one of three categories: positive, negative or neutral; or it may, instead, assign a
numeric score. The rating that is assigned is termed polarity. The sentiment may be
assessed for the entire document or for particular objects or attributes mentioned in
the document.

When Is It Relevant?

Sentiment analysis is relevant in almost every context in which your customers or
potential customers express themselves in written form – and possibly spoken form –
via different communication channels. These comments may not have been intended
for direct consumption by your company. They may have been posted in website
forums, tweets, blogs or other Web pages and directed toward your potential
customers. On the other hand, some content may have been intentionally directed at
your company through e-mail, a company support website, a survey questionnaire, a
call center desk, etc.

Automated sentiment analysis is important to implement when you are inundated
with relevant, useful feedback through these channels. For many companies, it
is impossible for individuals to monitor and understand all that is communicated
in these sources due to their sheer volume. The information comes too quickly
and from too many channels. Sentiment analysis provides you with an immediate
interpretation, not just of every individual comment but also of the global opinions
expressed.

Elements of Sentiment Analysis

You cannot implement a comprehensive sentiment analysis solution with a process
that merely analyzes the sentiment of a document. Instead, you must coordinate
several tasks to maximize the benefits.

1. Data acquisition phase. This phase involves setting up an automated process to
obtain a clean set of documents to analyze. You can use SAS software to obtain
the documents from the Internet and from local file systems or databases. SAS
software can also be used to filter the documents by eliminating any “noise” that
is common to Web documents (e.g., filtering spam).

2. Sentiment assignment phase. This phase involves creating a model that can
calculate the polarity of the author’s sentiment or opinion toward your topics of
interest and apply that model to naïve documents. SAS technologies can help you
derive accurate assessments of sentiment.

3. Summarization and reporting phase. Identifying sentiment within a particular
document is interesting in itself, but frequently it will be of more interest to
characterize representative populations within your collection. SAS provides
techniques for such exploration, which entails answering questions such as:

• Does the age of our customer tend to make a difference in his or her opinion
about our service?

• How do the cumulative opinions about our competitor’s product compare with
the cumulative opinions about our product?

• Did our customers perceive the changes we made to our outlet stores as
beneficial, or not?

4. Repetition phase. The final step in your sentiment analysis project will be to set
up a process to automate the entire analysis on a repeated basis. This allows you
to monitor sentiment changes, identify important influencers and respond quickly
to what you learn.

For this paper we will focus primarily on the sentiment assignment phase. Note
that because text is written in natural language rather than in a precise quantitative
representation, analyzing it for sentiment poses many challenges.

For one, natural language text is full of ambiguities, implicit meaning and subtle
nuances. Normally a human reader has the necessary experience to both
understand natural language expressions and to comprehend the meaning of the
subject area along with the sentiment the author intended to communicate. But
automating this process in a computer can be challenging. Such things as slang,
pronoun resolution, sarcasm and idioms all make a direct interpretation of the text
difficult.

Further, an automatic process will not function at the semantic level of the text at all
unless there is a direct mapping of a linguistic rule to semantics. In many instances
this can be captured with the rules we will discuss later; but the diversity of ways to
express the same meaning can make it difficult to accurately capture all situations
with a set of rules.

There are two primary approaches to building models for sentiment analysis. The
first, natural language processing, uses a domain expert to build a set of linguistic
rules to determine the sentiment polarity of the document’s content. The second,
machine learning, uses training data (documents that have the sentiment polarity
already assigned to them) to build a predictive model. Predictive models such as
decision trees, logistic regressions or neural networks will make this prediction on
documents that are outside the training set.

Sentiment Analysis Methods

The Data

We will use two collections of movie review data to demonstrate the techniques
presented in this paper. The first collection, created by Pang and Lee, contains 2,000
movie reviews. The collection is split evenly with 1,000 positive and 1,000 negative
reviews.[1] The second collection was obtained by retrieving 6,631 movie reviews
from Yahoo.[2] This collection has both overall ratings for the movie being discussed
and also ratings for several attributes of each movie, including the story line, cast,
direction and visuals.

Although your data is almost certainly not movie review data, the concepts and
techniques demonstrated using this movie data are applicable to most other
sentiment-related text data sets.

Data Mining Approach

A data mining approach to sentiment analysis translates an unstructured text
problem to one that makes predictions on structured, quantitative data. The
approach borrows several techniques from computational linguistics and information
retrieval communities to represent the text numerically, and then applies traditional
data mining techniques to this numeric representation. In the end, a target variable is
identified and a pattern is discovered from the training data for predicting sentiment
polarity. This pattern can then be used to predict new observations.

The first step in creating the numeric representation is to convert the entire training
collection into a document-by-term frequency matrix. Each document is parsed into
individual terms, or term/part-of-speech pairs. Then the set of all terms becomes
the variables in the data set so that documents are now represented as vectors of
length equal to the number of distinct terms in the collection. These vectors are very
sparse, containing mostly zeroes – because any one document contains a very small
percentage of the terms in the collection. Once the documents are represented as
vectors, the frequencies in each cell can be weighted with a function that takes into
account the distribution of the term across the collection and relative to the levels of
the target variable.
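As a minimal sketch of this step, the Python below builds a document-by-term frequency matrix and then weights each term. Two things here are assumptions rather than part of the paper: the whitespace tokenizer (no part-of-speech tagging) and inverse document frequency as the weighting function. IDF reflects only the distribution of a term across the collection; a weight that also accounts for the levels of the target variable would need the sentiment labels.

```python
import math
from collections import Counter

def build_term_matrix(docs):
    """Return (vocab, rows) where rows[i][j] counts term vocab[j] in document i."""
    tokenized = [doc.lower().split()for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocab)          # mostly zeroes: the vectors are sparse
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows

def idf_weight(rows):
    """Scale each column by log(N / document frequency) (assumed weighting)."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) for df in doc_freq]
    return [[count * idf[j] for j, count in enumerate(row)] for row in rows]

# Toy collection, invented for illustration.
docs = [
    "great movie great cast",
    "bad movie",
    "great visuals",
]
vocab, counts = build_term_matrix(docs)
weighted = idf_weight(counts)
```

Each row of `counts` is one document vector whose length equals the number of distinct terms; `weighted` holds the same vectors after term weighting.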

After these document vectors are formed, a dimension reduction technique – such
as the singular value decomposition (see Taming Text with the SVD, Albright, 2004)
– is typically used to represent each document in a reduced-dimensional space
of maybe 50 to 100 variables, where each variable is a linear combination of the
weighted terms that originally represented each document.
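To illustrate the reduction, the sketch below approximates the largest singular triplets of a small document-by-term matrix with power iteration and deflation, then projects each document onto them. This is a stand-in chosen so the example needs only the standard library; it is not the SVD routine used by SAS Text Miner, and a production system would call a proper sparse SVD. The function names, the start vector and the toy matrix are assumptions, and k = 2 is used instead of the 50 to 100 dimensions mentioned above.

```python
import math

def matvec(a, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def top_singular_triplets(matrix, k, iterations=200):
    """Approximate the k largest (sigma, u, v) triplets by power iteration."""
    a = [row[:] for row in matrix]      # work on a copy; it is deflated below
    n_terms = len(a[0])
    triplets = []
    for _ in range(k):
        at = transpose(a)
        v = [float(j + 1) for j in range(n_terms)]   # arbitrary start vector
        v = [x / norm(v) for x in v]
        for _ in range(iterations):
            w = matvec(at, matvec(a, v))             # (A^T A) v
            n = norm(w)
            if n == 0.0:
                break
            v = [x / n for x in w]
        u = matvec(a, v)
        sigma = norm(u)
        if sigma == 0.0:
            break
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before looking for the next.
        for i in range(len(a)):
            for j in range(n_terms):
                a[i][j] -= sigma * u[i] * v[j]
    return triplets

# Toy weighted matrix: documents 0 and 1 share terms, document 2 does not.
doc_term = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
triplets = top_singular_triplets(doc_term, k=2)
# Coordinates of each document in the reduced two-dimensional space.
reduced = [[sigma * u[i] for sigma, u, _ in triplets] for i in range(len(doc_term))]
```

Each entry of `reduced` is a linear combination of the original term weights, so documents that use the same terms land on the same coordinates.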

Finally, these reduced-dimensional vectors, together with the sentiment variable, can
be supplied to a predictive model. The model will attempt to learn from the training
data by utilizing patterns in the reduced-dimensional vector. This predictive model will
then create a function that will predict the sentiment for any document.

[1] The Pang and Lee movie review data is available at: http://www.cs.cornell.edu/People/pabo/movie-review-data
[2] Yahoo movie reviews were obtained from: http://movies.yahoo.com

Benefits of the data mining approach

The data mining approach is appealing because it is based on learning patterns that
are useful for making automated, efficient predictions. The algorithms are capable
of discovering unimagined and complicated patterns that would be beyond what a
human could anticipate. Frequently, a data mining approach can beat a rule-based
approach in topic classification. Of course, this is dependent on having enough
training data to build the model.

Drawback of the data mining approach

The vector-based representation of a document, which is required for data mining
techniques, does not maintain information that is potentially important to sentiment
classification. For example, the vector representation does not capture when terms
are close to one another in the document, if one term precedes another or any other
contextual cues. The order of terms in a phrase can significantly affect meaning.
Consider the phrases:

“… night for a great movie”

and

“… great night for a movie”

These two phrases convey two different meanings; yet in a vector representation, the
phrases have an identical representation.
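The point can be checked directly. The short Python sketch below uses a plain term-count vector (an assumed, simplified stand-in for the weighted vectors described earlier) and shows that the two phrases are indistinguishable, whereas a representation that keeps adjacent word pairs is not.

```python
from collections import Counter

def bag_of_words(text):
    """Term counts only: word order is discarded."""
    return Counter(text.lower().split())

def bigrams(text):
    """Adjacent word pairs: a simple way to retain some order information."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

phrase_a = "night for a great movie"
phrase_b = "great night for a movie"

same_vector = bag_of_words(phrase_a) == bag_of_words(phrase_b)
same_bigrams = bigrams(phrase_a) == bigrams(phrase_b)
```

Here `same_vector` is true while `same_bigrams` is false: only the second representation can tell that "great" modifies "movie" in one phrase and "night" in the other.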

In addition, most predictive models provide little feedback to the user as to precisely
why a particular document was classified as having positive or negative polarity. So
when you attempt to understand what positive things people said in a particular
document, you frequently have to read the entire document to discover the answer.

As a final drawback, forming the training and validation data is an essential component
of learning a predictive model, but it can be very time-consuming and challenging.
A rating needs to be provided for every document, and if there are attributes of
documents that you wish to use to measure sentiment, you will need to provide a
rating for each of these as well. Another complication is that two different reviewers
frequently assign two different sentiment ratings to the same document. This can
introduce unexpected errors in building and measuring the performance of your
model.

Natural Language Processing Approach

Natural language processing (NLP) is a field of artificial intelligence that deals with
automatically extracting meaning from natural language text. As discussed in the
introduction of this paper, it’s very challenging to get machines to understand text at
the same levels as humans. Doing this with the specific goal of extracting sentiment
is even more challenging. For example, consider the text snippet below:

“… with that out of the way, let me say this – this film is bad. This film is really, really
bad. Yet somehow, it is strangely enjoyable. …”

If interpreted by a human, the above text would imply a positive sentiment from the
author toward the movie. However, it can be very challenging to get the same output
from a computer because of the dense presence of the strongly negative words.

The rule-based NLP methods use certain entities and syntactic patterns in the text
to understand its meaning. SAS Sentiment Analysis provides all the tools needed
for this kind of disambiguation. You can use a combination of language dictionaries,
linguistic constructs like parts of speech, and noun phrases along with a range of
operators.

The operators fall into a few different categories as shown below:

• Boolean operators. Used to include or exclude different entities (e.g., AND, OR,
NOT).

• Frequency operators. Used to measure the specified number of occurrences of
certain entities (e.g., MIN, MINOC, MAXOC).

• Context operators. Used to measure the context within which certain entities
occur in the text (e.g., DIST, START, END, SENT, PARA).

• Sequence operators. Used to look for the entities in a specific sequence (e.g.,
ORD, ORDDIST).
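To make the operator categories concrete, the Python sketch below implements three simplified checks over a tokenized sentence: a distance test, an ordering test and a minimum-occurrence test. These are illustrations of the ideas only. The function names are hypothetical, and the code does not reproduce the syntax or exact semantics of the SAS operators listed above.

```python
def positions(tokens, term):
    """Indexes at which a term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within_distance(tokens, first, second, max_gap):
    """Context-style check: the two terms occur within max_gap tokens of each other."""
    return any(
        abs(i - j) <= max_gap
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )

def in_order(tokens, first, second):
    """Sequence-style check: some occurrence of first precedes some occurrence of second."""
    first_pos = positions(tokens, first)
    second_pos = positions(tokens, second)
    return bool(first_pos and second_pos) and min(first_pos) < max(second_pos)

def min_occurrences(tokens, term, n):
    """Frequency-style check: the term occurs at least n times."""
    return len(positions(tokens, term)) >= n

tokens = "the plot was great but the ending was not great".split()

plot_near_great = within_distance(tokens, "plot", "great", 2)
plot_before_ending = in_order(tokens, "plot", "ending")
ending_before_plot = in_order(tokens, "ending", "plot")
great_twice = min_occurrences(tokens, "great", 2)
```

Boolean operators then correspond to combining such checks with `and`, `or` and `not`.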

The process of developing rule-based models for sentiment analysis involves a few
different steps. These are explained below.

Step one: taxonomy identification

The initial step in the NLP approach is taxonomy identification. Taxonomy here
refers to a simple, two-level hierarchy where you specify the different objects and
attributes for which you want to extract sentiment. You can either use a predefined
taxonomy or you can use text mining to learn the most prominent objects and their
attributes in the corpus and then make them part of your taxonomy. Figure 1 shows
the predefined taxonomy that we used for extracting sentiment from the movie review
data. The discovery-based text mining methods are discussed later in this paper.

Figure 1: Taxonomy for movie reviews.

Step two: defining objects and attributes

The next step is to define the objects and their attributes. A basic approach to
defining these is to identify their synonyms or the different ways they may be referred
to in the text. Figure 2 shows an example.

Figure 2: Example of defining the visuals attribute.

While this approach captures many cases, in other situations the attribute might be
referred to using its co-referent. Consider the example below:

“The movie starred Jennifer Aniston. The plot of the movie was very interesting.
Aniston’s performance was commendable. She looks adorable.”

Here the full name of the actress was mentioned only in the first sentence. In the
subsequent sentences, the actress was referred to using her last name and
a pronoun. These three entities are said to be co-referent and the process of
identifying them is called co-reference resolution. The rule-based methods allow you
to write rules to handle such cases.

Step three: defining polarity

Polarity is determined by associating predefined positive or negative terms or
expressions with the attributes that have been identified. Dictionaries of subjective
expressions are available and can be customized to specific domains (see Figure 3).

Figure 3: Example of a generic dictionary of positive keywords.

You could also define multiple classes of subjective expressions to denote different
levels of subjectivity.

“incredible,” “stunning” ➔ strong positive
“hate,” “disgust” ➔ strong negative

Assigning the appropriate polarity requires that negations are handled properly. To do
this, you can use a combination of part-of-speech tags and dictionaries as shown in
Figures 4 and 5.

Figure 4: Example of a class of negated adjectives.

In Figure 4, “NegClass” is a dictionary of expressions that denote a negation, for
example “not,” “will not” and “have not.” The tags “:Adv,” “:A” and “:V” represent any
adverb, adjective and verb, respectively.

Figure 5: Example of a negation rule.
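The idea behind a negation rule can be sketched in a few lines of Python. This is an assumed simplification, not the rule shown in the figures: it uses tiny hypothetical word lists, ignores part-of-speech tags, and flips the polarity of a subjective word whenever a negator appears within a fixed window of tokens before it.

```python
import re

# Hypothetical, deliberately small dictionaries for illustration.
NEGATORS = {"not", "never", "no"}
POSITIVE = {"good", "great", "interesting"}
NEGATIVE = {"bad", "boring", "awful"}

def polarity_hits(text, window=2):
    """Return (term, polarity) pairs; polarity is flipped after a nearby negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            polarity = 1
        elif tok in NEGATIVE:
            polarity = -1
        else:
            continue
        preceding = tokens[max(0, i - window):i]
        if any(word in NEGATORS for word in preceding):
            polarity = -polarity        # "not very good" counts as negative
        hits.append((tok, polarity))
    return hits

negated = polarity_hits("The plot was not very good.")
plain = polarity_hits("The cast was great.")
double = polarity_hits("The ending was not bad.")
```

Because multiword negators such as "will not" and "have not" end in "not", the single-token check covers them in this sketch; a real dictionary-based rule would match them as phrases.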

Finally, to extract the sentiment at attribute level, you can write context-based rules
as shown in Figure 6, where we used a combination of operators.

Figure 6: Example of an attribute-level sentiment rule.
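A rough Python analogue of an attribute-level rule is sketched below. It is not the rule from Figure 6: the attribute synonym lists and polarity words are hypothetical, and the only context constraint is that a polarity word must occur in the same sentence as an attribute term.

```python
import re

# Hypothetical synonym lists and polarity dictionaries.
ATTRIBUTE_TERMS = {
    "visuals": {"visuals", "special effects", "cinematography"},
    "cast": {"cast", "acting", "actor", "actress"},
}
POSITIVE = {"stunning", "incredible", "good", "great"}
NEGATIVE = {"bad", "awful", "weak"}

def attribute_sentiment(review):
    """Sum polarity words per attribute, sentence by sentence."""
    scores = {name: 0 for name in ATTRIBUTE_TERMS}
    for sentence in re.split(r"[.!?]+", review.lower()):
        words = re.findall(r"[a-z']+", sentence)
        polarity = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        normalized = " ".join(words)
        for name, terms in ATTRIBUTE_TERMS.items():
            # Whole-word match so that "cast" does not fire on "casting".
            if any(re.search(r"\b" + re.escape(t) + r"\b", normalized) for t in terms):
                scores[name] += polarity
    return scores

review = "The special effects were stunning. The acting was weak!"
scores = attribute_sentiment(review)
```

For this review the sketch credits the positive word to the visuals attribute and the negative word to the cast attribute, which is the separation that document-level scoring cannot provide.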

Benefits of the NLP approach

The major advantage of rule-based methods is the amount of control they give
rule developers over how the analysis will be performed. Developers can use their
knowledge of the domain and the language within it to develop rules that have high
precision.

Unlike statistical analysis, the results of rule-based analysis are easily interpretable.
This is very important for real-life applications where the analysts need to know
exactly why a document or an attribute within a document was tagged as positive or
negative. In other words, analysts need to know exactly what sentences, keywords
or context within the document triggered the positive or negative sentiment. Figure 7
shows an example of this.

I think they did a fantastic job this movie. I read the book, I loved the book, and I loved the movie!
My only qualm was Javier bardem playing a Brazilian when he is SPANISH! Julia Roberts was
perfect and beautfiul. Wonderful casting job (with the exception of Bardem)! Good acting. Some
parters were a tad confusing for those who haven’t read the book. But I took my mom, who didn’t
read the book, and she really liked it.

It’s not just some sappy chick flick. It’s a powerful journey about finding yourself hen you let
yourself GO!

Empowering.
Perfection. = EAT PRAY LOVE!
Lovely

Figure 7: Example showing different entities that were used for rule-based analysis.

Rule-based methods are completely unsupervised; that is, they do not require any
training data. This is a big advantage in real-life applications where training data is
scarce. The non-availability of training data is more pronounced when it comes to
granular sentiment analysis (sentiment derived at the objects and attributes level).

Another advantage of rule-based methods is their ability to refine the rules over time
based on the feedback from analysts or subject-matter experts. The more time the
rule developer spends on refining the rules, the better the results. Language evolves
over time and people start using newer terms to express their sentiments. This is
especially true for social media, where the language used changes all the time. In
such cases, rule-based methods give you the flexibility needed to adjust your models
accordingly.

Drawback of the NLP approach

The disadvantage of rule-based methods is that they require a lot of human
involvement in developing the rules. These methods completely rely on the domain
knowledge of rule developers. It might take a few weeks to come up with a strong
rule-based model for a new domain. However, once you have a strong rule-based
model for a domain, you can reuse that model with some minor modifications for
different applications within the domain.

The importance of validation data is often underestimated while developing these
models. The rules being written must be generic enough so that they are capable
of handling all possible cases. Inexperienced rule developers tend to over-fit their
rules to the sample data they are working with. Such rules might not work well when
tested on different data sets. So, rule developers must make sure they validate the
rules on different data sets before considering a model ready to deploy.

The Best of Both Worlds

As we discussed earlier, data mining learns relevant patterns from a numerical
representation of the entire collection, and the patterns discovered are derived by
analyzing the collection as a whole. The rule builder, on the other hand, relies only
on personal experience and knowledge to formulate rules that will be useful for
sentiment analysis.

Because they approach the problem so differently, data mining and rule-based
systems can complement one another. They can do this in two ways. First,
unsupervised data mining can be used as a tool for the rule builder; and second, the
supervised data mining model can be combined with the rule-based model in such
a way that the strengths of each model are combined, and any possible mistakes
made by one model can be corrected by the other.

Data Mining of the Text for the Rule Builder

The challenge of the rule builder is to devise and formulate rules that capture the
sentiment contained in the collection. To do this, the rule builder must have some
understanding of the content of the documents that are being categorized. For
instance, in our movie review collection, are all the reviews about a specific movie or
are they about a specific genre of movies? If we know, we can save time by writing
rules that are only directed to a particular movie or genre. On the other hand, if the
reviews are about movies from many different genres, we must consider how that
knowledge affects the rules we write. Otherwise, we might not capture the sentiment
accurately.

For instance, when discussing a horror movie, the statement

“The scariest thing I have ever seen”

is typically an indicator that the reviewer enjoyed the movie. But it could be a negative
indicator if the reviewer was discussing a children’s movie.

Unsupervised text mining allows you to quickly get a handle on the collection you
are examining without spending time reading many individual documents. SAS
Text Miner provides a node both for generating topics within a document and for
clustering the documents. These approaches are useful for understanding the
collection and for revealing significant aspects of the data. Table 1 shows that our
collection is quite varied.

ID  Freq.  Pct.  Descriptive Terms
 1    155    8%  + horror, + killer, + scary, + scream, horror, + reason, last, minutes
 2     73    4%  + animation, adults, animated, disney, voice, children, kids, + feature
 3     37    2%  coen, fargo, money, wife, different, pretty, sequences, guy
 4    267   13%  + war, world, life, love, + sense, + fight, right, + father
 5    213   11%  + comedy, jokes, + funny, funny, fun, script, back, cast
 6    276   14%  earth, effects, special effects, special, star, + action, + people, interesting
 7    177    9%  + action, + fight, sequences, bad, fun, guy, special effects, acting
 8    400   20%  + comedy, mother, + father, woman, funny, love, + family, high
 9    117    6%  performances, mother, performance, love, down, + point, last, different
10    285   14%  + thriller, case, + action, + killer, wife, + job, performance, script

Table 1: Ten clusters from the Pang and Lee data.

The clusters reveal several prominent categories of movies, reminding rule builders
that they need to consider how people express sentiment in the following types of
movies:

• Horror movies.

• Animation and children’s movies.

• Comedies.

• Science fiction movies.

• Action movies.

• Thrillers.

If you, as the rule builder, had not been thinking of how people express their opinions
about movies from these different categories, it could be easy to incorrectly capture
the sentiment contained in them.

Further discovery can be done to capture the sentiment of individual attributes
within the document. For instance, since the SAS Text Miner filter node allows you
to subset documents that contain the visual attribute synonyms displayed in Figure
2, you can subset the collection accordingly. In Figure 8, the search expression has
been set to include only those documents that contain at least one of the visual
attribute synonyms used in the rule building. The special character “*” implies a
wildcard search is to occur, and the quoted input means that only the exact phrase,
“special effects,” should match. The filter node can be followed with a clustering
or topic node, and then any analysis of this subsetted collection provides you with
some potential new ideas for rules.

Figure 8: A search expression to retrieve documents concerned with the visual sentiment
attribute.
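The subsetting step can be imitated in Python as follows. This is a sketch under stated assumptions, not the filter node's query language: the query terms are hypothetical stand-ins for the synonyms in the figure, a trailing "*" is treated as a wildcard on single words, and a term wrapped in double quotes must match as an exact phrase.

```python
import fnmatch
import re

def matches(document, query_terms):
    """True if the document contains at least one of the query terms."""
    text = document.lower()
    words = re.findall(r"[a-z']+", text)
    for term in query_terms:
        if term.startswith('"') and term.endswith('"'):
            phrase = term.strip('"')
            # Quoted input: only the exact phrase should match.
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return True
        elif any(fnmatch.fnmatchcase(word, term) for word in words):
            # Unquoted input: "*" acts as a wildcard within a single word.
            return True
    return False

# Hypothetical query and toy documents.
query = ["visual*", '"special effects"']
docs = [
    "The visuals were breathtaking.",
    "Great special effects, weak story.",
    "The plot dragged and the effects were special to nobody.",
]
subset = [doc for doc in docs if matches(doc, query)]
```

The resulting `subset` is what would then be passed on to clustering or topic discovery.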

This particular subsetted collection revealed discussions around costumes and
costume designs, as well as the reviewer’s reaction to the theater setting. Neither of
these was an aspect of visual sentiment that we had considered prior to discovering
these topics.

At an even finer level, the reports of important terms and phrases (particularly in
relation to one another in the concept-linking diagram) provide sentence-level
ideas for your rule generation. The diagram in Figure 9 was made in the process of
exploring reviewers’ comments on their theater experience. The diagram suggests
that the sentiment regarding the music or sound in the movie might be another
attribute that could be added to the taxonomy and examined.

Figure 9: A concept link diagram of “music” and “loud.”

Hybrid Approaches

Hybrid approaches involve using a rule-based approach and a data mining approach
in combination. In the next sections we will describe two alternative methods. The
first method can be used to supplement the features from the traditional data mining
model by adding features derived from the linguistic rules that are triggered. The
second method shows how to use an ensemble of the results of the two distinct
approaches to improve the prediction.

Polarity scores as additional features

One advantage of SAS Text Miner is that it allows additional features associated with
the document to be combined with the term features or with the SVD dimensions
before training the predictive model. Polarity scores are simply a summary score
based on a function of the number of times the positive and the negative rules trigger
in a document, or in an attribute of a document. These values can be obtained from
SAS Sentiment Analysis.

Once obtained, the logistic function can be applied to the ratio of the weighted
positive and negative counts so that a document’s polarity score will be between 0
and 1, inclusively. A document with more positive sentiment weight will be assigned
a score closer to 1, and a document that tends to have more negative sentiment
scores closer to 0. This score is then used in combination with the SVD dimensions.
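One way to realize such a score is sketched below. The exact function is not given here, so the details are assumptions: the logistic function is applied to the logarithm of the smoothed ratio of positive to negative weight, and the smoothing constant (which avoids division by zero) is hypothetical. With this choice, equal positive and negative weight gives a score of 0.5.

```python
import math

def polarity_score(pos_weight, neg_weight, smoothing=1.0):
    """Map weighted positive/negative rule counts to a score between 0 and 1."""
    log_ratio = math.log((pos_weight + smoothing) / (neg_weight + smoothing))
    return 1.0 / (1.0 + math.exp(-log_ratio))

def add_polarity_feature(svd_dimensions, pos_weight, neg_weight):
    """Append the polarity score to a document's reduced-dimensional vector."""
    return list(svd_dimensions) + [polarity_score(pos_weight, neg_weight)]

mostly_positive = polarity_score(9, 1)
mostly_negative = polarity_score(1, 9)
balanced = polarity_score(3, 3)
features = add_polarity_feature([0.42, -0.17], 9, 1)
```

The last line shows the combination step: the score simply becomes one more input variable alongside the SVD dimensions.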

When the document has several attributes that receive a polarity score, each of
these scores can be added as features to the text mining model. The hybrid model
within SAS Sentiment Analysis software also makes use of this approach.

Stacked models

Another hybrid approach is to stack the models. The rule-based and data mining
models are run separately in the first stage, and a second predictive model is
“stacked” after them so that their outputs (a predicted probability for each
document from each model) become the inputs to the second-stage model.

Stacking is an ensemble method that can improve accuracy if the two first-stage
models differ in their predictions. Stacking allows for the two models to potentially
correct one another where they differ.

In Figure 10, SAS Text Miner is used to build one sentiment model, while the model
import node brings in a model from SAS Sentiment Analysis. The output of the
two models is massaged with SAS code, and then goes into the second stage
regression for a final prediction.

Figure 10: Stacking models.
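
The stacking idea can be sketched outside of SAS in a few lines of Python. This is an illustration only: the first-stage probabilities below are made-up numbers, the second stage is assumed to be a logistic regression fitted by plain gradient descent, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_second_stage(first_stage_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on (p_rules, p_text_mining) pairs.

    Returns (bias, w_rules, w_text_mining), learned by batch gradient descent.
    """
    bias, w = 0.0, [0.0, 0.0]
    n = len(labels)
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0, 0.0]
        for (p_rules, p_text), y in zip(first_stage_probs, labels):
            err = sigmoid(bias + w[0] * p_rules + w[1] * p_text) - y
            grad_b += err
            grad_w[0] += err * p_rules
            grad_w[1] += err * p_text
        bias -= lr * grad_b / n
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
    return bias, w[0], w[1]

def predict_stacked(params, p_rules, p_text):
    """Second-stage probability that a document is positive."""
    bias, w_rules, w_text = params
    return sigmoid(bias + w_rules * p_rules + w_text * p_text)

# Made-up first-stage outputs: (rule-based probability, text mining probability).
train_probs = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.6), (0.6, 0.7),
               (0.2, 0.1), (0.3, 0.2), (0.1, 0.4), (0.4, 0.3)]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

params = fit_second_stage(train_probs, train_labels)
print(predict_stacked(params, p_rules=0.35, p_text=0.85))
```

In practice the second-stage model should be fitted on first-stage predictions for documents that the first-stage models did not see during training, so that it does not simply learn to trust an overfitted model.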

Results

We experimented with the sentiment analysis approaches presented in this paper
using the movie review data sets. The Yahoo movie data set was used to analyze
sentiment at the attribute level, and the Pang and Lee data set was used for the
overall sentiment predictions.

Attribute-Level Results

Table 2 shows the results for the attribute-level sentiment analysis on the Yahoo
movie data. The Yahoo data had explicit user ratings for the different attributes,
and we compared those ratings with the predictions made by the rule-based
model developed with SAS Sentiment Analysis. We spent three days on the rule-
development process. The Yahoo data included some reviews where a user rating
was available for a particular attribute, but the attribute itself was not discussed
in the text of the review. We did not include such reviews in the evaluation of the
attribute. We also did not include the general attribute because no user ratings were
available for it. A user rating of C+ or higher was considered positive, and C- or
lower was considered negative.

Attribute    Num Reviews    Misclass Rate
Story                972              .23
Cast                1272              .14
Direction            243              .17
Visuals              459              .12
Aggregate           2946              .18

Table 2: Attribute-level results.

With just three days of effort on rule development, we were able to achieve an
overall accuracy of 82 percent at the attribute level (an aggregate misclassification
rate of .18). The misclassification rate for the story attribute was higher than for
the other attributes, which indicates that the rule developer should further refine
the rules for that attribute. Rule refinement is an ongoing process, and accuracy
can improve over time.

Overall Results

Table 3 shows the results of our comparisons on the Pang and Lee data. For the
data mining approach, 1,800 random movie reviews were used for training a model,
and 200 reviews were held out to be scored. This process was repeated four times,
and the misclassification rates were averaged. For each run, the same set of 200
reviews was analyzed in SAS Sentiment Analysis so that the comparisons were
made on the same set of data.
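
The evaluation procedure above can be sketched in Python as follows. The paper does not say whether the four holdout sets were disjoint, so independent random splits are assumed here. `repeated_holdout` and the `train_and_score` callback are hypothetical names, and the toy classifier only stands in for a real model.

```python
import random

def repeated_holdout(documents, labels, train_and_score,
                     n_holdout=200, n_repeats=4, seed=0):
    """Average misclassification rate over repeated random train/holdout splits.

    `train_and_score(train_docs, train_labels, test_docs)` is supplied by the
    caller and must return one predicted label per test document.
    """
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    rates = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_holdout], indices[n_holdout:]
        predictions = train_and_score(
            [documents[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [documents[i] for i in test_idx],
        )
        errors = sum(pred != labels[i] for pred, i in zip(predictions, test_idx))
        rates.append(errors / n_holdout)
    return sum(rates) / n_repeats

# Toy data: 2,000 placeholder reviews with alternating labels.
docs = [f"review {i}" for i in range(2000)]
labels = [i % 2 for i in range(2000)]

def always_positive(train_docs, train_labels, test_docs):
    """Trivial stand-in classifier that ignores the training data."""
    return [1] * len(test_docs)

print(repeated_holdout(docs, labels, always_positive))
```

Scoring every approach on the same held-out documents, as the paper does, keeps the comparison between models from being distorted by differences in the test sets.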

    Approach                                             Misclass Rate
1   SAS Text Miner                                                .144
2   SAS Sentiment Analysis Attribute-Level Rules                  .252
3   Add Polarity Scores as Features in SAS Text Miner             .132
4   Blended                                                       .139

Table 3: Overall sentiment misclassification results.

The results obtained with the text mining model were achieved by using a category-
specific weighting and by having enough training data. The SAS Sentiment Analysis
overall sentiment model was derived from the rules for the individual attributes.
Under these conditions, the rule-based model did not perform as well as the SAS
Text Miner model. However, combining the models – by using the polarity scores as
features in the SAS Text Miner model, or by blending the two models – did improve
results.

Other Applications

Importing Models

SAS Sentiment Analysis can build a hybrid model using rules combined with a Naïve
Bayes algorithm. However, to leverage all the predictive analysis advantages of
SAS® Enterprise Miner™ software, the models from SAS Sentiment Analysis must
be imported into SAS Enterprise Miner. This can be done easily by using the SAS
Enterprise Miner model import node. Once the output of SAS Sentiment Analysis
is imported, models can be combined in various ways and then compared with
the model assessment node. Figure 11 shows the receiver operating characteristic
(ROC) plot from the model assessment node after a SAS Sentiment Analysis model
was imported.

Figure 11: ROC chart of SAS Enterprise Miner models with an imported SAS Sentiment
Analysis model (denoted by model import). In this graph, “TM” denotes SAS Text Miner
and “RuleIn” refers to using SAS Sentiment Analysis rules in conjunction with
SAS Text Miner.

Creating Training Data

As discussed earlier, training data that has the “answers” is an essential part of a
text mining approach: it is needed to build a predictive model that can make
accurate sentiment predictions. It is also important for a rule-based system because
it lets you validate how well your rules perform. That feedback tells you whether
you need to add, remove or refine specific rules. Unfortunately, training data is not
always available, and creating it can be expensive and time-consuming.

One approach to creating training data is to use very precise rules that will make a
sentiment classification only on the documents you are most sure about. You accept
that many of the documents will not be assigned a sentiment category in exchange
for a small subset of documents whose assigned sentiment you can trust.

We applied this approach to the movie review data by choosing rules that captured
complete phrases that seemed, in our opinion, to indicate the overall sentiment. For
instance, we included a set of rules that would trigger a positive score for a review
that contained phrases like:

“I thoroughly enjoyed this movie.” or “I totally loved the film.”

When these types of phrases occurred in the document, the polarity was rated
positive. Similarly, corresponding precise rules were added for negative polarity.
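
A rough Python sketch of this kind of high-precision labeling is shown below. The regular expressions are stand-ins for SAS Sentiment Analysis rules and do not use its rule syntax. The negative patterns are invented for illustration, since the paper does not list its negative rules, and `label_if_confident` is a hypothetical helper.

```python
import re

# Positive patterns modeled on the two example phrases quoted above.
POSITIVE_RULES = [
    re.compile(r"\bI (thoroughly|totally) (enjoyed|loved) (this|the) (movie|film)\b",
               re.IGNORECASE),
]
# Hypothetical negative counterparts (not taken from the paper).
NEGATIVE_RULES = [
    re.compile(r"\bI (thoroughly|totally) (hated|disliked) (this|the) (movie|film)\b",
               re.IGNORECASE),
]

def label_if_confident(text):
    """Return 'positive' or 'negative' when only one rule set fires, else None."""
    pos = any(rule.search(text) for rule in POSITIVE_RULES)
    neg = any(rule.search(text) for rule in NEGATIVE_RULES)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # abstain: no rule fired, or the rules conflict

reviews = [
    "I thoroughly enjoyed this movie. The cast was superb.",
    "The plot dragged, but the visuals were nice.",
    "I totally hated the film.",
]
# Keep only the documents the rules are confident about.
training_set = []
for review in reviews:
    label = label_if_confident(review)
    if label is not None:
        training_set.append((review, label))
print(training_set)
```

Abstaining on documents where both rule sets fire trades coverage for precision, which matches the goal of labeling only the documents you are most sure about.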

When we applied this approach to our movie review collection, 103 of the 2,000
documents triggered our rules. (While 103 documents is too few for an effective set
of training data, at the same rate a larger pool of 20,000 reviews would likely have
yielded about 1,000 documents for the training set.) We still confirmed the polarity
by reviewing each of the 103 documents. Since SAS Sentiment Analysis highlights
the rules in context, it was quick work to check that each match was an appropriate
trigger. Based on our manual review, eight of the 103 documents appeared to be
labeled incorrectly, so we corrected their polarity so that our training data would
be free of errors.

Other Capabilities of SAS® Enterprise Miner™

This paper has primarily focused on combining the rule-based capabilities of SAS
Sentiment Analysis with the text mining capabilities of SAS Text Miner, in conjunction
with the predictive models available in SAS Enterprise Miner. There is much more
functionality in SAS Enterprise Miner that can be used to help you understand
the sentiment contained in a collection and to build on the rule models you have
developed. Functionality such as sequences and associations, decision trees,
SOM-Kohonen self-organizing maps, variable clustering, transformations and sampling,
and statistical exploration has been used in various contexts to supplement
textual understanding.

Conclusions

Independently, both the domain knowledge and the data mining approaches to
sentiment analysis have their strengths and weaknesses; but hopefully you will not
be forced to choose between using one or the other for your analysis. In this paper,
we have shown that the two approaches complement one another. So, while the
NLP approach leverages the rule builder’s domain knowledge, text mining can also
be used by that person to improve, clarify or correct how that knowledge relates to
the particular collection being analyzed. Text mining reveals important patterns in the
specific collection that assist the rule builder.

On the other hand, the text mining approach allows you to quickly build a sentiment
classifier with term frequencies alone. But without any semantic or syntactic
indicators, mistakes that would seem elementary to a human can easily occur. We
have shown that these linguistic indicators can be captured by a rule-based system
and then leveraged in the statistical classifier as additional features, or as a blended
model. The end result is a model that is better than either one individually.

References
1 Albright, Russ. Taming Text with the SVD. January 2004. SAS: Cary, NC. Web:
http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf.

2 Pang et al. “Thumbs Up? Sentiment Classification Using Machine Learning
Techniques.” Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP). 2002. 79-86.

The authors thank James Cox and Janardhana Punuru from the SAS Text
Analytics Research and Development team for their helpful comments
and suggestions. They also thank Fiona McNeill from SAS Marketing for
encouraging them to work on this paper and providing valuable feedback.

SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit: www.sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA
and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Copyright © 2011, SAS Institute Inc. All rights reserved. 105008_S59083.0211
