WHITE PAPER




Combining Knowledge and Data Mining
to Understand Sentiment – A Practical
Assessment of Approaches
COMBINING KNOWLEDGE AND DATA MINING TO UNDERSTAND SENTIMENT




Table of Contents

Abstract
Introduction
The Elements of Sentiment Analysis
   What Is Sentiment Analysis?
   When Is It Relevant?
   Elements of Sentiment Analysis
Sentiment Analysis Methods
   The Data
   Data Mining Approach
        Benefits of the data mining approach
        Drawback of the data mining approach
   Natural Language Processing Approach
        Step one: taxonomy identification
        Step two: defining objects and attributes
        Step three: defining polarity
        Benefits of the NLP approach
        Drawback of the NLP approach
The Best of Both Worlds
   Data Mining of the Text for the Rule Builder
   Hybrid Approaches
        Polarity scores as additional features
        Stacked models
Results
   Attribute-Level Results
   Overall Results
Other Applications
   Importing Models
   Creating Training Data
   Other Capabilities of SAS® Enterprise Miner™
Conclusions
References








     Russell Albright is a Research Statistician Developer at SAS and has been
     working on SAS® Text Miner algorithms since its initial release more than 10
     years ago. He holds a master’s and a doctorate in applied math from Clemson
     University. Albright has expertise in numerical matrix methods and Bayesian
     networks, and he has experience applying text mining to many Web-based
     sources, including Twitter, Yahoo and PubMed.

     Praveen Lakkaraju is a Software Developer at SAS and is a member of the
     SAS Text Analytics research and development team. His areas of experience
     include sentiment analysis, information retrieval and content categorization.
     He was instrumental in the launch of the SAS Social Media Analytics solution,
     and is still actively involved in its development. Lakkaraju holds a master’s in
     computer science from the University of Kansas, where he specialized in the
     field of natural language processing.








Abstract

An important application of text analytics is to automatically characterize the
sentiment of documents in a variety of domains, whether it is positive, negative
or neither. In this paper we explore the benefits of combining domain-specific
linguistic rules with data mining methods to improve both the effectiveness of
your models and the efficiency of the model builder.




Introduction

Our world has changed drastically in the last 10 years. An individual’s opinions
are no longer shared only with his or her immediate family and friends, but
instead are capable of influencing the decisions of thousands or even millions of
people the individual has never even met. The Internet has given the individual a
platform to broadcast grievances and recommendations that can reach across
the world. And the existence of social networks gives these opinions the potential
to snowball into a viral frenzy that can make your company’s products or services
a worldwide boon or a global catastrophe in just a matter of days.

The savvy marketer monitors and evaluates relevant Web content continually to
understand consumer sentiment toward products or services from his company
– and toward his competitors. This attention to Web content allows the company
to respond quickly to customer opinion.

The sheer volume of references related to your company’s products or services
makes automating this task essential. Sources such as blogs, product reviews,
forums and news articles can all be monitored, scored for relevance against your
topics of interest, and then classified according to sentiment.

The Elements of Sentiment Analysis


What Is Sentiment Analysis?

Sentiment analysis is an automatic method that provides feedback to you
regarding the opinions and attitudes of your customers. The analysis is based
on customers’ electronic written commentaries regarding your products and
services and those of your competitors. The feedback can be provided at a
very high level with drill-down so that you can explore how opinions differ within
groups, subgroups and even at the individual level.








More precisely, sentiment analysis is the process of classifying or rating the opinions
or sentiment expressed in a document. The rating may assign the sentiment into
one of three categories: positive, negative or neutral; or it may, instead, assign a
numeric score. The rating that is assigned is termed polarity. The sentiment may be
assessed for the entire document or for particular objects or attributes mentioned in
the document.
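
The relationship between a numeric score and the three polarity categories can be sketched in a few lines of Python. This is only an illustration: the function name `polarity_label`, the assumed score range of -1 to 1, and the neutral band of 0.1 are all hypothetical choices, not values taken from this paper or from SAS software.

```python
def polarity_label(score: float, threshold: float = 0.1) -> str:
    """Map a numeric sentiment score to one of three polarity categories.

    The threshold that defines the neutral band is an assumed value.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores for three documents.
labels = [polarity_label(s) for s in (0.8, -0.4, 0.05)]
print(labels)
```

In practice the width of the neutral band would be tuned against documents that have already been rated by hand.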



When Is It Relevant?

Sentiment analysis is relevant in almost every context in which your customers or
potential customers express themselves in written form – and possibly spoken form –
via different communication channels. These comments may not have been intended
for direct consumption by your company. They may have been posted in website
forums, tweets, blogs or other Web pages and directed toward your potential
customers. On the other hand, some content may have been intentionally directed at
your company through e-mail, a company support website, a survey questionnaire, a
call center desk, etc.

Automated sentiment analysis is important to implement when you are inundated
with relevant, useful feedback through these channels. For many companies, it
is impossible for individuals to monitor and understand all that is communicated
in these sources due to their sheer volume. The information comes too quickly
and from too many channels. Sentiment analysis provides you with an immediate
interpretation, not just of every individual comment but also of the global opinions
expressed.


Elements of Sentiment Analysis

You cannot implement a comprehensive sentiment analysis solution with a process
that merely analyzes the sentiment of a document. Instead, you must coordinate
several tasks to maximize the benefits.

1.	 Data acquisition phase. This phase involves setting up an automated process to
    obtain a clean set of documents to analyze. You can use SAS software to obtain
    the documents from the Internet and from local file systems or databases. SAS
    software can also be used to filter the documents by eliminating any “noise” that
    is common to Web documents (e.g., filtering spam).

2.	 Sentiment assignment phase. This phase involves creating a model that can
    calculate the polarity of the author’s sentiment or opinion toward your topics of
    interest and apply that model to naïve documents. SAS technologies can help you
    derive accurate assessments of sentiment.

3.	 Summarization and reporting phase. Identifying sentiment within a particular
    document is interesting in itself, but frequently it will be of more interest to
    characterize representative populations within your collection. SAS provides
    techniques for such exploration, which entails answering questions such as:








   •	 Does the age of our customer tend to make a difference in his or her opinion
      about our service?

   •	 How do the cumulative opinions about our competitor’s product compare with
      the cumulative opinions about our product?

   •	 Did our customers perceive the changes we made to our outlet stores as
      beneficial, or not?

4.	 Repetition phase. The final step in your sentiment analysis project will be to set
    up a process to automate the entire analysis on a repeated basis. This allows you
    to monitor sentiment changes, identify important influencers and respond quickly
    to what you learn.

For this paper we will focus primarily on the sentiment assignment phase. Note
that because text is written in natural language rather than in a precise quantitative
representation, analyzing it effectively for sentiment poses many challenges.

For one, natural language text is full of ambiguities, implicit meaning and subtle
nuances. Normally a human reader has the necessary experience to both
understand natural language expressions and to comprehend the meaning of the
subject area along with the sentiment the author intended to communicate. But
automating this process in a computer can be challenging. Such things as slang,
pronoun resolution, sarcasm and idioms all make a direct interpretation of the text
difficult.

Further, an automatic process will not function at the semantic level of the text at all
unless there is a direct mapping of a linguistic rule to semantics. In many instances
this can be captured with the rules we will discuss later; but the diversity of ways to
express the same meaning can make it difficult to accurately capture all situations
with a set of rules.

There are two primary approaches to building models for sentiment analysis. The
first, natural language processing, uses a domain expert to build a set of linguistic
rules to determine the sentiment polarity of the document’s content. The second,
machine learning, uses training data (documents that have the sentiment polarity
already assigned to them) to build a predictive model. Predictive models such as
decision trees, logistic regressions or neural networks will make this prediction on
documents that are outside the training set.




Sentiment Analysis Methods


The Data

We will use two collections of movie review data to demonstrate the techniques
presented in this paper. The first collection, created by Pang and Lee, contains 2,000
movie reviews. The collection is split evenly with 1,000 positive and 1,000 negative
reviews.¹ The second collection was obtained by retrieving 6,631 movie reviews
from Yahoo.² This collection has both overall ratings for the movie being discussed
and also ratings for several attributes of each movie, including the story line, cast,
direction and visuals.

Although your data is almost certainly not movie review data, the concepts and
techniques demonstrated using this movie data are applicable to most other
sentiment-related text data sets.



Data Mining Approach

A data mining approach to sentiment analysis translates an unstructured text
problem to one that makes predictions on structured, quantitative data. The
approach borrows several techniques from computational linguistics and information
retrieval communities to represent the text numerically, and then applies traditional
data mining techniques to this numeric representation. In the end, a target variable is
identified and a pattern is discovered from the training data for predicting sentiment
polarity. This pattern can then be used to predict new observations.

The first step in creating the numeric representation is to convert the entire training
collection into a document-by-term frequency matrix. Each document is parsed into
individual terms, or term/part-of-speech pairs. Then the set of all terms becomes
the variables on the data set so that documents are now represented as vectors of
length equal to the number of distinct terms in the collection. These vectors are very
sparse, containing mostly zeroes – because any one document contains a very small
percentage of the terms in the collection. Once the documents are represented as
vectors, the frequencies in each cell can be weighted with a function that takes into
account the distribution of the term across the collection and relative to the levels of
the target variable.
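
As a rough illustration of this first step, the following Python sketch builds a small document-by-term frequency matrix and then weights it. The three example documents are invented, whitespace tokenization stands in for real parsing, and inverse document frequency is used as a simple stand-in for the weighting function; the weighting described above also accounts for the target variable, which this sketch does not attempt.

```python
import math
from collections import Counter

# Invented example documents (not from the movie review collections).
docs = [
    "a great movie with a great cast",
    "a dull movie",
    "great visuals but a dull story",
]

# Parse each document into terms; whitespace splitting is a simplification.
tokenized = [d.split() for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})

# Document-by-term frequency matrix: one row per document, one column per term.
counts = [Counter(toks) for toks in tokenized]
freq = [[c[t] for t in vocab] for c in counts]

# Inverse document frequency: a term found in every document gets weight 0.
n_docs = len(docs)
doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}

# Weighted matrix: raw frequency times the term's collection-level weight.
weighted = [[row[j] * idf[t] for j, t in enumerate(vocab)] for row in freq]

for doc, row in zip(docs, freq):
    print(doc, "->", row)
```

Even in this tiny collection most cells are zero, which hints at how sparse the vectors become when the vocabulary holds many thousands of terms.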

After these document vectors are formed, a dimension reduction technique – such
as the singular value decomposition (see Taming Text with the SVD, Albright, 2004)
– is typically used to represent each document in a reduced-dimensional space
of maybe 50 to 100 variables, where each variable is a linear combination of the
weighted terms that originally represented each document.
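
To make the idea of a linear combination of weighted terms concrete, here is a minimal pure-Python sketch that estimates only the leading singular direction of a small matrix by power iteration and projects each document onto it. It is a teaching aid under stated assumptions, not the algorithm used in SAS Text Miner: the matrix values and the function names `top_singular` and `project` are hypothetical, only one dimension is extracted rather than 50 to 100, and an all-zero matrix is not handled.

```python
import math


def top_singular(matrix, iterations=200):
    """Estimate the largest singular value and its right singular vector
    by power iteration on A^T A (assumes the matrix is not all zeros)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T u
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The singular value is the length of A v for the converged unit vector v.
    sigma = math.sqrt(sum(sum(row[j] * v[j] for j in range(n_cols)) ** 2 for row in matrix))
    return sigma, v


def project(matrix, v):
    """Represent each document (row) by its coordinate along direction v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]


# Hypothetical weighted document-by-term matrix: 3 documents, 3 terms.
weighted = [
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
]
sigma, v = top_singular(weighted)
reduced = project(weighted, v)
print(sigma, reduced)
```

A full decomposition would repeat this for further directions; in practice a numerical library routine would be used instead of hand-written iteration.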

Finally, these reduced-dimensional vectors, together with the sentiment variable, can
be supplied to a predictive model. The model will attempt to learn from the training
data by utilizing patterns in the reduced-dimensional vector. This predictive model will
then create a function that will predict the sentiment for any document.
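
The final step can be sketched with a small logistic regression trained by gradient descent. Everything here is assumed for illustration: the two-dimensional vectors stand in for reduced-dimensional document vectors, the labels use 1 for positive and 0 for negative, and the learning rate and number of epochs are arbitrary. A production model would come from a data mining tool rather than hand-written code.

```python
import math


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights (bias stored last) by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of positive
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, x):
    """Return 1 (positive) or 0 (negative) for a reduced-dimensional vector."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return 1 if z > 0 else 0


# Hypothetical reduced-dimensional training vectors with known polarity.
X = [[0.9, 0.1], [0.8, -0.2], [-0.7, 0.3], [-0.9, -0.1]]
y = [1, 1, 0, 0]

w = train_logistic(X, y)
print(predict(w, [0.85, 0.0]), predict(w, [-0.8, 0.0]))
```

The learned weights define the function that scores any new document once it has been mapped into the same reduced space.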




¹ The Pang and Lee movie review data is available at: http://www.cs.cornell.edu/People/pabo/movie-review-data
² Yahoo movie reviews were obtained from: http://movies.yahoo.com





Benefits of the data mining approach

The data mining approach is appealing because it is based on learning patterns that
are useful for making automated, efficient predictions. The algorithms are capable
of discovering unimagined and complicated patterns that would be beyond what a
human could anticipate. Frequently, a data mining approach can beat a rule-based
approach in topic classification. Of course, this is dependent on having enough
training data to build the model.



Drawback of the data mining approach

The vector-based representation of a document, which is required for data mining
techniques, does not maintain information that is potentially important to sentiment
classification. For example, the vector representation does not capture whether terms
are close to one another in the document, whether one term precedes another, or any
other contextual cues. The order of terms in a phrase can significantly affect meaning.
Consider the phrases:

   “… night for a great movie”

   and

   “… great night for a movie”

These two phrases convey two different meanings; yet in a vector representation, the
phrases have an identical representation.
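
This loss of order is easy to verify with a short Python sketch. The term counts for the two phrases compare as equal, while a representation that keeps adjacent word pairs tells them apart. The `bigrams` helper is a hypothetical illustration of what order information looks like, not a technique proposed in this paper.

```python
from collections import Counter

phrase_a = "night for a great movie"
phrase_b = "great night for a movie"

# Bag-of-words view: only term counts survive, so word order is lost.
bag_a = Counter(phrase_a.split())
bag_b = Counter(phrase_b.split())
print(bag_a == bag_b)


def bigrams(text):
    """Return adjacent word pairs, which retain local word order."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))


# The pair ("great", "movie") appears only in the first phrase.
print(bigrams(phrase_a) == bigrams(phrase_b))
```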

In addition, most predictive models provide little feedback to the user as to precisely
why a particular document was classified as having positive or negative polarity. So
when you attempt to understand what positive things people said in a particular
document, you frequently have to read the entire document to discover the answer.

As a final drawback, forming the training and validation data is an essential component
of learning a predictive model, but it can be very time-consuming and challenging.
A rating needs to be provided for every document, and if there are attributes of
documents that you wish to use to measure sentiment, you will need to provide a
rating for each of these as well. Another complication is that two different reviewers
frequently assign two different sentiment ratings to the same document. This can
introduce unexpected errors in building and measuring the performance of your
model.



Natural Language Processing Approach

Natural language processing (NLP) is a field of artificial intelligence that deals with
automatically extracting meaning from natural language text. As discussed in the
introduction of this paper, it’s very challenging to get machines to understand text at
the same levels as humans. Doing this with the specific goal of extracting sentiment
is even more challenging. For example, consider the text snippet below:




“… with that out of the way, let me say this – this film is bad. This film is really, really
bad. Yet somehow, it is strangely enjoyable. …”

If interpreted by a human, the above text would imply a positive sentiment from the
author toward the movie. However, it can be very challenging to get the same output
from a computer because of the dense presence of the strongly negative words.

The rule-based NLP methods use certain entities and syntactic patterns in the text
to understand its meaning. SAS Sentiment Analysis provides all the tools needed
for this kind of disambiguation. You can use a combination of language dictionaries,
linguistic constructs like parts of speech, and noun phrases along with a range of
operators.

The operators fall into a few different categories as shown below:

•	 Boolean operators. Used to include or exclude different entities (e.g., AND, OR,
   NOT).

•	 Frequency operators. Used to measure the specified number of occurrences of
   certain entities (e.g., MIN, MINOC, MAXOC).

•	 Context operators. Used to measure the context within which certain entities
   occur in the text (e.g., DIST, START, END, SENT, PARA).

•	 Sequence operators. Used to look for the entities in a specific sequence (e.g.,
   ORD, ORDDIST).
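
The operator categories above can be approximated in plain Python to show the kind of test each one performs. These functions are hypothetical analogues written for this illustration; they are not SAS Sentiment Analysis syntax, and the real operators may behave differently (for example, in how distance is counted).

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def min_occurrences(text, term, n):
    """Frequency-style check: the term appears at least n times."""
    return tokens(text).count(term) >= n


def within_distance(text, a, b, max_gap):
    """Context-style check: a and b occur within max_gap tokens of each other."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def in_order(text, a, b):
    """Sequence-style check: some occurrence of a precedes some occurrence of b."""
    toks = tokens(text)
    if a not in toks or b not in toks:
        return False
    last_b = len(toks) - 1 - toks[::-1].index(b)
    return toks.index(a) < last_b


text = "This film is really, really bad. Yet somehow, it is strangely enjoyable."
toks = tokens(text)

# Boolean-style check: both entities are present.
print("bad" in toks and "enjoyable" in toks)
print(min_occurrences(text, "really", 2))
print(within_distance(text, "film", "bad", 4))
print(in_order(text, "bad", "enjoyable"))
```

Combining such checks is what lets a rule recognize that "bad" followed later by "enjoyable" may signal a different sentiment than "bad" alone.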

The process of developing rule-based models for sentiment analysis involves a few
different steps. These are explained below.



Step one: taxonomy identification

The initial step in the NLP approach is taxonomy identification. Taxonomy here
refers to a simple, two-level hierarchy where you specify the different objects and
attributes for which you want to extract sentiment. You can either use a predefined
taxonomy or you can use text mining to learn the most prominent objects and their
attributes in the corpus and then make them part of your taxonomy. Figure 1 shows
the predefined taxonomy that we used for extracting sentiment from the movie review
data. The discovery-based text mining methods are discussed later in this paper.








Figure 1: Taxonomy for movie reviews.



Step two: defining objects and attributes

The next step is to define the objects and their attributes. A basic approach to
defining these is to identify their synonyms or the different ways they may be referred
to in the text. Figure 2 shows an example.




Figure 2: Example of defining the visuals attribute.


While this approach captures many cases, in other situations the attribute might be
referred to using its co-referent. Consider the example below:

“The movie starred Jennifer Aniston. The plot of the movie was very interesting.
Aniston’s performance was commendable. She looks adorable.”








Here the name of the actress was mentioned only in the first sentence. In the
subsequent sentences, the actress was referred to using her last name and
a pronoun. These three entities are said to be co-referent and the process of
identifying them is called co-reference resolution. The rule-based methods allow you
to write rules to handle such cases.


Step three: defining polarity

Polarity is determined by associating predefined positive or negative terms or
expressions with the attributes that have been identified. Dictionaries of subjective
expressions are available and can be customized to specific domains (see Figure 3).




Figure 3: Example of a generic dictionary of positive keywords.

You could also define multiple classes of subjective expressions to denote different
levels of subjectivity.

“incredible,” “stunning” ➔ strong positive
“hate,” “disgust” ➔ strong negative

Assigning the appropriate polarity requires that negations are handled properly. To do
this, you can use a combination of part-of-speech tags and dictionaries as shown in
Figures 4 and 5.








Figure 4: Example of a class of negated adjectives.


In Figure 4, “NegClass” is a dictionary of expressions that denote a negation (for
example, “not,” “will not” and “have not”), while “:Adv,” “:A” and “:V” represent any
adverb, adjective and verb, respectively.
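
The effect of a negation class can be imitated with a short Python sketch that flips the polarity of a term when a negation word appears shortly before it. The word lists, the two-token window and the function name are assumptions made for this example. Unlike the rules in the figures, it uses no part-of-speech information and handles only single-word negations.

```python
# Hypothetical, deliberately tiny dictionaries.
NEGATIONS = {"not", "never", "no"}
POSITIVE = {"good", "great", "interesting"}
NEGATIVE = {"bad", "dull", "boring"}


def polarity_with_negation(sentence, window=2):
    """Score +1 or -1 per polar term, flipping the sign when a negation
    appears within `window` tokens before the term."""
    toks = sentence.lower().replace(",", " ").replace(".", " ").split()
    score = 0
    for i, tok in enumerate(toks):
        if tok in POSITIVE:
            value = 1
        elif tok in NEGATIVE:
            value = -1
        else:
            continue
        preceding = toks[max(0, i - window):i]
        if any(p in NEGATIONS for p in preceding):
            value = -value
        score += value
    return score


print(polarity_with_negation("The plot was good."))
print(polarity_with_negation("The plot was not very good."))
print(polarity_with_negation("It was not bad."))
```

The window lets an adverb such as "very" sit between the negation and the adjective, which mirrors the role of the adverb slot in a negated-adjective pattern.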




Figure 5: Example of a negation rule.


Finally, to extract the sentiment at attribute level, you can write context-based rules
as shown in Figure 6, where we used a combination of operators.








Figure 6: Example of an attribute-level sentiment rule.
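
The idea behind a context-based attribute rule can be approximated in Python by requiring that an attribute synonym and a polar term occur in the same sentence. The attribute synonyms and polarity dictionaries below are invented for illustration and are far smaller than anything you would use in practice; sentence co-occurrence is only one possible context constraint.

```python
import re

# Hypothetical taxonomy attributes with a few synonyms each.
ATTRIBUTES = {
    "visuals": {"visuals", "cinematography", "effects"},
    "cast": {"cast", "acting", "performance"},
}
POSITIVE = {"stunning", "commendable", "great"}
NEGATIVE = {"weak", "dull", "bad"}


def attribute_sentiment(review):
    """Credit a sentence's polarity to every attribute mentioned in that sentence."""
    scores = {name: 0 for name in ATTRIBUTES}
    for sentence in re.split(r"[.!?]+", review.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        polarity = len(words & POSITIVE) - len(words & NEGATIVE)
        for name, synonyms in ATTRIBUTES.items():
            if words & synonyms:
                scores[name] += polarity
    return scores


review = "The visuals were stunning. The acting was weak and dull."
print(attribute_sentiment(review))
```

Because each score is tied to the sentence that produced it, you can trace exactly which text triggered a positive or negative rating for an attribute.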



Benefits of the NLP approach

The major advantage of rule-based methods is the amount of control they give
rule developers over how the analysis will be performed. Developers can use their
knowledge of the domain and the language within it to develop rules that have high
precision.

Unlike statistical analysis, the results of rule-based analysis are easily interpretable.
This is very important for real-life applications where the analysts need to know
exactly why a document or an attribute within a document was tagged as positive or
negative. In other words, analysts need to know exactly what sentences, keywords
or context within the document triggered the positive or negative sentiment. Figure 7
shows an example of this.

 I think they did a fantastic job this movie. I read the book, I loved the book, and I loved the movie!
 My only qualm was Javier bardem playing a Brazilian when he is SPANISH! Julia Roberts was
 perfect and beautfiul. Wonderful casting job (with the exception of Bardem)! Good acting. Some
 parters were a tad confusing for those who haven’t read the book. But I took my mom, who didn’t
 read the book, and she really liked it. br/
 br/
 It’s not just some sappy chick flick. It’s a powerful journey about finding yourself hen you let
 yourself GO!br/
 br/
 Empowering.br/
 Perfection. = EAT PRAY LOVE!br/
 Lovely

Figure 7: Example showing different entities that were used for rule-based analysis.


Rule-based methods are completely unsupervised; that is, they do not require any
training data. This is a big advantage in real-life applications where training data is
scarce. The non-availability of training data is more pronounced when it comes to
granular sentiment analysis (sentiment derived at the objects and attributes level).





Another advantage of rule-based methods is that the rules can be refined over time
based on the feedback from analysts or subject-matter experts. The more time the
rule developer spends on refining the rules, the better the results. Language evolves
over time and people start using newer terms to express their sentiments. This is
especially true for social media, where the language used changes all the time. In
such cases, rule-based methods give you the flexibility needed to adjust your models
accordingly.



Drawback of the NLP approach

The disadvantage of rule-based methods is that they require a lot of human
involvement in developing the rules. These methods completely rely on the domain
knowledge of rule developers. It might take a few weeks to come up with a strong
rule-based model for a new domain. However, once you have a strong rule-based
model for a domain, you can reuse that model with some minor modifications for
different applications within the domain.

The importance of validation data is often underestimated while developing these
models. The rules being written must be generic enough so that they are capable
of handling all possible cases. Inexperienced rule developers tend to over-fit their
rules to the sample data they are working with. Such rules might not work well when
tested on different data sets. So, rule developers must make sure they validate the
rules on different data sets before considering a model ready to deploy.




The Best of Both Worlds

As we discussed earlier, data mining learns relevant patterns from a numerical
representation of the entire collection, and the patterns discovered are derived by
analyzing the collection as a whole. The rule builder, on the other hand, relies only
on personal experience and knowledge to formulate rules that will be useful for
sentiment analysis.

Because they approach the problem so differently, data mining and rule-based
systems can complement one another. They can do this in two ways. First,
unsupervised data mining can be used as a tool for the rule builder; and second, the
supervised data mining model can be combined with the rule-based model in such
a way that the strengths of each model are combined, and any possible mistakes
made by one model can be corrected by the other.



Data Mining of the Text for the Rule Builder

The challenge of the rule builder is to devise and formulate rules that capture the
sentiment contained in the collection. To do this, the rule builder must have some
understanding of the content of the documents that are being categorized. For
instance, in our movie review collection, are all the reviews about a specific movie or
are they about a specific genre of movies? If we know, we can save time by writing
rules that are only directed to a particular movie or genre. On the other hand, if the
reviews are about movies from many different genres, we must consider how that
knowledge affects the rules we write. Otherwise, we might not capture the sentiment
accurately.

For instance, when discussing a horror movie, the statement
     “The scariest thing I have ever seen”

is typically an indicator that the reviewer enjoyed the movie. But it could be a negative
indicator if the reviewer was discussing a children’s movie.
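
One way to picture a genre-aware rule is a polarity dictionary that is selected by genre before a statement is scored. The sketch below is hypothetical: the genre names, the dictionary entries and the function `genre_score` are assumptions for illustration, not part of any SAS product.

```python
# Hypothetical genre-specific polarity dictionaries.
GENRE_POLARITY = {
    "horror": {"scariest": 1, "scary": 1},
    "children": {"scariest": -1, "scary": -1},
}


def genre_score(text, genre):
    """Sum term polarities using the dictionary for the given genre.

    Unknown genres fall back to an empty dictionary and score 0.
    """
    lexicon = GENRE_POLARITY.get(genre, {})
    return sum(lexicon.get(word, 0) for word in text.lower().split())


statement = "The scariest thing I have ever seen"
print(genre_score(statement, "horror"))
print(genre_score(statement, "children"))
```

The same words yield opposite scores, which is why knowing the mix of genres in a collection matters before rules are written.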

Unsupervised text mining allows you to quickly get a handle on the collection you
are examining without spending time reading many individual documents. SAS
Text Miner provides a node both for generating topics within a document and for
clustering the documents. These approaches are useful for understanding the
collection and for revealing significant aspects of the data. Table 1 shows that our
collection is quite varied.


 ID   Descriptive Terms                                                                 Freq.   Pct.
 1    + horror, + killer, + scary, + scream, horror, + reason, last, minutes            155     8%
 2    + animation, adults, animated, disney, voice, children, kids, + feature           73      4%
 3    coen, fargo, money, wife, different, pretty, sequences, guy                       37      2%
 4    + war, world, life, love, + sense, + fight, right, + father                       267     13%
 5    + comedy, jokes, + funny, funny, fun, script, back, cast                          213     11%
 6    earth, effects, special effects, special, star, + action, + people, interesting   276     14%
 7    + action, + fight, sequences, bad, fun, guy, special effects, acting              177     9%
 8    + comedy, mother, + father, woman, funny, love, + family, high                    400     20%
 9    performances, mother, performance, love, down, + point, last, different           117     6%
 10   + thriller, case, + action, + killer, wife, + job, performance, script            285     14%

Table 1: Ten clusters from the Pang and Lee data.


The clusters reveal several prominent categories of movies, reminding rule builders
that they need to consider how people express sentiment in the following types of
movies:

•	 Horror movies.

•	 Animation and children’s movies.






•	 Comedies.

•	 Science fiction movies.

•	 Action movies.

•	 Thrillers.

If you, as the rule builder, had not been thinking of how people express their opinions
about movies from these different categories, it could be easy to incorrectly capture
the sentiment contained in them.

Further discovery can be done to capture the sentiment of individual attributes
within the document. For instance, since the SAS Text Miner filter node allows you
to subset documents that contain the visual attribute synonyms displayed in Figure
2, you can subset the collection accordingly. In Figure 8, the search expression has
been set to include only those documents that contain at least one of the visual
attribute synonyms used in the rule building. The special character “*” implies a
wildcard search is to occur, and the quoted input means that only the exact phrase,
“special effects,” should match. The filter node can be followed with a clustering
or topic node, and then any analysis of this subsetted collection provides you with
some potential new ideas for rules.




Figure 8: A search expression to retrieve documents concerned with the visual sentiment
attribute.
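The kind of subsetting described above can be approximated outside SAS Text Miner. The following is a minimal Python sketch, not the filter node's actual syntax or behavior: it assumes a trailing “*” means "any word starting with this prefix" and that a double-quoted term must match as an exact phrase. The function names and the search terms in the usage note are hypothetical, since the synonym list from Figure 2 is not reproduced here.

```python
import re


def compile_query(terms):
    """Build one case-insensitive pattern from a list of search terms.

    Assumed conventions (illustrative, not SAS syntax):
      - a term wrapped in double quotes matches only that exact phrase
      - a term ending in '*' matches any word that starts with the prefix
      - any other term matches as a whole word
    """
    parts = []
    for term in terms:
        if len(term) > 1 and term.startswith('"') and term.endswith('"'):
            parts.append(r"\b" + re.escape(term[1:-1]) + r"\b")
        elif term.endswith("*"):
            parts.append(r"\b" + re.escape(term[:-1]) + r"\w*")
        else:
            parts.append(r"\b" + re.escape(term) + r"\b")
    return re.compile("|".join(parts), re.IGNORECASE)


def subset(documents, terms):
    """Keep only the documents that contain at least one of the terms."""
    query = compile_query(terms)
    return [doc for doc in documents if query.search(doc)]
```

For example, `subset(reviews, ['visual*', '"special effects"'])` would keep a review that mentions "visually" or the phrase "special effects," but not one that merely contains the words "special" and "effects" in separate places. The resulting subset can then be passed to whatever clustering or topic step you use.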


This particular subsetted collection revealed discussions around costumes and
costume designs, as well as the reviewer’s reaction to the theater setting. Neither of
these was an aspect of visual sentiment that we had considered prior to discovering
these topics.

At an even finer level, the reports of important terms and phrases (particularly in
relation to one another in the concept-linking diagram) provide sentence-level
ideas for your rule generation. The diagram in Figure 9 was made in the process of
exploring reviewers’ comments on their theater experience. The diagram suggests
that the sentiment regarding the music or sound in the movie might be another
attribute that could be added to the taxonomy and examined.








Figure 9: A concept link diagram of “music” and “loud.”
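As a rough illustration of the idea behind such a diagram, the Python sketch below counts which words appear in the same sentence as a target term. This is a simple co-occurrence count and is not the concept-linking algorithm used by SAS Text Miner; the function name, the sentence splitting and the absence of stop-word filtering are all simplifying assumptions.

```python
import re
from collections import Counter


def linked_terms(documents, target, top_n=5):
    """Return the words that most often share a sentence with `target`."""
    counts = Counter()
    for doc in documents:
        # Naive sentence split on terminal punctuation (an assumption).
        for sentence in re.split(r"[.!?]+", doc.lower()):
            words = set(re.findall(r"[a-z']+", sentence))
            if target in words:
                counts.update(words - {target})
    return counts.most_common(top_n)
```

Calling `linked_terms(reviews, "music")` on the subsetted collection gives a quick list of candidate terms to inspect. In practice you would also want to drop common function words before reading the list.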



Hybrid Approaches
Hybrid approaches involve using a rule-based approach and a data mining approach
in combination. In the next sections we will describe two alternative methods. The
first method can be used to supplement the features from the traditional data mining
model by adding features derived from the linguistic rules that are triggered. The
second method shows how to use an ensemble of the results of the two distinct
approaches to improve the prediction.



Polarity scores as additional features

One advantage of SAS Text Miner is that it allows additional features associated with
the document to be combined with the term features or with the SVD dimensions
before training the predictive model. A polarity score is simply a summary score,
computed as a function of the number of times the positive and the negative rules
trigger in a document, or in an attribute of a document. These values can be obtained
from SAS Sentiment Analysis.








Once obtained, the logistic function can be applied to the ratio of the weighted
positive and negative counts so that a document’s polarity score will be between 0
and 1, inclusive. A document with more positive sentiment weight will be assigned
a score closer to 1, and a document with more negative sentiment weight will be
assigned a score closer to 0. This score is then used in combination with the SVD
dimensions.

When the document has several attributes that receive a polarity score, each of
these scores can be added as features to the text mining model. The hybrid model
within SAS Sentiment Analysis software also makes use of this approach.
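The exact scoring formula is not given in this paper, so the Python sketch below is only one plausible reading. It assumes the logistic function is applied to the logarithm of the positive-to-negative weight ratio, which places a balanced document at 0.5 and simplifies to positive / (positive + negative). The function names, the handling of zero counts and the feature layout are illustrative assumptions, not the SAS implementation.

```python
import math


def polarity_score(pos_weight, neg_weight):
    """Map weighted positive and negative rule counts to a score in [0, 1]."""
    if pos_weight < 0 or neg_weight < 0:
        raise ValueError("weights must be non-negative")
    if pos_weight == 0 and neg_weight == 0:
        return 0.5  # assumption: no rule fired, so treat the text as neutral
    if neg_weight == 0:
        return 1.0
    if pos_weight == 0:
        return 0.0
    # Assumption: logistic applied to the log of the ratio.
    log_ratio = math.log(pos_weight / neg_weight)
    return 1.0 / (1.0 + math.exp(-log_ratio))


def augment_features(svd_dims, attribute_weights):
    """Append one polarity score per attribute to a document's SVD dimensions.

    `attribute_weights` maps an attribute name to (pos_weight, neg_weight).
    Attributes are appended in sorted name order so every document gets the
    same feature layout.
    """
    scores = [polarity_score(*attribute_weights[name])
              for name in sorted(attribute_weights)]
    return list(svd_dims) + scores
```

A document whose positive rules carry three times the weight of its negative rules scores 0.75 under this reading. The augmented vector is what would be handed to the predictive model in place of the SVD dimensions alone.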



Stacked models

Another hybrid approach is to stack the models. This means that the rule-based and
the data mining models are run separately in the first stage; but a second, predictive
model is “stacked” after these two models so that the output of the two (a predictive
probability for each document from each model) becomes the input into a second-
stage model.

Stacking is an ensemble method that can improve accuracy if the two first-stage
models differ in their predictions. Stacking allows for the two models to potentially
correct one another where they differ.

In Figure 10, SAS Text Miner is used to build one sentiment model, while the model
import node brings in a model from SAS Sentiment Analysis. The output of the
two models is massaged with SAS code, and then goes into the second stage
regression for a final prediction.




Figure 10: Stacking models.
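The flow in Figure 10 is built from SAS Enterprise Miner nodes. As a language-neutral illustration of the same idea, the Python sketch below fits a small second-stage logistic regression whose only inputs are the two first-stage probabilities. The function names, the plain gradient-descent fitting and the default learning rate and epoch count are assumptions made for the example; they do not describe the regression node used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_second_stage(p_rules, p_mining, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on the two first-stage probabilities.

    p_rules  : probability of positive sentiment from the rule-based model
    p_mining : probability of positive sentiment from the data mining model
    labels   : 1 for positive documents, 0 for negative documents
    """
    w_rules, w_mining, bias = 0.0, 0.0, 0.0
    n = len(labels)
    for _ in range(epochs):
        g_rules = g_mining = g_bias = 0.0
        for pr, pm, y in zip(p_rules, p_mining, labels):
            err = sigmoid(w_rules * pr + w_mining * pm + bias) - y
            g_rules += err * pr
            g_mining += err * pm
            g_bias += err
        # Batch gradient step on the average log-loss gradient.
        w_rules -= lr * g_rules / n
        w_mining -= lr * g_mining / n
        bias -= lr * g_bias / n
    return w_rules, w_mining, bias


def predict_stacked(weights, pr, pm):
    """Final probability of positive sentiment from the stacked model."""
    w_rules, w_mining, bias = weights
    return sigmoid(w_rules * pr + w_mining * pm + bias)
```

To avoid an optimistic second stage, the first-stage probabilities used for fitting should come from documents that the first-stage models did not train on.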








Results

We experimented with the sentiment analysis approaches presented in this paper
using the movie review data sets. The Yahoo movie data set was used to analyze
sentiment at the attribute level, and the Pang and Lee data set was used for the
overall sentiment predictions.



Attribute-Level Results

Table 2 shows the results for the attribute-level sentiment analysis on the Yahoo
movie data. The Yahoo data had explicit user ratings for the different attributes,
and we compared those ratings with the predictions made by the rule-based
model developed with SAS Sentiment Analysis. We spent three days on the rule-
development process. The Yahoo data included some reviews where a user rating
was available for a particular attribute, but the attribute itself was not discussed
in the text of the review. We did not include such reviews in the evaluation of the
attribute. We also did not include the general attribute because no user ratings were
available for it. A user rating of C+ or higher was considered positive, and C- or
lower was considered negative.


 Attribute              Num Reviews         Misclass Rate
 Story                  972                 .23
 Cast                   1272                .14
 Direction              243                 .17
 Visuals                459                 .12
 Aggregate              2946                .18


Table 2: Attribute-level results.


With just three days of effort on rule development, we were able to achieve an
overall accuracy of 82 percent at the attribute level (an aggregate misclassification
rate of .18). The misclassification rate for the story attribute was higher than for
the other attributes, which indicates that the rule developer should further refine
the rules for that attribute. Rule refinement is an ongoing process, and accuracy can
improve over time.
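The attribute-level evaluation can be sketched in a few lines of Python. The grade scale below and the treatment of a plain “C” are assumptions: the paper states only that C+ or higher counts as positive and C- or lower as negative, so a plain “C” is left unlabeled and skipped here. A predicted polarity of `None` stands for a review in which the attribute was not discussed, which the evaluation excludes.

```python
# Assumed letter-grade scale, ordered from worst to best.
GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+",
          "B-", "B", "B+", "A-", "A", "A+"]


def grade_to_polarity(grade):
    """Convert a user letter grade to 'positive', 'negative' or None."""
    rank = GRADES.index(grade)
    if rank >= GRADES.index("C+"):
        return "positive"
    if rank <= GRADES.index("C-"):
        return "negative"
    return None  # plain "C": not covered by the stated thresholds


def misclassification_rate(pairs):
    """pairs: iterable of (user_grade, predicted_polarity or None)."""
    scored = wrong = 0
    for grade, predicted in pairs:
        actual = grade_to_polarity(grade)
        if actual is None or predicted is None:
            continue  # unrated side, or attribute not discussed in the text
        scored += 1
        wrong += int(actual != predicted)
    return wrong / scored if scored else float("nan")
```

Running this once per attribute, and once over all attributes pooled, gives figures in the form reported in Table 2.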



Overall Results

Table 3 shows the results of our comparisons of the Pang and Lee data. For the
data mining approach, 1,800 random movie reviews were used for training a model,
and 200 reviews were held out to be scored. This process was repeated four times,
and the misclassification scores were averaged. For each run, the same set of 200
reviews was analyzed in SAS Sentiment Analysis so that the comparisons were
made on the same set of data.
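The repeated hold-out procedure described above can be sketched in Python as follows. The `fit` and `predict` callables are placeholders for whatever model is being evaluated, and the random seed and function name are assumptions added so that the example is reproducible. The paper does not say whether the four held-out sets overlapped, and this sketch does not enforce that they are disjoint.

```python
import random


def repeated_holdout(documents, labels, fit, predict,
                     holdout=200, runs=4, seed=0):
    """Average misclassification rate over several random hold-out splits.

    fit(train_docs, train_labels) -> model
    predict(model, doc)           -> predicted label
    """
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    rates = []
    for _ in range(runs):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:holdout], indices[holdout:]
        model = fit([documents[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        wrong = sum(predict(model, documents[i]) != labels[i]
                    for i in test_idx)
        rates.append(wrong / holdout)
    return sum(rates) / runs
```

Scoring every approach on the same held-out indices in each run, as the paper does, keeps the comparison between approaches fair.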


      Approach                                             Misclass Rate
 1    SAS Text Miner                                       .144
 2    SAS Sentiment Analysis Attribute-Level Rules         .252
 3    Add Polarity Scores as Features in SAS Text Miner    .132
 4    Blended                                              .139

Table 3: Overall sentiment misclassification results.


The results obtained with the text mining model were achieved by using a category-
specific weighting and by having enough training data. The SAS Sentiment Analysis
overall sentiment model was derived from the rules for the individual attributes.
Under these conditions, the rule-based model did not perform as well as the SAS
Text Miner model. However, combining the models – by using the polarity scores as
features in the SAS Text Miner model, or by blending the two models – did improve
results.




Other Applications


Importing Models

SAS Sentiment Analysis can build a hybrid model using rules combined with a Naïve
Bayes algorithm. However, to leverage all the predictive analysis advantages of
SAS® Enterprise Miner™ software, the models from SAS Sentiment Analysis must
be imported into SAS Enterprise Miner. This can be done easily by using the SAS
Enterprise Miner model import node. Once the output of SAS Sentiment Analysis
is imported, models can be combined in various ways and then compared with
the model assessment node. Figure 11 shows the receiver operating characteristic (ROC)
plot from the model assessment node after a SAS Sentiment Analysis model was
imported.








Figure 11: ROC chart of SAS Enterprise Miner models with an imported SAS Sentiment
Analysis model (denoted by model import). In this graph, “TM” denotes SAS Text Miner
and “RuleIn” refers to using SAS Sentiment Analysis rules in conjunction with
SAS Text Miner.


Creating Training Data

As discussed earlier, training data that has the “answers” is an essential part of a
text mining approach: it is needed to build a predictive model that can make
accurate sentiment predictions. It is also important for a rule-based system because
it lets you validate how your rules are doing. The feedback lets you know if you need
to add or remove specific rules, or if you must refine certain rules. Unfortunately,
training data is not always available, and creating this data can be an expensive time
commitment.

One approach to creating training data is to use very precise rules that will make a
sentiment classification only on the documents you are most sure about. The trade-off
is that many documents receive no sentiment category; sentiment is assigned only to
a small subset of documents.








We applied this approach to the movie review data by choosing rules that captured
complete phrases that seemed, in our opinion, to indicate the overall sentiment. For
instance, we included a set of rules that would trigger a positive score for a review
that contained phrases like:

   “I thoroughly enjoyed this movie.” or “I totally loved the film.”

When these types of phrases occurred in the document, the polarity was rated
positive. Similarly, corresponding precise rules were added for negative polarity.

When we applied this approach to our movie review collection, 103 of the 2,000
documents triggered our rules. (While 103 documents is too few for an effective set
of training data, at the same hit rate a larger pool of 20,000 reviews would likely
have yielded about 1,000 documents for the training set.) We still confirmed the
polarity by reviewing each of the 103 documents. Since SAS Sentiment Analysis
highlights the rules in context, it was quick work to check that each rule match was
an appropriate trigger. Based on our manual review, eight of the 103 documents
appeared to have been assigned the wrong polarity, so we corrected those so that our
training data would be free of errors.
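The high-precision labeling idea can be illustrated with a short Python sketch. The positive pattern is modeled on the two example phrases quoted above. The negative pattern is a hypothetical counterpart, since the paper does not list its negative rules, and the real rules were written in SAS Sentiment Analysis rather than as regular expressions.

```python
import re

# Modeled on the example phrases; deliberately narrow to favor precision.
POSITIVE = re.compile(
    r"\bI (?:thoroughly|totally) (?:enjoyed|loved) (?:this|the) (?:movie|film)\b",
    re.IGNORECASE,
)
# Hypothetical negative counterpart (not taken from the paper).
NEGATIVE = re.compile(
    r"\bI (?:thoroughly|totally) (?:hated|disliked) (?:this|the) (?:movie|film)\b",
    re.IGNORECASE,
)


def seed_label(text):
    """Return 'positive', 'negative', or None when the rules should abstain."""
    pos = bool(POSITIVE.search(text))
    neg = bool(NEGATIVE.search(text))
    if pos == neg:
        return None  # no rule fired, or both fired: leave the review unlabeled
    return "positive" if pos else "negative"


def build_seed_set(reviews):
    """Keep only the reviews that received a label, paired with that label."""
    labeled = ((review, seed_label(review)) for review in reviews)
    return [(review, label) for review, label in labeled if label is not None]
```

Most reviews return `None` and are left out, which is the intended trade-off. The labeled subset should still be checked by hand before it is used as training data.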



Other Capabilities of SAS® Enterprise Miner™

This paper has primarily focused on combining the rule-based capabilities of SAS
Sentiment Analysis with the text mining capabilities of SAS Text Miner, in conjunction
with the predictive models available in SAS Enterprise Miner. There is much more
functionality in SAS Enterprise Miner that can be used to help you understand
the sentiment contained in a collection and to build on the rule models you have
developed. Functionality such as sequences and associations, decision trees,
SOM-Kohonen self-organizing maps, variable clustering, transformations and sampling,
and statistical exploration has been used in various contexts to supplement
textual understanding.




Conclusions

Independently, both the domain knowledge and the data mining approaches to
sentiment analysis have their strengths and weaknesses; but hopefully you will not
be forced to choose between using one or the other for your analysis. In this paper,
we have shown that the two approaches complement one another. So, while the
NLP approach leverages the rule builder’s domain knowledge, text mining can also
be used by that person to improve, clarify or correct how that knowledge relates to
the particular collection being analyzed. Text mining reveals important patterns in the
specific collection that assist the rule builder.








On the other hand, the text mining approach allows you to quickly build a sentiment
classifier with term frequencies alone. But without any semantic or syntactic
indicators, mistakes that would seem elementary to a human can easily occur. We
have shown that these linguistic indicators can be captured by a rule-based system
and then leveraged in the statistical classifier as additional features, or as a blended
model. The end result is a model that is better than either one individually.




References
1. Albright, Russ. Taming Text with the SVD. January 2004. SAS: Cary, NC. Web:
http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf.

2. Pang et al. “Thumbs Up? Sentiment Classification Using Machine Learning
Techniques.” Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP). 2002. 79-86.




        The authors thank James Cox and Janardhana Punuru from the SAS Text
        Analytics Research and Development team for their helpful comments
        and suggestions. They also thank Fiona McNeill from SAS Marketing for
        encouraging them to work on this paper and providing valuable feedback.




SAS Institute Inc. World Headquarters                                   +1 919 677 8000
To contact your local SAS office, please visit: www.sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA
and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Copyright © 2011, SAS Institute Inc. All rights reserved. 105008_S59083.0211

The Definitive Guide to Marketing AutomationThe Definitive Guide to Marketing Automation
The Definitive Guide to Marketing AutomationC.Y Wong
 
Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013
Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013
Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013C.Y Wong
 
Customer Lifecycle Engagement
Customer Lifecycle EngagementCustomer Lifecycle Engagement
Customer Lifecycle EngagementC.Y Wong
 
Digital Marketing Plan Template
Digital Marketing Plan TemplateDigital Marketing Plan Template
Digital Marketing Plan TemplateC.Y Wong
 
47 Amazing Blog Designs
47 Amazing Blog Designs 47 Amazing Blog Designs
47 Amazing Blog Designs C.Y Wong
 
The Rise of Digital Influence
The Rise of Digital InfluenceThe Rise of Digital Influence
The Rise of Digital InfluenceC.Y Wong
 
Project Management Methodology
Project Management MethodologyProject Management Methodology
Project Management MethodologyC.Y Wong
 
Creating a One to One Dialogue Through Social Interaction
Creating a One to One Dialogue Through Social InteractionCreating a One to One Dialogue Through Social Interaction
Creating a One to One Dialogue Through Social InteractionC.Y Wong
 

Más de C.Y Wong (20)

Untitled Presentation
Untitled PresentationUntitled Presentation
Untitled Presentation
 
Top 10 ways_to_improve
Top 10 ways_to_improveTop 10 ways_to_improve
Top 10 ways_to_improve
 
B2B Content Marketing 2012
B2B Content Marketing 2012 B2B Content Marketing 2012
B2B Content Marketing 2012
 
Whitepaper New Content Marketer
Whitepaper New Content MarketerWhitepaper New Content Marketer
Whitepaper New Content Marketer
 
Inbound Marketing Cheat Sheet
Inbound Marketing Cheat SheetInbound Marketing Cheat Sheet
Inbound Marketing Cheat Sheet
 
Getting Started with SEO
Getting Started with SEOGetting Started with SEO
Getting Started with SEO
 
Getting Started with Marketing Measurement
Getting Started with Marketing MeasurementGetting Started with Marketing Measurement
Getting Started with Marketing Measurement
 
Facebook Advertising Performance
Facebook Advertising PerformanceFacebook Advertising Performance
Facebook Advertising Performance
 
Community Manager - Insights 2013
Community Manager - Insights 2013Community Manager - Insights 2013
Community Manager - Insights 2013
 
Best Practices from the Worlds Most Social Brands
Best Practices from the Worlds Most Social BrandsBest Practices from the Worlds Most Social Brands
Best Practices from the Worlds Most Social Brands
 
How to Use Twitter for Business
How to Use Twitter for BusinessHow to Use Twitter for Business
How to Use Twitter for Business
 
10 Awesomely Provocative Stats for Your Agency's Pitch Deck
10 Awesomely Provocative Stats for Your Agency's Pitch Deck 10 Awesomely Provocative Stats for Your Agency's Pitch Deck
10 Awesomely Provocative Stats for Your Agency's Pitch Deck
 
The Definitive Guide to Marketing Automation
The Definitive Guide to Marketing AutomationThe Definitive Guide to Marketing Automation
The Definitive Guide to Marketing Automation
 
Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013
Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013
Strong Success Guide - 13 Cross Channel Marketing Strategies for 2013
 
Customer Lifecycle Engagement
Customer Lifecycle EngagementCustomer Lifecycle Engagement
Customer Lifecycle Engagement
 
Digital Marketing Plan Template
Digital Marketing Plan TemplateDigital Marketing Plan Template
Digital Marketing Plan Template
 
47 Amazing Blog Designs
47 Amazing Blog Designs 47 Amazing Blog Designs
47 Amazing Blog Designs
 
The Rise of Digital Influence
The Rise of Digital InfluenceThe Rise of Digital Influence
The Rise of Digital Influence
 
Project Management Methodology
Project Management MethodologyProject Management Methodology
Project Management Methodology
 
Creating a One to One Dialogue Through Social Interaction
Creating a One to One Dialogue Through Social InteractionCreating a One to One Dialogue Through Social Interaction
Creating a One to One Dialogue Through Social Interaction
 

Último

CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 

Último (20)

CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

Combining Knowledge and Data Mining to Understand Sentiment

  • 1. WHITE PAPER Combining Knowledge and Data Mining to Understand Sentiment – A Practical Assessment of Approaches
  • 2. COMBINING KNOWLEDGE AND DATA MINING TO UNDERSTAND SENTIMENT

Table of Contents

Abstract
Introduction
The Elements of Sentiment Analysis
   What Is Sentiment Analysis?
   When Is It Relevant?
   Elements of Sentiment Analysis
Sentiment Analysis Methods
   The Data
   Data Mining Approach
        Benefits of the data mining approach
        Drawback of the data mining approach
   Natural Language Processing Approach
        Step one: taxonomy identification
        Step two: defining objects and attributes
        Step three: defining polarity
        Benefits of the NLP approach
        Drawback of the NLP approach
The Best of Both Worlds
   Data Mining of the Text for the Rule Builder
   Hybrid Approaches
        Polarity scores as additional features
        Stacked models
Results
   Attribute-Level Results
   Overall Results
Other Applications
   Importing Models
   Creating Training Data
   Other Capabilities of SAS® Enterprise Miner™
Conclusions
References
  • 3. Russell Albright is a Research Statistician Developer at SAS and has been working on SAS® Text Miner algorithms since its initial release more than 10 years ago. He holds a master’s and a doctorate in applied math from Clemson University. Albright has expertise in numerical matrix methods and Bayesian networks, and he has experience applying text mining to many Web-based sources, including Twitter, Yahoo and PubMed.

Praveen Lakkaraju is a Software Developer at SAS and is a member of the SAS Text Analytics research and development team. His areas of experience include sentiment analysis, information retrieval and content categorization. He was instrumental in the launch of the SAS Social Media Analytics solution, and is still actively involved in its development. Lakkaraju holds a master’s in computer science from the University of Kansas, where he specialized in the field of natural language processing.
  • 4. Abstract

An important application of text analytics is to automatically characterize the sentiment of documents in a variety of domains, whether it is positive, negative or neither. In this paper we explore the benefits of combining domain-specific linguistic rules with data mining methods to improve both the effectiveness of your models and the efficiency of the model builder.

Introduction

Our world has changed drastically in the last 10 years. An individual’s opinions are no longer shared only with his or her immediate family and friends, but instead are capable of influencing the decisions of thousands or even millions of people the individual has never even met. The Internet has given the individual a platform to broadcast grievances and recommendations that can reach across the world. And the existence of social networks gives these opinions the potential to snowball into a viral frenzy that can make your company’s products or services a worldwide boon or a global catastrophe in just a matter of days.

The savvy marketer monitors and evaluates relevant Web content continually to understand consumer sentiment toward products or services from his company, and toward his competitors. This attention to Web content allows the company to respond quickly to customer opinion. The sheer volume of references related to your company’s products or services makes automating this task essential. Sources such as blogs, product reviews, forums and news articles can all be monitored, scored for relevance against your topics of interest, and then classified according to sentiment.

The Elements of Sentiment Analysis

What Is Sentiment Analysis?

Sentiment analysis is an automatic method that provides feedback to you regarding the opinions and attitudes of your customers. The analysis is based on customers’ electronic written commentaries regarding your products and services and those of your competitors. The feedback can be provided at a very high level with drill-down so that you can explore how opinions differ within groups, subgroups and even at the individual level.
  • 5. More precisely, sentiment analysis is the process of classifying or rating the opinions or sentiment expressed in a document. The rating may assign the sentiment to one of three categories (positive, negative or neutral), or it may instead assign a numeric score. The rating that is assigned is termed polarity. The sentiment may be assessed for the entire document or for particular objects or attributes mentioned in the document.

When Is It Relevant?

Sentiment analysis is relevant in almost every context in which your customers or potential customers express themselves in written form, and possibly spoken form, via different communication channels. These comments may not have been intended for direct consumption by your company. They may have been posted in website forums, tweets, blogs or other Web pages and directed toward your potential customers. On the other hand, some content may have been intentionally directed at your company through e-mail, a company support website, a survey questionnaire, a call center desk, etc.

Automated sentiment analysis is important to implement when you are inundated with relevant, useful feedback through these channels. For many companies, it is impossible for individuals to monitor and understand all that is communicated in these sources due to their sheer volume. The information comes too quickly and from too many channels. Sentiment analysis provides you with an immediate interpretation, not just of every individual comment but also of the global opinions expressed.

Elements of Sentiment Analysis

You cannot implement a comprehensive sentiment analysis solution with a process that merely analyzes the sentiment of a document. Instead, you must coordinate several tasks to maximize the benefits.

1. Data acquisition phase. This phase involves setting up an automated process to obtain a clean set of documents to analyze. You can use SAS software to obtain the documents from the Internet and from local file systems or databases. SAS software can also be used to filter the documents by eliminating any “noise” that is common to Web documents (e.g., filtering spam).

2. Sentiment assignment phase. This phase involves creating a model that can calculate the polarity of the author’s sentiment or opinion toward your topics of interest and applying that model to new (naïve) documents. SAS technologies can help you derive accurate assessments of sentiment.

3. Summarization and reporting phase. Identifying sentiment within a particular document is interesting in itself, but frequently it will be of more interest to characterize representative populations within your collection. SAS provides techniques for such exploration, which entails answering questions such as:
  • 6.
   • Does the age of our customer tend to make a difference in his or her opinion about our service?
   • How do the cumulative opinions about our competitor’s product compare with the cumulative opinions about our product?
   • Did our customers perceive the changes we made to our outlet stores as beneficial, or not?

4. Repetition phase. The final step in your sentiment analysis project will be to set up a process to automate the entire analysis on a repeated basis. This allows you to monitor sentiment changes, identify important influencers and respond quickly to what you learn.

For this paper we will focus primarily on the sentiment assignment phase. Note that since text is written in natural language and not with a precise quantitative representation, there are many challenges in analyzing it effectively for sentiment. For one, natural language text is full of ambiguities, implicit meaning and subtle nuances. Normally a human reader has the necessary experience both to understand natural language expressions and to comprehend the meaning of the subject area along with the sentiment the author intended to communicate. But automating this process in a computer can be challenging. Such things as slang, pronoun resolution, sarcasm and idioms all make a direct interpretation of the text difficult. Further, an automatic process will not function at the semantic level of the text at all unless there is a direct mapping of a linguistic rule to semantics. In many instances this can be captured with the rules we will discuss later, but the diversity of ways to express the same meaning can make it difficult to accurately capture all situations with a set of rules.

There are two primary approaches to building models for sentiment analysis. The first, natural language processing, uses a domain expert to build a set of linguistic rules to determine the sentiment polarity of the document’s content.
The second, machine learning, uses training data (documents that have the sentiment polarity already assigned to them) to build a predictive model. Predictive models such as decision trees, logistic regressions or neural networks then predict the polarity of documents that are outside the training set.

Sentiment Analysis Methods

The Data

  • 7. We will use two collections of movie review data to demonstrate the techniques presented in this paper. The first collection, created by Pang and Lee, contains 2,000 movie reviews. The collection is split evenly with 1,000 positive and 1,000 negative reviews.¹ The second collection was obtained by retrieving 6,631 movie reviews from Yahoo.² This collection has both overall ratings for the movie being discussed and also ratings for several attributes of each movie, including the story line, cast, direction and visuals. Although your data is almost certainly not movie review data, the concepts and techniques demonstrated using this movie data are applicable to most other sentiment-related text data sets.

Data Mining Approach

A data mining approach to sentiment analysis translates an unstructured text problem to one that makes predictions on structured, quantitative data. The approach borrows several techniques from the computational linguistics and information retrieval communities to represent the text numerically, and then applies traditional data mining techniques to this numeric representation. In the end, a target variable is identified and a pattern is discovered from the training data for predicting sentiment polarity. This pattern can then be used to predict new observations.

The first step in creating the numeric representation is to convert the entire training collection into a document-by-term frequency matrix. Each document is parsed into individual terms, or term/part-of-speech pairs. The set of all terms then becomes the set of variables in the data set, so that documents are now represented as vectors of length equal to the number of distinct terms in the collection. These vectors are very sparse, containing mostly zeroes, because any one document contains a very small percentage of the terms in the collection. Once the documents are represented as vectors, the frequencies in each cell can be weighted with a function that takes into account the distribution of the term across the collection and relative to the levels of the target variable.

After these document vectors are formed, a dimension reduction technique, such as the singular value decomposition (see Taming Text with the SVD, Albright, 2004), is typically used to represent each document in a reduced-dimensional space of maybe 50 to 100 variables, where each variable is a linear combination of the weighted terms that originally represented each document.

Finally, these reduced-dimensional vectors, together with the sentiment variable, can be supplied to a predictive model. The model will attempt to learn from the training data by utilizing patterns in the reduced-dimensional vectors. This predictive model will then create a function that will predict the sentiment for any document.

¹ The Pang and Lee movie review data is available at: http://www.cs.cornell.edu/People/pabo/movie-review-data
² Yahoo movie reviews were obtained from: http://movies.yahoo.com
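The first two steps described above (the document-by-term frequency matrix and the term weighting) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS Text Miner implementation: the tokenizer is a plain whitespace split, the weight is a simple inverse-document-frequency function chosen for illustration (the paper does not name its weighting function, and this one ignores the target variable), the sample documents are invented, and the dimension reduction step is left out because it needs a linear algebra library.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    stripped = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in stripped if t]

def build_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the count of vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows

def idf_weight(rows):
    """Scale each count by log(N / document frequency) (an assumed weighting)."""
    n_docs, n_terms = len(rows), len(rows[0])
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    return [[r[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for r in rows]

# Invented sample documents, for illustration only.
docs = [
    "A great movie with a great cast",
    "The plot was dull and the cast was worse",
    "Great cast, dull plot",
]
vocab, rows = build_matrix(docs)
weighted = idf_weight(rows)

for term, column in zip(vocab, zip(*weighted)):
    print(term, [round(w, 2) for w in column])
```

A term that appears in every document, such as "cast" here, receives a weight of zero under this scheme, while rarer terms are emphasized. Because each row holds only counts, two documents containing the same words in a different order produce identical rows.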
  • 8. Benefits of the data mining approach

The data mining approach is appealing because it is based on learning patterns that are useful for making automated, efficient predictions. The algorithms are capable of discovering unimagined and complicated patterns that would be beyond what a human could anticipate. Frequently, a data mining approach can beat a rule-based approach in topic classification. Of course, this is dependent on having enough training data to build the model.

Drawback of the data mining approach

The vector-based representation of a document, which is required for data mining techniques, does not maintain information that is potentially important to sentiment classification. For example, the vector representation does not capture when terms are close to one another in the document, whether one term precedes another, or any other contextual cues. The order of terms in a phrase can significantly affect meaning. Consider the phrases:

“… night for a great movie” and
“… great night for a movie”

These two phrases convey two different meanings; yet in a vector representation, the phrases have an identical representation.

In addition, most predictive models provide little feedback to the user as to precisely why a particular document was classified as having positive or negative polarity. So when you attempt to understand what positive things people said in a particular document, you frequently have to read the entire document to discover the answer.

As a final drawback, forming the training and validation data is an essential component of learning a predictive model, but it can be very time-consuming and challenging. A rating needs to be provided for every document, and if there are attributes of documents that you wish to use to measure sentiment, you will need to provide a rating for each of these as well. Another complication is that two different reviewers frequently assign two different sentiment ratings to the same document. This can introduce unexpected errors in building and measuring the performance of your model.

Natural Language Processing Approach

Natural language processing (NLP) is a field of artificial intelligence that deals with automatically extracting meaning from natural language text. As discussed in the introduction of this paper, it’s very challenging to get machines to understand text at the same level as humans. Doing this with the specific goal of extracting sentiment is even more challenging. For example, consider the text snippet below:
“… with that out of the way, let me say this – this film is bad. This film is really, really bad. Yet somehow, it is strangely enjoyable. …”

If interpreted by a human, the above text would imply a positive sentiment from the author toward the movie. However, it can be very challenging to get the same output from a computer because of the dense presence of strongly negative words.

The rule-based NLP methods use certain entities and syntactic patterns in the text to understand its meaning. SAS Sentiment Analysis provides all the tools needed for this kind of disambiguation. You can use a combination of language dictionaries, linguistic constructs like parts of speech, and noun phrases along with a range of operators. The operators fall into a few different categories as shown below:

• Boolean operators. Used to include or exclude different entities (e.g., AND, OR, NOT).
• Frequency operators. Used to measure the specified number of occurrences of certain entities (e.g., MIN, MINOC, MAXOC).
• Context operators. Used to measure the context within which certain entities occur in the text (e.g., DIST, START, END, SENT, PARA).
• Sequence operators. Used to look for the entities in a specific sequence (e.g., ORD, ORDDIST).

The process of developing rule-based models for sentiment analysis involves a few different steps. These are explained below.

Step one: taxonomy identification

The initial step in the NLP approach is taxonomy identification. Taxonomy here refers to a simple, two-level hierarchy where you specify the different objects and attributes for which you want to extract sentiment. You can either use a predefined taxonomy or you can use text mining to learn the most prominent objects and their attributes in the corpus and then make them part of your taxonomy. Figure 1 shows the predefined taxonomy that we used for extracting sentiment from the movie review data. The discovery-based text mining methods are discussed later in this paper.
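The four operator categories listed above can be pictured with a small Python sketch. This is not SAS Sentiment Analysis rule syntax: the function names (all_of, any_of, min_count, within_distance, in_order) and the simple tokenizer are assumptions made purely for illustration, loosely mirroring the Boolean, frequency, context and sequence categories.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer (an assumption; real systems parse far more richly)."""
    return re.findall(r"[a-z']+", text.lower())

def all_of(tokens, *terms):
    """Boolean AND: every term must occur somewhere in the text."""
    return all(t in tokens for t in terms)

def any_of(tokens, *terms):
    """Boolean OR: at least one term must occur."""
    return any(t in tokens for t in terms)

def min_count(tokens, term, n):
    """Frequency: the term must occur at least n times."""
    return tokens.count(term) >= n

def within_distance(tokens, a, b, max_gap):
    """Context: some occurrence of a lies within max_gap tokens of some occurrence of b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

def in_order(tokens, *terms):
    """Sequence: the terms occur in the given order (not necessarily adjacent)."""
    start = 0
    for term in terms:
        try:
            start = tokens.index(term, start) + 1
        except ValueError:
            return False
    return True

snippet = ("let me say this - this film is bad. This film is really, really bad. "
           "Yet somehow, it is strangely enjoyable.")
tokens = tokenize(snippet)
```

A check such as in_order(tokens, "bad", "yet", "enjoyable") hints at how a sequence operator could pick up the concession pattern in the snippet quoted earlier, which a simple count of negative words would miss.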
Figure 1: Taxonomy for movie reviews.

Step two: defining objects and attributes

The next step is to define the objects and their attributes. A basic approach to defining these is to identify their synonyms or the different ways they may be referred to in the text. Figure 2 shows an example.

Figure 2: Example of defining the visuals attribute.

While this approach captures many cases, in other situations the attribute might be referred to using its co-referent. Consider the example below:

“The movie starred Jennifer Aniston. The plot of the movie was very interesting. Aniston’s performance was commendable. She looks adorable.”
Here the full name of the actress is mentioned only in the first sentence. In the subsequent sentences, she is referred to by her last name and by a pronoun. These three mentions are said to be co-referent, and the process of identifying them is called co-reference resolution. The rule-based methods allow you to write rules to handle such cases.

Step three: defining polarity

Polarity is determined by associating predefined positive or negative terms or expressions with the attributes that have been identified. Dictionaries of subjective expressions are available and can be customized to specific domains (see Figure 3).

Figure 3: Example of a generic dictionary of positive keywords.

You could also define multiple classes of subjective expressions to denote different levels of subjectivity:

“incredible,” “stunning” ➔ strong positive
“hate,” “disgust” ➔ strong negative

Assigning the appropriate polarity requires that negations be handled properly. To do this, you can use a combination of part-of-speech tags and dictionaries as shown in Figures 4 and 5.
Figure 4: Example of a class of negated adjectives.

In Figure 4, “NegClass” is a dictionary of expressions that denote a negation (for example, “not,” “will not” and “have not”), and “:Adv,” “:A” and “:V” represent any adverb, adjective and verb, respectively.

Figure 5: Example of a negation rule.

Finally, to extract the sentiment at attribute level, you can write context-based rules as shown in Figure 6, where we used a combination of operators.
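As a rough illustration of how a negation dictionary and a distance-based context rule might work together, here is a minimal Python sketch. It does not reproduce the rules in Figures 4 to 6 and is not SAS rule syntax: the dictionaries (NEG_CLASS, POSITIVE, NEGATIVE, VISUALS), the window sizes and the function names are all hypothetical.

```python
import re

NEG_CLASS = {"not", "never", "no"}                  # hypothetical negation dictionary
POSITIVE = {"incredible", "stunning", "good"}       # hypothetical positive terms
NEGATIVE = {"hate", "disgust", "bad"}               # hypothetical negative terms
VISUALS = {"visuals", "effects", "cinematography"}  # hypothetical attribute synonyms

def tokenize(text):
    """Lowercase alphabetic tokenizer (an assumption for this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def term_polarity(tokens, i, neg_window=2):
    """Return +1, -1 or 0 for the token at position i, flipping the sign when a
    negation word appears within neg_window tokens before it."""
    word = tokens[i]
    if word in POSITIVE:
        score = 1
    elif word in NEGATIVE:
        score = -1
    else:
        return 0
    preceding = tokens[max(0, i - neg_window):i]
    if any(t in NEG_CLASS for t in preceding):
        score = -score
    return score

def attribute_sentiment(sentence, attribute_terms, max_dist=4):
    """Sum the polarity of sentiment terms that lie within max_dist tokens
    of any mention of the attribute."""
    tokens = tokenize(sentence)
    mentions = [i for i, t in enumerate(tokens) if t in attribute_terms]
    total = 0
    for i in range(len(tokens)):
        if any(abs(i - m) <= max_dist for m in mentions):
            total += term_polarity(tokens, i)
    return total
```

With these assumed dictionaries, “The effects were not very good.” scores negative for the visuals attribute because “not” falls inside the negation window before “good,” while a sentence that never mentions the attribute contributes nothing.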
Figure 6: Example of an attribute-level sentiment rule.

Benefits of the NLP approach

The major advantage of rule-based methods is the amount of control they give rule developers over how the analysis will be performed. Developers can use their knowledge of the domain and the language within it to develop rules that have high precision.

Unlike statistical analysis, the results of rule-based analysis are easily interpretable. This is very important for real-life applications where the analysts need to know exactly why a document or an attribute within a document was tagged as positive or negative. In other words, analysts need to know exactly what sentences, keywords or context within the document triggered the positive or negative sentiment. Figure 7 shows an example of this.

“I think they did a fantastic job this movie. I read the book, I loved the book, and I loved the movie! My only qualm was Javier bardem playing a Brazilian when he is SPANISH! Julia Roberts was perfect and beautfiul. Wonderful casting job (with the exception of Bardem)! Good acting. Some parters were a tad confusing for those who haven’t read the book. But I took my mom, who didn’t read the book, and she really liked it. It’s not just some sappy chick flick. It’s a powerful journey about finding yourself hen you let yourself GO! Empowering. Perfection. = EAT PRAY LOVE! Lovely”

Figure 7: Example showing different entities that were used for rule-based analysis.

Rule-based methods are completely unsupervised; that is, they do not require any training data. This is a big advantage in real-life applications where training data is scarce. The non-availability of training data is more pronounced when it comes to granular sentiment analysis (sentiment derived at the objects and attributes level).
Another advantage of rule-based methods is their ability to refine the rules over time based on the feedback from analysts or subject-matter experts. The more time the rule developer spends on refining the rules, the better the results. Language evolves over time and people start using newer terms to express their sentiments. This is especially true for social media, where the language used changes all the time. In such cases, rule-based methods give you the flexibility needed to adjust your models accordingly.

Drawback of the NLP approach

The disadvantage of rule-based methods is that they require a lot of human involvement in developing the rules. These methods completely rely on the domain knowledge of rule developers. It might take a few weeks to come up with a strong rule-based model for a new domain. However, once you have a strong rule-based model for a domain, you can reuse that model with some minor modifications for different applications within the domain.

The importance of validation data is often underestimated while developing these models. The rules being written must be generic enough so that they are capable of handling all possible cases. Inexperienced rule developers tend to over-fit their rules to the sample data they are working with. Such rules might not work well when tested on different data sets. So, rule developers must make sure they validate the rules on different data sets before considering a model ready to deploy.

The Best of Both Worlds

As we discussed earlier, data mining learns relevant patterns from a numerical representation of the entire collection, and the patterns discovered are derived by analyzing the collection as a whole. The rule builder, on the other hand, relies only on personal experience and knowledge to formulate rules that will be useful for sentiment analysis.

Because they approach the problem so differently, data mining and rule-based systems can complement one another. They can do this in two ways. First, unsupervised data mining can be used as a tool for the rule builder; and second, the supervised data mining model can be combined with the rule-based model in such a way that the strengths of each model are combined, and any possible mistakes made by one model can be corrected by the other.

Data Mining of the Text for the Rule Builder

The challenge of the rule builder is to devise and formulate rules that capture the sentiment contained in the collection. To do this, the rule builder must have some understanding of the content of the documents that are being categorized. For instance, in our movie review collection, are all the reviews about a specific movie or are they about a specific genre of movies? If we know, we can save time by writing rules that are only directed to a particular movie or genre. On the other hand, if the reviews are about movies from many different genres, we must consider how that knowledge affects the rules we write. Otherwise, we might not capture the sentiment accurately. For instance, when discussing a horror movie, the statement “The scariest thing I have ever seen” is typically an indicator that the reviewer enjoyed the movie. But it could be a negative indicator if the reviewer was discussing a children’s movie.

Unsupervised text mining allows you to quickly get a handle on the collection you are examining without spending time reading many individual documents. SAS Text Miner provides a node both for generating topics within a document and for clustering the documents. These approaches are useful for understanding the collection and for revealing significant aspects of the data. Table 1 shows that our collection is quite varied.

ID | Descriptive Terms | Freq. | Pct.
1 | + horror, + killer, + scary, + scream, horror, + reason, last, minutes | 155 | 8%
2 | + animation, adults, animated, disney, voice, children, kids, + feature | 73 | 4%
3 | coen, fargo, money, wife, different, pretty, sequences, guy | 37 | 2%
4 | + war, world, life, love, + sense, + fight, right, + father | 267 | 13%
5 | + comedy, jokes, + funny, funny, fun, script, back, cast | 213 | 11%
6 | earth, effects, special effects, special, star, + action, + people, interesting | 276 | 14%
7 | + action, + fight, sequences, bad, fun, guy, special effects, acting | 177 | 9%
8 | + comedy, mother, + father, woman, funny, love, + family, high | 400 | 20%
9 | performances, mother, performance, love, down, + point, last, different | 117 | 6%
10 | + thriller, case, + action, + killer, wife, + job, performance, script | 285 | 14%

Table 1: Ten clusters from the Pang and Lee data.

The clusters reveal several prominent categories of movies, reminding rule builders that they need to consider how people express sentiment in the following types of movies:

• Horror movies.
• Animation and children’s movies.
• Comedies.
• Science fiction movies.
• Action movies.
• Thrillers.

If you, as the rule builder, had not been thinking of how people express their opinions about movies from these different categories, it could be easy to incorrectly capture the sentiment contained in them.

Further discovery can be done to capture the sentiment of individual attributes within the document. For instance, the SAS Text Miner filter node lets you subset the collection to the documents that contain the visuals attribute synonyms displayed in Figure 2. In Figure 8, the search expression has been set to include only those documents that contain at least one of the visuals attribute synonyms used in the rule building. The special character “*” implies a wildcard search is to occur, and the quoted input means that only the exact phrase, “special effects,” should match. The filter node can be followed with a clustering or topic node, and then any analysis of this subsetted collection provides you with some potential new ideas for rules.

Figure 8: A search expression to retrieve documents concerned with the visual sentiment attribute.

This particular subsetted collection revealed discussions around costumes and costume designs, as well as the reviewer’s reaction to the theater setting. Neither was an aspect of visual sentiment that we had considered prior to discovering these topics.

At an even finer level, the reports of important terms and phrases (particularly in relation to one another in the concept-linking diagram) provide sentence-level ideas for your rule generation. The diagram in Figure 9 was made in the process of exploring reviewers’ comments on their theater experience. The diagram suggests that the sentiment regarding the music or sound in the movie might be another attribute that could be added to the taxonomy and examined.
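The filtering step described above (a trailing “*” as a wildcard, a quoted term as an exact phrase) can be approximated outside SAS Text Miner with regular expressions. The sketch below is an assumption-laden illustration: compile_query and filter_documents are hypothetical helpers, the sample documents are invented, and the search terms are not the actual expression shown in Figure 8.

```python
import re

def compile_query(terms):
    """Build one regex from search terms: a trailing * acts as a wildcard
    suffix, and a term wrapped in double quotes must match as an exact phrase."""
    parts = []
    for term in terms:
        if term.startswith('"') and term.endswith('"'):
            words = term[1:-1].split()
            parts.append(r"\b" + r"\s+".join(map(re.escape, words)) + r"\b")
        elif term.endswith("*"):
            parts.append(r"\b" + re.escape(term[:-1]) + r"\w*")
        else:
            parts.append(r"\b" + re.escape(term) + r"\b")
    return re.compile("|".join(parts), re.IGNORECASE)

def filter_documents(docs, terms):
    """Keep the documents that match at least one search term."""
    query = compile_query(terms)
    return [d for d in docs if query.search(d)]

# Invented sample documents, for illustration only.
docs = [
    "The special effects were worth the ticket price.",
    "Visually, the costumes stole the show.",
    "A special movie, but the effects of the plot twist fade quickly.",
    "The dialogue dragged on.",
]
subset = filter_documents(docs, ['visual*', '"special effects"'])
```

Here the third document is excluded even though it contains both “special” and “effects,” because the quoted term only matches the two words as an adjacent phrase. The resulting subset could then be passed to whatever clustering or topic step is available.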
Figure 9: A concept link diagram of “music” and “loud.”

Hybrid Approaches

Hybrid approaches involve using a rule-based approach and a data mining approach in combination. In the next sections we will describe two alternative methods. The first method can be used to supplement the features from the traditional data mining model by adding features derived from the linguistic rules that are triggered. The second method shows how to use an ensemble of the results of the two distinct approaches to improve the prediction.

Polarity scores as additional features

One advantage of SAS Text Miner is that it allows additional features associated with the document to be combined with the term features or with the SVD dimensions before training the predictive model. A polarity score is simply a summary score based on a function of the number of times the positive and the negative rules trigger in a document, or in an attribute of a document. These values can be obtained from SAS Sentiment Analysis.
Once obtained, the logistic function can be applied to the ratio of the weighted positive and negative counts so that a document’s polarity score will be between 0 and 1, inclusive. A document with more positive sentiment weight will be assigned a score closer to 1, and a document with more negative sentiment weight will be assigned a score closer to 0. This score is then used in combination with the SVD dimensions. When the document has several attributes that receive a polarity score, each of these scores can be added as a feature to the text mining model. The hybrid model within SAS Sentiment Analysis software also makes use of this approach.

Stacked models

Another hybrid approach is to stack the models. This means that the rule-based and the data mining models are run separately in the first stage; but a second, predictive model is “stacked” after these two models so that the output of the two (a predicted probability for each document from each model) becomes the input into a second-stage model. Stacking is an ensemble method that can improve accuracy if the two first-stage models differ in their predictions. Stacking allows for the two models to potentially correct one another where they differ.

In Figure 10, SAS Text Miner is used to build one sentiment model, while the model import node brings in a model from SAS Sentiment Analysis. The output of the two models is massaged with SAS code, and then goes into the second-stage regression for a final prediction.

Figure 10: Stacking models.
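Both hybrid ideas above can be sketched in a few lines of Python. The exact function used by SAS Sentiment Analysis is not given in this paper, so the polarity score below rests on an assumption: the logistic function is applied to the logarithm of a smoothed positive-to-negative ratio, which keeps the score strictly between 0 and 1 and centers a balanced document at 0.5. The second-stage model is likewise only an illustration: a plain logistic regression fitted by gradient descent on invented first-stage probabilities. All function names and data are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def polarity_score(pos_weight, neg_weight, smoothing=1.0):
    """Map weighted positive/negative rule counts to a score in (0, 1).
    Assumption: logistic of the log of the smoothed ratio."""
    ratio = (pos_weight + smoothing) / (neg_weight + smoothing)
    return logistic(math.log(ratio))

def train_stacker(first_stage_probs, labels, lr=0.5, epochs=2000):
    """Fit a second-stage logistic regression on pairs of first-stage
    probabilities (rule-based, data mining) by batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (p_rule, p_mining), y in zip(first_stage_probs, labels):
            err = logistic(w[0] * p_rule + w[1] * p_mining + b) - y
            grad_w[0] += err * p_rule
            grad_w[1] += err * p_mining
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b

def stacked_predict(model, p_rule, p_mining):
    """Second-stage probability that a document is positive."""
    w, b = model
    return logistic(w[0] * p_rule + w[1] * p_mining + b)

# Invented first-stage outputs: (rule-based probability, data mining probability).
probs = [(0.9, 0.8), (0.8, 0.6), (0.6, 0.9), (0.1, 0.2), (0.2, 0.4), (0.4, 0.1)]
labels = [1, 1, 1, 0, 0, 0]
model = train_stacker(probs, labels)
```

Under the stated assumption, the polarity score reduces algebraically to (pos + s) / (pos + neg + 2s) for smoothing s, which is one simple way to see why more positive weight pushes the score toward 1. In practice the second stage should be trained on held-out first-stage predictions rather than on the same documents used to fit the first-stage models.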
Results

We experimented with the sentiment analysis approaches presented in this paper using the movie review data sets. The Yahoo movie data set was used to analyze sentiment at the attribute level, and the Pang and Lee data set was used for the overall sentiment predictions.

Attribute-Level Results

Table 2 shows the results for the attribute-level sentiment analysis on the Yahoo movie data. The Yahoo data had explicit user ratings for the different attributes, and we compared those ratings with the predictions made by the rule-based model developed with SAS Sentiment Analysis. We spent three days on the rule-development process.

The Yahoo data included some reviews where a user rating was available for a particular attribute, but the attribute itself was not discussed in the text of the review. We did not include such reviews in the evaluation of the attribute. We also did not include the general attribute because no user ratings were available for it. A user rating of C+ or higher was considered positive, and C- or lower was considered negative.

Attribute | Num Reviews | Misclass Rate
Story | 972 | .23
Cast | 1272 | .14
Direction | 243 | .17
Visuals | 459 | .12
Aggregate | 2946 | .18

Table 2: Attribute-level results.

With just three days of effort on rule development, we were able to achieve an overall accuracy of 82 percent at the attribute level (an aggregate misclassification rate of .18). The misclassification rate for the story attribute was higher than for the other attributes. That is an indication to the rule developer to further refine the rules for that attribute. Rule refinement is an ongoing process, and accuracy can improve over time.

Overall Results

Table 3 shows the results of our comparisons of the Pang and Lee data. For the data mining approach, 1,800 random movie reviews were used for training a model, and 200 reviews were held out to be scored. This process was repeated four times, and the misclassification scores were averaged. For each run, the same set of 200 reviews was analyzed in SAS Sentiment Analysis so that the comparisons were made on the same set of data.

ID | Approach | Misclass Rate
1 | SAS Text Miner | .144
2 | SAS Sentiment Analysis Attribute-Level Rules | .252
3 | Add Polarity Scores as Features in SAS Text Miner | .132
4 | Blended | .139

Table 3: Overall sentiment misclassification results.

The results obtained with the text mining model were achieved by using a category-specific weighting and by having enough training data. The SAS Sentiment Analysis overall sentiment model was derived from the rules for the individual attributes. Under these conditions, the rule-based model did not perform as well as the SAS Text Miner model. However, combining the models (by using the polarity scores as features in the SAS Text Miner model, or by blending the two models) did improve results.

Other Applications

Importing Models

SAS Sentiment Analysis can build a hybrid model using rules combined with a Naïve Bayes algorithm. However, to leverage all the predictive analysis advantages of SAS® Enterprise Miner™ software, the models from SAS Sentiment Analysis must be imported into SAS Enterprise Miner. This can be done easily by using the SAS Enterprise Miner model import node. Once the output of SAS Sentiment Analysis is imported, models can be combined in various ways and then compared with the model assessment node. Figure 11 shows the receiver operating characteristic (ROC) plot from the model assessment node after a SAS Sentiment Analysis model was imported.
Figure 11: ROC chart of SAS Enterprise Miner models with an imported SAS Sentiment Analysis model (denoted by model import). In this graph, “TM” denotes SAS Text Miner and “RuleIn” refers to using SAS Sentiment Analysis rules in conjunction with SAS Text Miner.

Creating Training Data

As discussed earlier, training data that has the “answers” is an essential part of a text mining approach. It is needed to build a predictive model that can make accurate sentiment predictions. It is also important for a rule-based system because it lets you validate how well your rules perform. The feedback lets you know if you need to add or remove specific rules, or if you must refine certain rules. Unfortunately, training data is not always available, and creating this data can be expensive and time-consuming.

One approach to creating training data is to use very precise rules that will make a sentiment classification only on the documents you are most sure about. At the risk of not assigning a sentiment category to many of the documents, you do assign sentiment to a small subset of documents.
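A minimal Python sketch of this idea follows, assuming exact-phrase rules and treating any document that triggers no rule, or that triggers conflicting rules, as unlabeled. The phrase lists, the function names and the sample documents are illustrative assumptions rather than the rules used in SAS Sentiment Analysis.

```python
# Illustrative phrase lists (assumptions); precise rules favor exact, unambiguous wording.
POSITIVE_PHRASES = ["i thoroughly enjoyed this movie", "i totally loved the film"]
NEGATIVE_PHRASES = ["i thoroughly hated this movie", "a complete waste of time"]

def precise_label(text):
    """Return 'positive', 'negative', or None when no rule fires or rules conflict."""
    lowered = " ".join(text.lower().split())  # normalize case and whitespace
    pos = any(p in lowered for p in POSITIVE_PHRASES)
    neg = any(p in lowered for p in NEGATIVE_PHRASES)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None

def build_training_set(docs):
    """Keep only the documents that a precise rule labels."""
    labeled = []
    for d in docs:
        label = precise_label(d)
        if label is not None:
            labeled.append((d, label))
    return labeled

# Invented sample documents, for illustration only.
docs = [
    "I thoroughly enjoyed this movie. Great cast.",
    "It was fine, I guess.",
    "Honestly, a complete waste of time.",
]
training = build_training_set(docs)
```

The design trades coverage for precision: most documents stay unlabeled, and the few that are labeled can still be checked by hand before they are used for training.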
We applied this approach to the movie review data by choosing rules that captured complete phrases that seemed, in our opinion, to indicate the overall sentiment. For instance, we included a set of rules that would trigger a positive score for a review that contained phrases like:

“I thoroughly enjoyed this movie.” or
“I totally loved the film.”

When these types of phrases occurred in the document, the polarity was rated positive. Similarly, corresponding precise rules were added for negative polarity. When we applied this approach to our movie review collection, 103 of the 2,000 documents triggered our rules. (While 103 documents are too few for an effective set of training data, with a larger pool of 20,000 reviews we would likely have obtained about 1,000 documents in the training set.)

We still confirmed the polarity by reviewing each of the 103 documents. Since SAS Sentiment Analysis highlights the rules in context, it was quick work to check the 103 documents to ensure that each trigger was appropriate. Based on our manual review, it appeared that eight of the 103 documents were labeled incorrectly, so we corrected the polarity for those so that our training data would be free of errors.

Other Capabilities of SAS® Enterprise Miner™

This paper has primarily focused on combining the rule-based capabilities of SAS Sentiment Analysis with the text mining capabilities of SAS Text Miner, in conjunction with the predictive models available in SAS Enterprise Miner. There is much more functionality in SAS Enterprise Miner that can be used to help you understand the sentiment contained in a collection and to build on the rule models you have developed. Such functionality as sequences and associations, decision trees, SOM-Kohonen self-organizing maps, variable clustering, transformations and sampling, and statistical exploration have all been used in various contexts to supplement textual understanding.
Conclusions

Independently, both the domain knowledge and the data mining approaches to sentiment analysis have their strengths and weaknesses; but hopefully you will not be forced to choose between using one or the other for your analysis. In this paper, we have shown that the two approaches complement one another. So, while the NLP approach leverages the rule builder’s domain knowledge, text mining can also be used by that person to improve, clarify or correct how that knowledge relates to the particular collection being analyzed. Text mining reveals important patterns in the specific collection that assist the rule builder.
On the other hand, the text mining approach allows you to quickly build a sentiment classifier with term frequencies alone. But without any semantic or syntactic indicators, mistakes that would seem elementary to a human can easily occur. We have shown that these linguistic indicators can be captured by a rule-based system and then leveraged in the statistical classifier as additional features, or as a blended model. The end result is a model that is better than either one individually.

References

1 Albright, Russ. Taming Text with the SVD. January 2004. SAS: Cary, NC. Web: http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf.
2 Pang et al. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2002. 79-86.

The authors thank James Cox and Janardhana Punuru from the SAS Text Analytics Research and Development team for their helpful comments and suggestions. They also thank Fiona McNeill from SAS Marketing for encouraging them to work on this paper and providing valuable feedback.
SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit: www.sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2011, SAS Institute Inc. All rights reserved. 105008_S59083.0211