Sentiment Analysis
1. Thumbs up? Sentiment Classification
using Machine Learning Techniques
- Bo Pang and Lillian Lee
- Shivakumar Vaithyanathan
2. What is it?
• Input – raw text on some topic
• Output – opinion (+ve, -ve, or neutral)
• It is hard – why?
- it must determine the opinion of the overall
text, not just the subject of the topic
- let's understand the problem
3. We know …
• Web – enormous amount of data
• Topical categorization – active research
4. Rise of blogs, forums …
• Web 2.0 is commonly associated with web
applications that facilitate interactive information
sharing, interoperability, user-centered
design, and collaboration on the World Wide
Web – (source : Wikipedia)
5. Why is it interesting?
• Represents the voice of a broader audience
on a particular topic
• Example : product reviews, movie reviews,
book reviews
• Important to business intelligence applications
- What do people (dis)like about the Nikon D40?
6. What this paper does
• Examines the effectiveness of applying
machine learning techniques to sentiment
classification problem
• Challenging – while topics are identifiable by
keywords alone, sentiment can be expressed
in a more subtle manner.
7. Dataset : Movie-Review Domain
Reasons:
– Large online collections of reviews
– Reviewers often summarize their overall sentiment
with a machine-extractable rating indicator, so no
hand-labeling is needed for supervised learning
Corpus of 752 -ve and 1301 +ve reviews, with a total
of 144 reviewers represented
8. Naïve approach
• Idea: people tend to use certain words to
express strong sentiments; produce such a
list and rely on it to classify text
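This word-list idea can be sketched in a few lines of Python. The word lists and function name below are illustrative, not the paper's:

```python
# Illustrative sentiment word lists (not from the paper)
POSITIVE = {"love", "great", "wonderful", "best"}
NEGATIVE = {"hate", "terrible", "worst", "boring"}

def naive_classify(text):
    """Classify a text by counting positive vs. negative words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "+ve"
    if neg > pos:
        return "-ve"
    return "neutral"
```

As the paper shows, such hand-built lists perform noticeably worse than lists chosen with learning, which motivates the machine-learning methods that follow.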
9. Machine Learning methods
• Let {f1, f2, …, fm} be a predefined set of m
features that can appear in a document.
Example: “still” or the bigram “really stinks”
• ni(d) – number of times fi occurs in document
d
• Document vector(d) = (n1(d), n2(d), …, nm(d))
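The feature counts ni(d) above can be sketched as follows. Whitespace tokenization and the feature list are simplifying assumptions for illustration:

```python
def document_vector(doc, features):
    """Return (n1(d), ..., nm(d)): how often each feature occurs in doc.

    Features may be unigrams ("still") or bigrams ("really stinks").
    """
    tokens = doc.lower().split()
    counts = []
    for f in features:
        if " " in f:  # bigram feature: count adjacent token pairs
            first, second = f.split()
            c = sum(1 for i in range(len(tokens) - 1)
                    if tokens[i] == first and tokens[i + 1] == second)
        else:         # unigram feature
            c = tokens.count(f)
        counts.append(c)
    return counts
```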
11. Maximum Entropy
• The idea is to make the fewest assumptions
about the data while still being consistent
with it
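The idea on this slide has a standard closed form. A common way to write the maximum-entropy estimate, using the ni(d) features defined earlier (the symbols λ, F, and Z are conventional MaxEnt notation, not defined on the slides):

```latex
P_{\mathrm{ME}}(c \mid d) := \frac{1}{Z(d)}
  \exp\!\Big( \sum_{i} \lambda_{i,c} \, F_{i,c}(d, c) \Big)
```

where the F_{i,c} are binary feature/class functions (active when feature fi appears in d and the class is c), the λ_{i,c} are weights fit to the training data, and Z(d) normalizes the probabilities over classes.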
12. Support Vector Machines(SVM)
• Large-margin, non-probabilistic classifiers,
in contrast to Naïve Bayes and Maximum
Entropy
• Letting cj ∈ {1, −1} (corresponding to +ve,
−ve) be the correct class of document dj, …
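The sentence above can be completed with the standard dual-form expression for the learned hyperplane (the coefficients αj are the usual SVM dual variables, not defined on the slides):

```latex
\vec{w} := \sum_{j} \alpha_{j} \, c_{j} \, \vec{d}_{j},
\qquad \alpha_{j} \ge 0
```

Documents with αj > 0 are the support vectors; a test document is classified by which side of w's separating hyperplane its vector falls on.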
13. Evaluations
• Randomly selected 700 positive, 700 negative
sentiment documents
• Automatically removed rating indicators,
extracted textual information from original
HTML
• Added NOT_ to every word between a
negation word (“not”, “isn’t”) and the first
punctuation mark that follows it
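The negation-tagging step can be sketched as follows. The negation list and punctuation set here are illustrative, not the paper's exact ones:

```python
import re

NEGATIONS = {"not", "isn't", "didn't", "no", "never"}  # illustrative list
PUNCTUATION = ".,!?;"

def add_not_tags(text):
    """Prefix NOT_ to every word between a negation word and the
    first punctuation mark that follows it."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False       # punctuation ends the negated span
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS:
                negating = True    # start tagging following words
    return " ".join(out)
```

For example, "didn't like this movie," becomes "didn't NOT_like NOT_this NOT_movie ,", so the negated "like" is a different feature from the plain one.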
15. Conclusion
• Unigram presence information turned out to
be the most effective
• The superiority of presence information in
comparison to feature frequency indicates a
difference between sentiment and topic
categorization.
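The presence-vs-frequency distinction amounts to clamping each count ni(d) to 0/1. A sketch, restricted to unigram features for brevity (function name illustrative):

```python
def presence_vector(doc, features):
    """Binary presence features: 1 if the feature appears in doc, else 0.

    Contrast with frequency features, which record how often it appears;
    the paper found presence more effective for sentiment.
    """
    tokens = set(doc.lower().split())
    return [1 if f in tokens else 0 for f in features]
```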