Sentiment Analysis
1. Thumbs up? Sentiment Classification
using Machine Learning Techniques
- Bo Pang and Lillian Lee
- Shivakumar Vaithyanathan
2. What is it?
• Input – raw text on some topic
• Output – opinion (+ve, -ve, or neutral)
• It is hard – why?
- it must determine the opinion of the overall
text, not just the subject of the topic
- let's understand the problem
3. We know …
• Web – enormous amount of data
• Topical categorization – active research
4. Rise of blogs, forums …
• Web 2.0 is commonly associated with web
applications that facilitate interactive information
sharing, interoperability, user-centered
design, and collaboration on the World Wide
Web – (source : Wikipedia)
5. Why is it interesting?
• Represents the voice of a broader audience
on a particular topic
• Example : product reviews, movie reviews,
book reviews
• Important to business intelligence applications
- What do people (dis)like about the Nikon D40?
6. What this paper does
• Examines the effectiveness of applying
machine learning techniques to sentiment
classification problem
• Challenging – while topics are identifiable by
keywords alone, sentiment can be expressed
in a more subtle manner.
7. Dataset : Movie-Review Domain
Reasons:
– Large online collections of reviews
– Reviewers often summarize their overall sentiment
with a machine-extractable rating indicator, so no
hand-labeling is needed for supervised learning
Corpus of 752 -ve and 1301 +ve reviews, with a total
of 144 reviewers represented
8. Naïve approach
• Idea: people tend to use certain words to
express strong sentiments; produce such a
list and rely on it to classify text
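This word-list idea can be sketched in a few lines of Python. The word lists and function name below are illustrative, not the paper's:

```python
# Illustrative sentiment word lists (not from the paper)
POSITIVE = {"love", "great", "wonderful", "best"}
NEGATIVE = {"hate", "terrible", "worst", "boring"}

def naive_classify(text):
    """Classify a text by counting positive vs. negative words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "+ve"
    if neg > pos:
        return "-ve"
    return "neutral"
```

As the paper shows, such hand-built lists perform noticeably worse than lists chosen with learning, which motivates the machine-learning methods that follow.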
9. Machine Learning methods
• Let {f1, f2, …, fm} be a predefined set of m
features that can appear in a document.
Example: “still” or the bigram “really stinks”
• ni(d) – number of times fi occurs in document
d
• Document vector(d) = (n1(d), n2(d), …, nm(d))
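The feature counts ni(d) above can be sketched as follows. Whitespace tokenization and the feature list are simplifying assumptions for illustration:

```python
def document_vector(doc, features):
    """Return (n1(d), ..., nm(d)): how often each feature occurs in doc.

    Features may be unigrams ("still") or bigrams ("really stinks").
    """
    tokens = doc.lower().split()
    counts = []
    for f in features:
        if " " in f:  # bigram feature: count adjacent token pairs
            first, second = f.split()
            c = sum(1 for i in range(len(tokens) - 1)
                    if tokens[i] == first and tokens[i + 1] == second)
        else:         # unigram feature
            c = tokens.count(f)
        counts.append(c)
    return counts
```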
11. Maximum Entropy
• The idea is to make the fewest assumptions
about the data while still being consistent
with it
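The idea on this slide has a standard closed form. A common way to write the maximum-entropy estimate, using the ni(d) features defined earlier (the symbols λ, F, and Z are conventional MaxEnt notation, not defined on the slides):

```latex
P_{\mathrm{ME}}(c \mid d) := \frac{1}{Z(d)}
  \exp\!\Big( \sum_{i} \lambda_{i,c} \, F_{i,c}(d, c) \Big)
```

where the F_{i,c} are binary feature/class functions (active when feature fi appears in d and the class is c), the λ_{i,c} are weights fit to the training data, and Z(d) normalizes the probabilities over classes.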
12. Support Vector Machines(SVM)
• Large-margin, non-probabilistic classifiers,
in contrast to Naïve Bayes and Maximum
Entropy
• Letting cj ∈ {1, −1} (corresponding to +ve,
−ve) be the correct class of document dj, …
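The sentence above can be completed with the standard dual-form expression for the learned hyperplane (the coefficients αj are the usual SVM dual variables, not defined on the slides):

```latex
\vec{w} := \sum_{j} \alpha_{j} \, c_{j} \, \vec{d}_{j},
\qquad \alpha_{j} \ge 0
```

Documents with αj > 0 are the support vectors; a test document is classified by which side of w's separating hyperplane its vector falls on.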
13. Evaluations
• Randomly selected 700 positive, 700 negative
sentiment documents
• Automatically removed rating indicators,
extracted textual information from original
HTML
• Added NOT_ to every word between a
negation word (“not”, “isn’t”) and the first
punctuation mark that follows it
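The negation-tagging step can be sketched as follows. The negation list and punctuation set here are illustrative, not the paper's exact ones:

```python
import re

NEGATIONS = {"not", "isn't", "didn't", "no", "never"}  # illustrative list
PUNCTUATION = ".,!?;"

def add_not_tags(text):
    """Prefix NOT_ to every word between a negation word and the
    first punctuation mark that follows it."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False       # punctuation ends the negated span
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS:
                negating = True    # start tagging following words
    return " ".join(out)
```

For example, "didn't like this movie," becomes "didn't NOT_like NOT_this NOT_movie ,", so the negated "like" is a different feature from the plain one.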
15. Conclusion
• Unigram presence information turned out to
be the most effective
• The superiority of presence information in
comparison to feature frequency indicates a
difference between sentiment and topic
categorization.
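The presence-vs-frequency distinction amounts to clamping each count ni(d) to 0/1. A sketch, restricted to unigram features for brevity (function name illustrative):

```python
def presence_vector(doc, features):
    """Binary presence features: 1 if the feature appears in doc, else 0.

    Contrast with frequency features, which record how often it appears;
    the paper found presence more effective for sentiment.
    """
    tokens = set(doc.lower().split())
    return [1 if f in tokens else 0 for f in features]
```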