Sentiment Analysis/Opinion Mining of Twitter Data on Unigram/Bigram/Unigram+Bigram Models, using:
1. Machine Learning
2. Lexical Scores
3. Emoticon Scores
YouTube Video: https://youtu.be/VuR16P87yPE
Link to the WebPage: http://akirato.github.io/Twitter-Sentiment-Analysis-Tool
Github Page: https://github.com/Akirato/Twitter-Sentiment-Analysis-Tool
2. Hello!
We are Team 10
Member 1:
Name: Nurendra Choudhary
Roll Number: 201325186
Member 2:
Name: P Yaswanth Satya Vital Varma
Roll Number: 201301064
3. Introduction:
Twitter is a popular microblogging service where users
create status messages (called "tweets").
These tweets sometimes express opinions about different
topics.
Generally, this type of sentiment analysis is useful for
consumers who are trying to research a product or
service, or marketers researching public opinion of their
company.
4. AIM OF THE PROJECT
The purpose of this project is to build
an algorithm that can accurately
classify Twitter messages as positive
or negative, with respect to a query
term.
Our hypothesis is that we can obtain
high accuracy on classifying
sentiment in Twitter messages using
machine learning techniques.
5. 1. Dataset
Details of the dataset used for training the classifier.
6. Sentiment140 Dataset
1,600,000 sentences annotated as positive or negative.
http://help.sentiment140.com/for-students/
8. Steps in Preprocessing:
➜ Case folding of the data (turning everything to lowercase)
➜ Punctuation removal from the data
➜ Expansion of common abbreviations and acronyms
➜ Hashtag removal
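The preprocessing steps above can be sketched as follows. The abbreviation map here is a tiny hypothetical stand-in; the actual list used in the project would be much larger.

```python
import re

# Hypothetical abbreviation map for illustration only.
ABBREVIATIONS = {"afaik": "as far as i know", "gn": "good night"}

def preprocess(tweet):
    """Apply the four steps: case folding, hashtag removal,
    punctuation removal, and abbreviation expansion."""
    text = tweet.lower()                  # case folding
    text = re.sub(r"#\w+", "", text)      # hashtag removal
    text = re.sub(r"[^\w\s]", "", text)   # punctuation removal
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)                # abbreviations expanded

print(preprocess("GN all! #sleepy"))  # -> "good night all"
```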
10. 2.1 Training Distributed Semantic Representation (Word2Vec Model)
➜ We use the Python module gensim.models.word2vec for this.
➜ We train a model using only the sentences (after preprocessing) from the corpus.
➜ This generates vectors for all the words in the corpus.
➜ This model can now be used to get vectors for the words.
➜ For unknown words, we use the vectors of words with frequency one.
11. 2.2 Language Model
Unigram: the word vectors are taken individually to train.
E.g.: "I am not dexter." is taken as [I, am, not, dexter]
Bigram: the word vectors are taken two at a time to train.
E.g.: "I am not dexter." is taken as [(I, am), (am, not), (not, dexter)]
Unigram + Bigram: use unigrams normally, but bigrams when sentiment-reversing words like "not", "no", etc. are present.
E.g.: "I am not dexter." is taken as [I, am, (not, dexter)]
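The three tokenizations can be sketched like this (the negation list is a small illustrative subset):

```python
NEGATIONS = {"not", "no", "never"}  # illustrative subset of reversing words

def unigrams(tokens):
    return list(tokens)

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def unigram_plus_bigram(tokens):
    """Unigrams normally, but pair a sentiment-reversing word
    with the word that follows it."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append((tokens[i], tokens[i + 1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(unigram_plus_bigram(["i", "am", "not", "dexter"]))
# -> ['i', 'am', ('not', 'dexter')]
```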
12. 2.3 Training for Machine Learning Scores
1. Use the various language models and train several two-class classifiers.
2. The classifiers we used are:
a. Support Vector Machines - scikit-learn (Python)
b. Multi-Layer Perceptron Neural Network - scikit-learn (Python)
c. Naive Bayes Classifier - scikit-learn (Python)
d. Decision Tree Classifier - scikit-learn (Python)
e. Random Forest Classifier - scikit-learn (Python)
f. Logistic Regression Classifier - scikit-learn (Python)
g. Recurrent Neural Networks - PyBrain module (Python)
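A sketch of how several of these scikit-learn classifiers could be trained and compared with 5-fold cross-validation. The features here are random stand-in vectors; in the project, each tweet would be represented by its word2vec-derived sentence vector.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # 200 "tweets", 50-dim sentence vectors
y = rng.integers(0, 2, size=200)     # positive (1) / negative (0) labels

classifiers = {
    "SVM": LinearSVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold CV, as in the slides
    print(f"{name}: {scores.mean():.3f}")
```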
13. 2.3.1 Reasons for Using the Classifiers
Logistic Regression:
Logistic regression is a powerful statistical way of modeling a binomial outcome (which takes the value 0 or 1, like having or not having a disease) with one or more explanatory variables.
Naive Bayes Classifier:
Try solving the problem with a simple classifier first.
Multi-Layer Perceptron Neural Network Classifier:
This method has significantly improved results in binary classification compared to classical classifiers.
Recurrent Neural Networks:
This class of neural networks has significantly improved results on various natural language processing problems, hence it was tried too.
14. 2.3.1 Reasons for Using the Classifiers (contd.)
Decision Trees:
Decision trees are very intuitive and easy to explain, and they do not require any assumptions of linearity in the data.
Random Forest:
Decision trees tend to overfit; an ensemble of them gives a much better output on unseen data.
Support Vector Machines:
Many research papers have shown this classifier to give the best results among the classical classifiers.
15. 2.3.2 Accuracies of Various Approaches
(Accuracies are calculated using 5-fold cross-validation)
Classifier                 Unigram   Bigram   Unigram + Bigram
Support Vector Machines    71.1%     -NA-*    74.3%
Naive Bayes Classifier     64.2%     62.8%    65.0%
Logistic Regression        67.4%     72.1%    71.6%
* Takes too much time to train; stopped after 28 hours.
16. 2.3.2 Accuracies of Various Approaches (contd.)
(Accuracies are calculated using 5-fold cross-validation)
Classifier                                   Unigram   Bigram   Unigram + Bigram
Decision Trees                               60.4%     60.0%    61.5%
Random Forest Classifier                     67.1%     70.8%    71.3%
Multi-Perceptron Neural Network Classifier   68.6%     72.7%    74.0%
17. 2.3.2 Accuracies of Various Approaches (contd.)
(Accuracies are calculated using 5-fold cross-validation)
Classifier                  Unigram   Bigram   Unigram + Bigram
Recurrent Neural Networks   69.1%     70.4%    71.5%
18. 2.3.4 Based on the Above Results:
We chose Unigram + Bigram with the Random Forest Classifier to be part of our system, as they gave the best results.
19. Emoticon Scoring
Emoticons play a major role in deciding the sentiment of a sentence, hence emoticon scoring.
20. Emoticon Scoring
1. Search for emoticons in the given text using a regex or find.
2. Use a dictionary to score the emoticons.
3. Use this emoticon score in the model.
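These steps can be sketched as follows. The emoticon dictionary here is a tiny hypothetical sample; the real dictionary would cover far more emoticons.

```python
import re

# Hypothetical emoticon dictionary for illustration only.
EMOTICON_SCORES = {":)": 1, ":-)": 1, ":D": 2, ":(": -1, ":-(": -1}

# One regex matching any dictionary emoticon; longer patterns first
# so ":-)" is matched before ":)".
EMOTICON_RE = re.compile("|".join(
    re.escape(e) for e in sorted(EMOTICON_SCORES, key=len, reverse=True)))

def emoticon_score(text):
    """Sum the dictionary scores of all emoticons found in the text."""
    return sum(EMOTICON_SCORES[m] for m in EMOTICON_RE.findall(text))

print(emoticon_score("great movie :) :D"))   # -> 3
print(emoticon_score("worst day ever :("))   # -> -1
```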
21. Lexical Scoring
(Scoring based on the words of the text)
1. Get the text.
2. Lemmatize the text.
3. Score the lemmatized text using dictionaries.
4. The score will be used in the final system. It is given more weight, as it is more definite.
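A sketch of lexical scoring. A real implementation would use a proper lemmatizer (e.g. NLTK's WordNetLemmatizer) and a full sentiment lexicon; the crude suffix stripper and tiny score dictionary below are stand-ins for illustration.

```python
# Toy sentiment lexicon, for illustration only.
WORD_SCORES = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def lemmatize(word):
    # Toy lemmatizer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lexical_score(text):
    """Score the lemmatized words of the text against the dictionary."""
    return sum(WORD_SCORES.get(lemmatize(w), 0) for w in text.lower().split())

print(lexical_score("Great acting but terrible plot"))  # -> 0
```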
22. Training Classifier and Word2Vec Model
Annotated Training Data -> Preprocessing -> Sentences -> Train Word2Vec Model -> Sentence Vectors -> Train using various classifier algorithms -> Classifier Model
23. The Overall Scoring Process Goes Like This
Sentences -> Unigram/Bigram models -> Classifier Scores
Sentences -> Lexical Scores (multiplied by the weight of lexical scores)
Sentences -> Emoticon Scores (multiplied by the weight of emoticons)
Classifier Scores + weighted Lexical Scores + weighted Emoticon Scores -> Overall Scores
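The combination step can be sketched as a weighted sum. The weight values here are hypothetical; the slides only state that the lexical score gets more weight because it is more definite.

```python
def overall_score(classifier_score, lexical_score, emoticon_score,
                  w_lexical=2.0, w_emoticon=1.0):
    """Combine the three scores; lexical scores are weighted higher
    (w_lexical > w_emoticon) since they are considered more definite."""
    return classifier_score + w_lexical * lexical_score + w_emoticon * emoticon_score

# Classifier mildly positive, lexicon positive, emoticon negative:
print(overall_score(0.3, 1.0, -1.0))  # -> 1.3
```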
24. Challenges in the Approach
➜ Randomness in data: Twitter is written by users, hence it is not very formal.
➜ Emoticons: there are many types of emoticons, with new ones appearing frequently.
➜ Abbreviations: users use a lot of abbreviations and slang, like AFAIK, GN, etc. Capturing all of them is difficult.
➜ Grapheme stretching: emotions expressed through stretching of normal words, like "Please" -> "Pleaaaaaseeeeee".
➜ Reversing words: some words completely reverse the sentiment of another word, e.g. "not good" == opposite("good").
➜ Technical challenges: classifiers take a lot of time to train, hence silly mistakes cost a lot of time.
25. Future Improvements
➜ Handle Grapheme Stretching
➜ Handle authenticity of Data and Users
➜ Handle Sarcasm and Humor
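One possible way to handle grapheme stretching, sketched here as a hypothetical future step (not part of the implemented system): collapse any run of three or more identical characters to a single character.

```python
import re

def collapse_stretching(word):
    """Collapse runs of 3+ identical characters to one character."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

print(collapse_stretching("Pleaaaaaseeeeee"))  # -> "Please"
```

Note a limitation of this naive rule: legitimately stretched double letters inside a run also collapse to one (e.g. "goood" becomes "god", not "good"), so a dictionary lookup after collapsing would likely be needed.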
26. Thanks!
Github Link to the Project:
https://github.com/Akirato/Twitter-Sentiment-Analysis-Tool
Any questions?
You can mail us at:
nurendra.choudhary@research.iiit.ac.in
Or
satyavital.varma@students.iiit.ac.in