This document provides an overview of sentiment analysis and describes a project to create a world sentiment indicator. It discusses how sentiment analysis works, including feature extraction and machine learning classifiers. It also describes building training corpora and testing accuracy. A key part is the Splunk sentiment analysis app, which performs analysis on tweets. The world sentiment indicator project aims to analyze news headlines using sentiment analysis tools and visualize the results. Accuracy depends heavily on the quality and size of the training corpus and on how closely it matches the data being analyzed.
5. Sentiment Analysis
Is the process of examining text or speech to find out
the opinions, views or feelings of the author or speaker
This definition applies to a computer system
When a human does this, it's called reading
The words in the title describe highly subjective and
ambiguous concepts for a human
Even more challenging for a computer program
Opinions, Views, Beliefs, Convictions
6. Words or expressions have different meanings
depending on the knowledge domain (domain of
expertise)
Example: "go around" (in aviation, an aborted landing; in everyday speech, a detour)
Sarcasm, jokes, etc.
Domains of expertise usually have slang
Conclusion:
Sentiment is contextual and domain dependent
Opinions, Views, Beliefs, Convictions
7. Analysis tends to be done by
Domain of expertise
Media channel
Newspaper articles follow grammar rules, use proper words,
and contain no spelling mistakes
Tweets lack sentence structure, likely use slang, include
emoticons ( :-) , :-( ), and sometimes lengthen words
("I looooooove chocolate")
Sentiment Analysis
8. Companies want to know what their
Customers
Competitors
General public
Think about their
Products
Services
Brands
Usually associated with marketing and public relations
Commercial Uses
9. When done correctly, sentiment analysis is powerful
"From Tweets to Polls: Linking Text Sentiment to Public
Opinion Time Series", O'Connor et al., 2010
Sentiment word frequencies in Twitter correlate with
surveys on consumer confidence and political opinion
by as much as 80%
"These results highlight the potential of text streams as a
substitute and supplement for traditional polling."
Commercial Uses
10. When not done well
"The Hathaway Effect: How Anne Gives Warren Buffett a
Rise", Dan Mirvish, Huffington Post, 2011
Suspicions that some robotic trading programs on Wall
Street include sentiment analysis
Every time Anne Hathaway makes the headlines, the
stock of Warren Buffett's company Berkshire Hathaway
goes up
Commercial Uses
12. Sentiment analysis is a form of text categorization
Its results fall into two categories
Polarity
Positive, negative, neutral
Range of polarity
Ratings or rankings
Example: 1 to 5 stars for movie reviews
The Technical Side
13. Extracting and categorizing sentiment is based on features
Frequency: the words that appear most often decide the polarity
Term presence: a word counts once no matter how often it appears;
the most distinctive words define polarity
N-grams: the position of a word within a sequence determines polarity
Parts of speech: adjectives define the polarity
Syntax: attempts to analyze syntactic relations haven't been very
successful
Negation: explicit negation terms reverse polarity
Text classifiers tend to use combinations of features (sketch below)
The Technical Side
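To make these features concrete, below is a minimal Python sketch of two of them, term presence and explicit negation. The function names, the negation word list, and the whitespace tokenization are illustrative assumptions, not taken from any particular tool.

# Term presence and negation feature extractors (illustrative sketch).
NEGATION_TERMS = {"not", "no", "never", "n't"}

def term_presence_features(tokens):
    # Term presence: each word contributes one binary feature,
    # regardless of how many times it appears.
    return {"has(" + word + ")": True for word in set(tokens)}

def negation_features(tokens):
    # Negation: words that follow an explicit negation term are marked,
    # so "not good" yields a different feature than "good".
    features = {}
    negated = False
    for word in tokens:
        if word in NEGATION_TERMS:
            negated = True
            continue
        prefix = "NOT_" if negated else ""
        features["has(" + prefix + word + ")"] = True
    return features

print(negation_features("the movie was not good".split()))
# {'has(the)': True, 'has(movie)': True, 'has(was)': True, 'has(NOT_good)': True}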
14. To assign contextual polarity, you need a base
polarity
Use a lexicon, which provides a polarity for each word
(scoring sketch below)
Word → Phrase → Sentence → Document
Use training documents
Preferred
The Technical Side
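A minimal sketch of the lexicon approach, aggregating word polarities up to a sentence-level score; the four-entry lexicon is made up for illustration, and real lexicons contain thousands of scored words.

# Lexicon-based polarity: average the scores of known words (sketch).
LEXICON = {"love": 1.0, "great": 0.8, "bad": -0.8, "hate": -1.0}

def sentence_polarity(sentence):
    scores = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    if not scores:
        return 0.0  # no lexicon words found: treat the sentence as neutral
    return sum(scores) / len(scores)

print(sentence_polarity("I love this great phone"))  # 0.9 -> positive
print(sentence_polarity("I hate waiting"))           # -1.0 -> negative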
15. Training documents
Contain a number of sentences
Are classified with a specific polarity
The polarity of each word is based on a combination of
feature extractors and its appearances under the different
classifications
The more sentences, the more accurate the model
The results are saved as a model (training sketch below)
The Technical Side
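A minimal sketch of this training step with NLTK's Naïve Bayes classifier, the same classifier family the Splunk app is built on. The four labeled sentences and the term-presence feature function are illustrative; a real corpus contains thousands of examples.

import nltk

# Tiny made-up training corpus: (sentence, polarity) pairs.
train = [
    ("I love this phone", "positive"),
    ("what a great movie", "positive"),
    ("this is terrible", "negative"),
    ("I hate mondays", "negative"),
]

def features(text):
    # Term-presence features over lowercased tokens.
    return {word: True for word in text.lower().split()}

# Training produces the model that later classifies new text.
model = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)
print(model.classify(features("I hate this terrible phone")))  # negative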
16. Machine learning tools
Naïve Bayes classifier
Generally uses n-grams, frequency, and term presence; sometimes
parts of speech
Maximum entropy
Naïve Bayes assumes each feature is independent; maximum entropy does not
Allows features to overlap, such as a word and a bigram that contains it
Support vector machines
Each document is represented as a vector of features
Linear, polynomial, sigmoid, and other kernel functions are applied to the
vectors (comparison sketch below)
The Technical Side
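A minimal sketch comparing the three families with scikit-learn, using logistic regression as the maximum entropy model; the four training texts are made up, and with real corpora all three tend to reach similar accuracy (see the conclusions).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["I love it", "great stuff", "utterly terrible", "I hate it"]
labels = ["positive", "positive", "negative", "negative"]

# Unigram and bigram term-presence features, as described above.
X = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(texts)

for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    clf.fit(X, labels)  # train each classifier on the (tiny) corpus
    print(type(clf).__name__, clf.score(X, labels))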
19. Based on the Naïve Bayes classifier
Has three commands:
sentiment
language
token
Includes a training/testing program and two models
Twitter: 190,862 positive and 37,469 negative tweets
IMDb
Range of polarity from 1 to 10
Each ranking has 11 movie reviews, averaging 200 words
The Splunk Sentiment Analysis App
20. index=twitter lang=en
| where like(text, "%love%")
| sentiment twitter text
| stats avg(sentiment)
The Splunk Sentiment Analysis App
22. index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
23. index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
So that we don't have to type
entities.hashtags{}.text every time we
want to refer to a hashtag, rename this
multi-value field to hashtags
24. index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
We only want the fields that contain the
tweet and the hashtags
25. index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
Expand the values of this multi-value
field into separate Splunk events
26. index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
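Keep only the tweets tagged with the Beliebers
hashtag, score each one with the Twitter
sentiment model, and average the scores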
27. The training corpus is key to accuracy
Beware: Naïve Bayes is not an exact algorithm
The best accuracy obtained using Naïve Bayes is
approximately 83% (measuring accuracy is sketched below)
Key factors to increase accuracy
Similarity to the data being analyzed
Size of the corpus
Training and Testing Data
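A minimal sketch of how figures like those in the following table can be produced: classify held-out test data, compute accuracy, and report a normal-approximation 95% margin of error. The model and features function are assumptions carried over from the training sketch shown earlier.

import math

def accuracy_with_margin(model, test_set, features):
    # test_set is a list of (text, label) pairs held out from training.
    correct = sum(
        1 for text, label in test_set
        if model.classify(features(text)) == label
    )
    p = correct / len(test_set)
    # 95% margin of error for a proportion (normal approximation).
    margin = 1.96 * math.sqrt(p * (1 - p) / len(test_set))
    return p, margin

# Example use: p, m = accuracy_with_margin(model, test, features)
# print("{:.2%} +/- {:.2%}".format(p, m))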
28. Training and Testing Data
Test Data                Size                Accuracy   Margin of Error
University of Michigan   1.5 million tweets  72.49%     1.05%
Splunk                   228,000 tweets      68.79%     1.12%
Sanders                  5,500 tweets        60.61%     0.76%
31. Based on news headlines
From news websites all around the world
Collected from RSS feeds in English
The World Sentiment Indicator
32. Steps for this project
1. Collect the RSS feeds (sketch below)
2. Index the headlines into Splunk
3. Define the sentiment corpus
4. Create a visualization of the results
The World Sentiment Indicator
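A minimal sketch of step 1 using the Python feedparser library; the feed URLs are placeholders, and in practice each headline would be indexed into Splunk rather than printed.

import feedparser

FEEDS = [
    "https://example.com/world/rss",  # placeholder feed URLs
    "https://example.org/news/rss",
]

for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        # Each headline becomes one event to index into Splunk.
        print(entry.get("published", ""), entry.title)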
34. Create your own
Crowd-source
University of Michigan ‒ Kaggle competition
Bootstrap (sketch below)
"Twitter Sentiment Classification Using Distant Supervision", Go et al.,
2009
Uses emoticons to classify tweets
Accuracy for unigrams and bigrams:
Naïve Bayes 82.7%
Maximum entropy 82.7%
Support vector machine 81.6%
Training Corpus Creation
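A minimal sketch of the bootstrapping idea from Go et al.: emoticons serve as noisy polarity labels and are then stripped from the text so the classifier cannot simply memorize them. The emoticon lists and the example tweets are illustrative.

POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")

def distant_label(tweet):
    # Use emoticons as noisy labels, then remove them from the text.
    if any(e in tweet for e in POSITIVE):
        label = "positive"
    elif any(e in tweet for e in NEGATIVE):
        label = "negative"
    else:
        return None  # no emoticon: this tweet cannot be auto-labeled
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    return tweet.strip(), label

print(distant_label("exam went great :)"))       # ('exam went great', 'positive')
print(distant_label("flight delayed again :("))  # ('flight delayed again', 'negative')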
35. Issues with subjectivity: the same event can be worded
neutrally or negatively
"Pope Benedict XVI announces resignation"
"Pope too frail to carry on"
"Pope steps down as head of Catholic church"
"Pope quits for health reasons"
Average RSS headline: 47.8 characters, 7.6 words
Average tweet: 78 characters, 14 words
Training Corpus Considerations
36. Create a special corpus based on news headlines
Version 1: 100 positive, 100 negative, 100 neutral
Version 2: 200 positive, 200 negative, 200 neutral
Use an existing Twitter corpus
The one included with the Splunk app
University of Michigan
Use a movie review corpus
Pang & Lee: 1,000 positive, 1,000 negative
Training Corpus Strategy
37. Training Corpus Accuracy
Training Corpus   Size                 Accuracy   Margin of Error
Headlines V1      300 headlines        38.89%     1.02%
Headlines V2      600 headlines        47.22%     1.05%
Splunk Twitter    228,000 tweets       40.80%     1.16%
U of Michigan     1.5 million tweets   43.81%     1.11%
Movie Reviews     2,000 reviews        36.79%     1.23%
39. The key to accuracy is the quality of the training data
Train with the same kind of data you will analyze
A larger training corpus improves accuracy
The subjectivity of crowd-sourced labels tends to even out as the
amount of training data increases
All machine learning tools tend to converge to similar
levels of accuracy
Use whichever one is easiest for you
Conclusions