Quantifying Text Sentiment in R

Happy, Sad, Indifferent …
Quantifying Text Sentiment in R

Rajarshi Guha

CT R Users Group

May 2012

Preamble

•  https://github.com/rajarshi/ctrug-tweet
•  Focus is on using R to perform this task
•  Won't comment on validity, rigor, utility, … of sentiment analysis methods
•  Some of the example data is freely available; other parts are available on request

Getting Twitter Data

•  Based on a collaboration with Prof. Debs Ghosh (UConn), studying obesity & social media
•  Accessing Twitter is easy using many languages
–  We obtained tweets via a PHP client running over an extended period of time
–  Ended up with 108,164 tweets
•  Won't focus on accessing Twitter data from R
–  Very straightforward with twitteR

Cleaning Text

•  Load in tweet data, get rid of URLs, HTML escape codes, punctuation, etc.

# Read every column as character; comment.char = '' keeps text after '#'
d <- read.csv('pizza-unique.csv', colClasses = 'character',
              comment.char = '', header = TRUE)
d$geox <- as.numeric(d$geox)
d$geoy <- as.numeric(d$geoy)

# Replace each URL (http up to the next whitespace) with a space
remove.urls <- function(x) gsub("http\\S*", " ", x)
# Strip the HTML escape code for a double quote
remove.html <- function(x) gsub('&quot;', '', x)

d$text <- remove.urls(d$text)
d$text <- remove.html(d$text)
# Protect @ with a placeholder so mentions survive punctuation removal
d$text <- gsub("@", "FOOBAZ", d$text)
d$text <- gsub("[[:punct:]]+", " ", d$text)
d$text <- gsub("FOOBAZ", "@", d$text)
d$text <- gsub("[[:space:]]+", ' ', d$text)
d$text <- tolower(d$text)
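
To see what the cleaning steps do to a single tweet, they can be bundled into one function and applied to a made-up example. This is only a sketch: `clean.tweet` and the sample text are hypothetical, and the URL pattern `http\\S*` and the `&quot;` escape are assumptions about what needs to be removed.

```r
# Hypothetical helper bundling the cleaning steps; not part of the original code
clean.tweet <- function(x) {
  x <- gsub("http\\S*", " ", x)        # drop URLs (assumed pattern)
  x <- gsub("&quot;", "", x)           # drop one HTML escape code
  x <- gsub("@", "FOOBAZ", x)          # protect @ before stripping punctuation
  x <- gsub("[[:punct:]]+", " ", x)    # punctuation becomes a space
  x <- gsub("FOOBAZ", "@", x)          # restore @
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  tolower(x)
}

# Made-up tweet, for illustration only
sample.tweet <- "@Joe LOVED the pizza!!! &quot;Best ever&quot; http://t.co/abc123 #yum"
cleaned <- clean.tweet(sample.tweet)
print(cleaned)
```

The placeholder trick matters because `@` is itself in `[[:punct:]]`; without it, mentions would lose their marker.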

Quantifying Sentiment

•  Based on identifying words with positive or negative connotations
•  Fundamentally based on looking up words from a dictionary
•  If a tweet has more positive words than negative words, the tweet is positive
•  More sophisticated scoring schemes are possible
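
The counting rule above can be sketched in a few lines. The word lists and the name `score.count` are made up for illustration; a real analysis would load a published opinion lexicon instead.

```r
# Tiny made-up word lists, for illustration only
pos.words <- c("love", "great", "good", "happy")
neg.words <- c("hate", "bad", "awful", "sad")

# Score = (number of positive words) - (number of negative words)
score.count <- function(tweet, pos = pos.words, neg = neg.words) {
  words <- strsplit(tweet, "\\s+")[[1]]
  sum(words %in% pos) - sum(words %in% neg)
}

tweets <- c("i love pizza it is so good",
            "i hate cold pizza",
            "pizza for lunch")
scores <- sapply(tweets, score.count, USE.NAMES = FALSE)
print(scores)   # one positive, one negative, one neutral tweet
```

A score above zero marks a tweet as positive, below zero as negative, and exactly zero as neutral.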

Better Dictionaries?

•  SentiWordNet
–  Derived from WordNet, each term is assigned a positivity and negativity score
–  206K terms
–  Converted to simple CSV for easy import into R
•  Ideally, should perform POS tagging

[Figure: proportion (0.0 to 1.0) of negative, neutral, and positive terms for each part of speech: adjective, adverb, noun, verb]
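
Why POS tagging helps can be shown with a toy dictionary in which one term carries different scores depending on its part of speech. Everything here is hypothetical: the column names, the scores, and the `lookup` helper are invented for the sketch and are not taken from the actual CSV.

```r
# Made-up dictionary rows; column names and scores are assumptions
dict <- data.frame(
  POS  = c("a", "n", "v", "a"),
  Term = c("fine", "fine", "fine", "fresh"),
  Pos  = c(0.625, 0.0, 0.0, 0.5),
  Neg  = c(0.0, 0.125, 0.25, 0.0),
  stringsAsFactors = FALSE
)

# Look up a term, optionally restricted to one part of speech
lookup <- function(term, pos = NULL) {
  hits <- dict[dict$Term == term, ]
  if (!is.null(pos)) hits <- hits[hits$POS == pos, ]
  if (nrow(hits) == 0) return(c(Pos = 0, Neg = 0))
  c(Pos = hits$Pos[1], Neg = hits$Neg[1])
}

print(lookup("fine"))        # no POS given: the first matching row wins
print(lookup("fine", "n"))   # restricted to the noun sense
```

Without a POS tag, a lookup can only take the first (or some aggregate) of the matching rows, which is what the untagged approach in these slides does.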

Scoring Tweets

•  Given a scoring function, we can process the tweets
–  Perfect use case for parallel processing
–  Easily switch out the scoring function

library(parallel)  # provides mclapply

swn <- read.csv('sentinet_r.csv', header = TRUE, as.is = TRUE)

# Return the two score columns (3 and 4) of the first row matching w
swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(unlist(tmp[1, c(3, 4)]))
  else return(c(0, 0))
}

# Tweet score: summed column 3 minus summed column 4 over all words
score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind',
                        lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

scores.swn <- mclapply(d$text, score.swn)
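
The point about switching out the scoring function can be illustrated with a small wrapper that takes the scorer as an argument. The two scorers and the name `score.all` are hypothetical stand-ins, and `mc.cores = 1` is used only so the sketch runs anywhere, since forking with more cores is not available on Windows.

```r
library(parallel)

# Two stand-in scoring functions (hypothetical); any function of one tweet works
score.length <- function(tweet) nchar(tweet)
score.words  <- function(tweet) length(strsplit(tweet, "\\s+")[[1]])

# Apply a scoring function to every tweet in parallel
score.all <- function(tweets, scorer, cores = 1) {
  unlist(mclapply(tweets, scorer, mc.cores = cores))
}

tweets <- c("i love pizza", "pizza", "cold pizza again")
print(score.all(tweets, score.words))    # word counts per tweet
print(score.all(tweets, score.length))   # same call, different scorer
```

Because each tweet is scored independently, raising `cores` on a multi-core Unix-like machine needs no other change.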

Profiling Makes Me Happy

•  6052 sec with 24 cores using score.swn
•  Rprof() is a good way to identify bottlenecks*
•  461 sec with 24 cores using score.swn.2

# Original version: one subset() scan of the dictionary per word
swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(unlist(tmp[1, c(3, 4)]))
  else return(c(0, 0))
}

score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind',
                        lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

# Faster version: a single vectorized match() for all words in the tweet
score.swn.2 <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  rows <- match(words, swn$Term)
  rows <- rows[!is.na(rows)]
  cs <- colSums(swn[rows, c(3, 4)])
  return(cs[1] - cs[2])
}

* overkill for this example
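
The effect of replacing per-word `subset()` calls with one `match()` call can be reproduced on synthetic data. This is a sketch under stated assumptions: `swn.demo` is a made-up dictionary with invented column names, `slow` and `fast` are hypothetical names, and `system.time()` is used instead of `Rprof()` only to keep the example short.

```r
# Small synthetic dictionary (made-up terms and scores) standing in for swn
set.seed(1)
n <- 5000
swn.demo <- data.frame(
  POS  = "n",
  ID   = seq_len(n),
  Pos  = runif(n),
  Neg  = runif(n),
  Term = paste0("w", seq_len(n)),
  stringsAsFactors = FALSE
)
words <- sample(swn.demo$Term, 200)

# Row-by-row lookup: one full scan of the dictionary per word
slow <- function(ws) {
  rows <- lapply(ws, function(w) subset(swn.demo, Term == w)[1, c(3, 4)])
  colSums(do.call("rbind", rows))
}

# Vectorized lookup: a single call to match() for all words
fast <- function(ws) {
  rows <- match(ws, swn.demo$Term)
  colSums(swn.demo[rows[!is.na(rows)], c(3, 4)])
}

print(system.time(slow(words)))
print(system.time(fast(words)))
print(all.equal(slow(words), fast(words)))   # both give the same column sums
```

Timings depend on the machine, but the two functions should agree on the result, which is the property worth checking before swapping one for the other.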

Looking at the Scores

•  Bulk of the tweets are neutral
•  Similar behavior from either scoring function

[Figure: density of Sentiment Scores (x-axis, -6 to 6) by Method (SWN, Breen)]

library(ggplot2)

d$swn <- unlist(scores.swn)
d$breen <- unlist(scores.breen)
# Stack both sets of scores in long format, labeled by method
tmp <- rbind(data.frame(Method = 'SWN', Scores = d$swn),
             data.frame(Method = 'Breen', Scores = d$breen))
ggplot(tmp, aes(x = Scores, fill = Method)) +
  geom_density(alpha = 0.25) +
  xlab("Sentiment Scores")
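Sentiment & Time of Day

•  Group tweets by hour and evaluate how proportions of positive, negative, etc. vary

library(reshape2)  # melt(); the older reshape package spells the argument variable_name
library(ggplot2)

tmp <- d
tmp$hour <- strptime(d$time, format = '%a, %d %b %Y %H:%M')$hour
tmp <- subset(tmp, !is.na(swn))
# Fixed factor levels keep all three columns in every hourly table
tmp$status <- factor(sapply(tmp$swn, function(x) {
  if (x > 0) return("Positive")
  else if (x < 0) return("Negative")
  else return("Neutral")
}), levels = c("Negative", "Neutral", "Positive"))
tmp <- data.frame(do.call('rbind',
                          by(tmp, tmp$hour, function(x) table(x$status))))
tmp$Hour <- factor(rownames(tmp), levels = 0:23)
tmp <- melt(tmp, id.vars = 'Hour', variable.name = 'Sentiment')
# stat = 'identity' uses the precomputed counts as bar heights
ggplot(tmp, aes(x = Hour, y = value, fill = Sentiment)) +
  geom_bar(stat = 'identity', position = 'fill') +
  xlab("") + ylab("Proportion")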
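
The hourly proportions behind the stacked bars can also be computed directly with `table()` and `prop.table()`, which is a handy way to check the numbers before plotting. The data frame below is made up; its `hour` and `swn` columns stand in for the real ones.

```r
# Made-up hours and scores standing in for the hour and swn columns
demo <- data.frame(
  hour = c(0, 0, 0, 1, 1, 2, 2, 2, 2),
  swn  = c(0.5, -0.25, 0, 0, 0, 1, 1, -1, 0)
)

# sign() maps each score to -1, 0, or 1, which is then labeled
demo$status <- factor(sign(demo$swn), levels = c(-1, 0, 1),
                      labels = c("Negative", "Neutral", "Positive"))

counts <- table(Hour = demo$hour, Sentiment = demo$status)
props  <- prop.table(counts, margin = 1)   # each row (hour) sums to 1
print(round(props, 2))
```

Each row of `props` corresponds to one bar in a `position = 'fill'` plot.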

[Figure: stacked bars showing the proportion (0.0 to 1.0) of Negative, Neutral, and Positive tweets for each hour of the day (0 to 23)]

Contradictions?

•  Tweets that are negative according to one score but positive according to another

subset(d, swn < -2 & breen > 1)

"i m trying to get some legit food right now like pizza or chicken not this shitty ass school lunch"

"24 i like reading 25 i hate hopsin 26 i love chips salsa 27 i love chevys 28 i was a thug in middle school 29 i love pizza"

"@naturesempwm had a raw pizza 4 lunch today but i was not impressed with the dried out not fresh vegetable spring roll i bought threw out"

Sentiment and Geography

•  What's the spatial distribution of tweet sentiment?
•  Extract tweets located in the CONUS (~500)
•  Visualize the direction and strength of sentiments
•  Correlate with other socio-economic factors?

[Figure: tweets in the CONUS plotted by location, with legends for swn (-1 to 2) and abs(swn) (0.0 to 2.0)]
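
One simple way to extract tweets in the CONUS is a bounding-box filter on the coordinates. This is a rough sketch, not the method used for the slides: the coordinates are made up, treating `geox` as longitude and `geoy` as latitude is an assumption, and the box is approximate enough to include parts of Canada and Mexico.

```r
# Made-up coordinates; geox = longitude and geoy = latitude is an assumption
pts <- data.frame(
  geox = c(-72.7, -0.1, -118.2, 139.7),
  geoy = c(41.8, 51.5, 34.1, 35.7),
  swn  = c(0.5, -1.0, 1.5, 0.0)
)

# Approximate bounding box for the contiguous US
in.conus <- with(pts, !is.na(geox) & !is.na(geoy) &
                      geox >= -125 & geox <= -66 &
                      geoy >= 24 & geoy <= 50)
conus <- pts[in.conus, ]

# Direction is the sign of swn; strength is its absolute value
conus$strength <- abs(conus$swn)
print(conus)
```

A polygon-based test against actual state boundaries would be more accurate than a box, at the cost of an extra spatial package.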

Other Considerations

•  Should take into account negation
–  Scan for negation terms and adjust score appropriately
•  Oblivious to sarcasm
•  Sentiment scores should probably be modified by context
•  Lots of M/L opportunities
–  Spatial analysis
–  Topic modeling / clustering
–  Predictive models
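
The negation idea mentioned above can be sketched as flipping the sign of a word's score when the preceding word is a negation term. This is one possible adjustment among many: the word scores, the negation list, and the name `score.negated` are all made up for illustration.

```r
# Made-up word scores and negation terms, for illustration only
word.scores <- c(good = 1, impressed = 1, fresh = 1, bad = -1, awful = -1)
negations   <- c("not", "no", "never")

# Flip a word's score when the word immediately before it is a negation term
score.negated <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  if (length(words) == 0) return(0)
  s <- word.scores[words]            # NA for words missing from the dictionary
  s[is.na(s)] <- 0
  flip <- c(FALSE, head(words, -1) %in% negations)
  sum(ifelse(flip, -s, s))
}

print(score.negated("the pizza was good"))    # plain positive word
print(score.negated("i was not impressed"))   # negated positive word
```

A one-word window is crude; it misses cases such as "not very good", where the negation and the scored word are separated.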
