Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets #KECSM 2012 #ISWC2012



Andrea Varga, Amparo E. Cano and Fabio Ciravegna

1 Organisations Information and Knowledge (OAK) Research Group, University of Sheffield
2 Knowledge Management Institute (KMI), Open University

KECSM 2012/ISWC 2012

Nov 12, 2012



Outline

1 Motivation

2 State-of-the-art

3 Methodology

4 Results

5 Conclusions and Future Work



Why classify tweets into topics?

Topic classification (TC) of tweets can be important for multiple
applications:

Information Retrieval

Recommendation

Emergency responses, etc.

Topic name                        Example tweets
Disaster&Accident (DisAcc)        happening accident people dying could phone ambulance wakakkaka xd
Entertainment&Culture (EntCult)   google adwords commercial greeeat enjoyed watching greeeeeat day
Politics (Pol)                    quoting military source sk media reports deployed rocket launchers decoys real
Sports (Sports)                   ravens good position games left browns bengals playoffs



What are the challenges in Topic Classification (TC) of Tweets?

Special characteristics of tweets

the restricted size of a post (limited to 140 characters)

the frequent use of misspellings and jargon

the frequent use of abbreviations

the use of non-standard English, reflected in vocabulary and writing style
=> These characteristics pose additional challenges for traditional
supervised machine learning approaches to building
accurate topic classifiers for tweets


Why are Social Knowledge Sources (KS) relevant to Twitter?

Data bottleneck problem: we investigate an alternative approach, inspired
by domain adaptation/transfer learning, that exploits the information in
Social Knowledge Sources (DBpedia and Freebase) for TC of tweets

Commonalities between KSs and Twitter

they are constantly edited by web users

they are social and built in a collaborative manner

they cover a large number of topics

More importantly: KSs contain a large amount of annotated data on a
large number of topics



Research questions

1 Are KSs relevant for topic classification of tweets?

2 Which features make the KSs look more similar to Twitter?

3 How similar or dissimilar are KSs to Twitter? Which similarity measure
best quantifies the lexical changes between KSs and Twitter?



State-of-the-art approaches for TC of Tweets

Using DBpedia for Topic Classification of Tweets:
Wikify (R. Mihalcea and A. Csomai, 2007)
Enriching unstructured text with Wikipedia links (D. Milne and I. H. Witten, 2008)
Tagme (P. Ferragina and U. Scaiella, 2010)
Topical Social Sensor (P. K. P. N. Mendes et al., 2010)
Vector space model (O. Munoz-Garcia et al., 2011)

Using Freebase for Topic Classification of Tweets:
Clustering based approach (S. P. Kasiviswanathan et al., 2011)

Our main contribution:
Understanding the similarity between KSs and Twitter

Exploring multiple KSs (DBpedia + Freebase)

Investigating various statistical metrics for quantifying the
similarity between KSs and Twitter



Methodology followed

1 Collecting Data from KSs

2 Building Cross-Domain (CD) Topic Classifier of Tweets

3 Measuring Distributional Changes Between KSs and Twitter

[Figure: pipeline diagram. KS side (Sc. DB, Sc. FB, Sc. DB-FB): retrieve articles, concept enrichment, build cross-domain classifier. Twitter side: retrieve tweets, concept enrichment, annotate tweets.]


Step 1: Collecting Data from KSs

Twitter corpus collected in Abel et al. (2011): tweets posted between October 2010 and January 2011, annotated with 17 topics
Random selection of 1,000 articles/tweets from DBpedia/Freebase/Twitter for each topic => 9,465 articles from DBpedia; 16,915 articles from Freebase; and 12,412 tweets
Preprocessing: removal of hashtags, mentions and URLs from tweets; taking top-1000 features for each topic

[Figure: pie charts of the multilabel frequency (number of topic labels per document) in DBpedia, Freebase and Twitter]
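The preprocessing described for this step (stripping hashtags, mentions and URLs, then keeping the top-1000 features per topic) could look roughly like the sketch below. The regular expressions, the choice to drop the whole hashtag token rather than only the `#` sign, and ranking features by weight are all assumptions; the slides do not say how these were implemented.

```python
import re

# Assumed patterns; the slides do not specify the exact rules used.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")

def clean_tweet(text):
    """Remove URLs, @mentions and #hashtags, then collapse whitespace."""
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def top_k(weights, k=1000):
    """Return the k features with the highest weight (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in ranked[:k]]

print(clean_tweet("@bob floods in #Queensland http://t.co/abc now"))
print(top_k({"a": 0.1, "b": 0.9, "c": 0.5}, k=2))
```

Here `clean_tweet` and `top_k` are hypothetical helper names introduced only for illustration.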


Step 1: Collecting Data from KSs

Topics: Business_Finance, Disaster_Accident, Education, Entertainment, Environment,
Health, Human Interest, Labor, Law_Crime, Technology_IT,
Religion, Social Issues, Weather, Sports, War_Conflict,
Politics

Retrieval of articles for a given topic (e.g. Politics):
from DBpedia: executing SPARQL queries to retrieve the categories whose names
contain the topic name:
Category:Politics_of_the_United_States
Category:National_Democratic_Party_Egypt_politicians
etc.

from Freebase: accessing the Text Service API for articles belonging to the
topic;
for underspecified topics/domains: consider articles containing the topic in their
titles
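As an illustration of the DBpedia retrieval step, the sketch below only builds a SPARQL query string for categories whose label contains the topic name. It assumes that DBpedia categories are typed as `skos:Concept` and carry an `rdfs:label`; the function name `category_query` and the `limit` parameter are hypothetical, and the actual queries used by the authors are not shown in the slides.

```python
def category_query(topic, limit=100):
    """Build a SPARQL query for DBpedia categories whose label contains `topic`."""
    # Escape backslashes and double quotes so the topic is a valid SPARQL string literal.
    escaped = topic.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?category WHERE {{
  ?category a skos:Concept ;
            rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))
}}
LIMIT {int(limit)}
""".strip()

print(category_query("Politics"))
```

The resulting string would then be sent to a SPARQL endpoint of choice; that part is left out because the endpoint and client used are not stated.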



Step 2: Building Cross-Domain (CD) Topic Classifier of Tweets

Considering two different feature sets:


BoW: tf.idf value of the words present in the examples (articles or tweets)

BoE: tf.idf value of the words and entity+concept pairs present in the examples
(articles or tweets)
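A minimal sketch of the two feature sets, assuming a plain tf.idf variant (relative term frequency times log inverse document frequency) and that entity+concept pairs are simply added as extra tokens. The exact weighting scheme and the entity annotation tool are not given in the slides, and the entity and concept names below are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {feature: tf.idf weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: number of documents containing each feature.
    df = Counter(f for doc in docs for f in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({f: (c / len(doc)) * math.log(n / df[f]) for f, c in tf.items()})
    return weights

def boe_tokens(words, entity_concept_pairs):
    """BoE tokens: the words plus one 'entity+concept' token per annotated pair."""
    return list(words) + [f"{entity}+{concept}" for entity, concept in entity_concept_pairs]

# BoW: words only.
bow_docs = [["ravens", "playoffs", "game"], ["senate", "vote", "game"]]
print(tfidf(bow_docs))

# BoE: words plus hypothetical entity+concept annotations.
boe_docs = [
    boe_tokens(bow_docs[0], [("Baltimore_Ravens", "SportsTeam")]),
    boe_tokens(bow_docs[1], [("United_States_Senate", "Legislature")]),
]
print(tfidf(boe_docs))
```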





Step 3: Measuring Distributional Changes Between KSs and Twitter
Building a vector $\vec{d_s}$ for each source dataset (Sc.DB, Sc.FB, Sc.DB-FB) and a vector $\vec{d_t}$ for the target dataset (Twitter), consisting of the TF-IDF weights for either the BoW or BoE feature set

statistical measures applied:

χ2 test: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed value for a feature, while E is the expected value calculated on the basis of the joint corpus

Kullback-Leibler symmetric distance: $KL(\vec{d_s} \,\|\, \vec{d_t}) = \sum_{f \in F_S \cup F_T} \left(\vec{d_s}(f) - \vec{d_t}(f)\right) \log \frac{\vec{d_s}(f)}{\vec{d_t}(f)}$

cosine similarity: $cosine(\vec{d_s}, \vec{d_t}) = \frac{\sum_{k=1}^{|F_S \cup F_T|} \vec{d_s}(f_k) \times \vec{d_t}(f_k)}{\sqrt{\sum_{k=1}^{|F_S \cup F_T|} (\vec{d_s}(f_k))^2} \times \sqrt{\sum_{k=1}^{|F_S \cup F_T|} (\vec{d_t}(f_k))^2}}$
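The three measures named in this step can be sketched over sparse vectors stored as `{feature: weight}` dictionaries. Two points are assumptions rather than details from the slides: the small `eps` added in the KL computation so that the logarithm is defined for features missing from one dataset, and the way expected values are derived from the joint corpus in the χ2 computation (each feature's joint mass split in proportion to the size of each dataset).

```python
import math

def _features(ds, dt):
    """Union of the source and target feature sets."""
    return sorted(set(ds) | set(dt))

def cosine(ds, dt):
    dot = sum(ds.get(f, 0.0) * dt.get(f, 0.0) for f in _features(ds, dt))
    norm_s = math.sqrt(sum(v * v for v in ds.values()))
    norm_t = math.sqrt(sum(v * v for v in dt.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0

def symmetric_kl(ds, dt, eps=1e-9):
    # eps is an assumed smoothing constant, not taken from the slides.
    total = 0.0
    for f in _features(ds, dt):
        p, q = ds.get(f, 0.0) + eps, dt.get(f, 0.0) + eps
        total += (p - q) * math.log(p / q)
    return total

def chi_square(ds, dt):
    n_s, n_t = sum(ds.values()), sum(dt.values())
    if n_s + n_t == 0:
        return 0.0
    total = 0.0
    for f in _features(ds, dt):
        joint = ds.get(f, 0.0) + dt.get(f, 0.0)
        for observed, n in ((ds.get(f, 0.0), n_s), (dt.get(f, 0.0), n_t)):
            # Expected value under the joint corpus (assumed proportional split).
            expected = joint * n / (n_s + n_t)
            if expected > 0:
                total += (observed - expected) ** 2 / expected
    return total

source = {"politics": 0.4, "senate": 0.3, "vote": 0.3}
target = {"politics": 0.5, "vote": 0.2, "lol": 0.3}
print(cosine(source, target), symmetric_kl(source, target), chi_square(source, target))
```

Note that cosine is a similarity (higher means closer), while χ2 and KL are distances (lower means closer), so they have to be compared with that in mind.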



Experimental setting

1-vs-all approach: an individual CD classifier is built for each topic, using SVM
classifiers and 5-fold cross-validation

Sc-Db, Sc-Fb, Sc-Db-Fb classifiers: trained on the full KS data, evaluated on
20% of the Twitter data (2,482 tweets)

TGT classifier: trained on 80% of the Twitter data, evaluated on 20% of the Twitter
data (2,482 tweets)
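To make the 1-vs-all setup concrete, the sketch below shows how multilabel topic annotations can be turned into one binary target vector per topic, and how F1 is computed for a single topic. The SVM training itself is omitted, since the slides do not name the library or kernel used; the function names and the toy labels are hypothetical.

```python
def one_vs_all_labels(doc_topics, topic):
    """Binary targets for one topic: 1 if the document carries the topic, else 0."""
    return [1 if topic in topics else 0 for topics in doc_topics]

def f1_score(gold, predicted):
    """F1 for the positive class of a single binary (one-topic) classifier."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy multilabel annotations: each tweet may carry several topics.
doc_topics = [{"Pol"}, {"Sports"}, {"Pol", "War"}, {"Weather"}]
gold = one_vs_all_labels(doc_topics, "Pol")
print(gold)
print(f1_score(gold, [1, 1, 0, 0]))
```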



Findings - Classification performance in F1 measure

Q1: Which KS better reflects the lexical variation in Twitter?

F1 (%) per topic and classifier

Topic     Sc.DB(BoE)  Sc.DB(BoW)  Sc.FB(BoE)  Sc.FB(BoW)  Sc.DB-FB(BoE)  Sc.DB-FB(BoW)  TGT(BoW)  TGT(BoE)
BusFi     21.40       21.20       11.50       15.30       18.60          19.10          47.50     46.50
EntCult   15.30       15.10       15.50       16.20       19.50          20.20          42.20     43.10
Religion  23.40       25.10       14.40       14.70       21.00          20.40          58.40     58.50
Health    28.60       30.30       25.60       24.70       26.90          25.40          51.70     51.80
Pol       22.20       21.00       27.80       26.80       24.20          26.80          45.10     44.90
Law       0.90        2.70        16.80       17.80       14.80          13.30          46.80     46.40
HospRecr  1.40        2.30        17.20       19.50       11.30          13.90          41.60     42.60
SocIssue  1.30        2.00        8.80        9.00        9.70           9.10           44.20     44.00
DisAcc    8.30        9.70        14.50       14.50       21.00          21.10          57.50     59.20
TechIT    1.60        2.40        18.40       18.60       12.40          9.90           57.40     58.00
Env       15.20       14.20       2.20        2.20        8.90           8.40           46.60     48.30
HumInt    1.10        1.40        2.00        1.60        1.50           2.20           33.60     34.20
Weather   3.10        7.00        39.80       39.90       36.70          36.00          81.20     81.50
Labor     1.40        1.30        31.90       31.90       30.10          29.90          79.90     79.40
War       9.30        10.90       23.90       23.60       24.60          25.70          67.60     72.70
Sports    10.30       11.70       26.50       26.20       26.20          26.00          60.10     59.20
Edu       37.40       37.80       42.50       47.20       42.50          45.70          71.90     71.30




Sc.Db-FB showed best performance, followed by Sc.Fb and Sc.Db




Q2: Which features make the KSs look more similar to Twitter?

BoW features were found to be better than BoE for the CD classifiers
BoE features were found to be better than BoW for TGT



Findings - Examining the number of annotations needed for the Twitter
classifier to outperform Sc. Db-FB

Investigated the impact of employing Sc. Db-FB classifier over the
Twitter classifier in terms of number of annotations

The performance of the Twitter classifier against the three CD classifiers
over the full learning curve


=> In the absence of any annotated tweets, applying these CD
classifiers is beneficial


Q3: How similar or dissimilar are KSs to Twitter posts, and which
similarity measure best reflects the lexical changes between KSs
and Twitter posts?

Compared χ2, KL-divergence and cosine for each topic

χ2 obtained the best correlation with the performance of the CD classifiers,
achieving scores >70% in 32 cases

cosine obtained correlation scores >70% in 25 cases

KL obtained correlation scores >70% in 24 cases


=> χ2 test is the best measure for quantifying the distributional
differences between KSs and Twitter.
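The slides do not say which correlation coefficient was used to relate each similarity measure to classifier performance. As one possible reading, the sketch below computes a Pearson correlation between a measure's scores and the F1 scores; the input values are invented purely for illustration and are not results from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; 0.0 is a convention here
    return cov / math.sqrt(var_x * var_y)

# Made-up values: a distance score and an F1 score per classifier for one topic.
distance_scores = [0.9, 0.6, 0.4]
f1_scores = [10.0, 25.0, 30.0]
print(pearson(distance_scores, f1_scores))
```

For a distance such as χ2, a strong negative coefficient would indicate that smaller distributional differences go together with better cross-domain performance.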



Conclusions and Future Work

We presented a first study towards understanding the usefulness of KSs in TC of
tweets at various granularities: lexical features (BoW) and entity features (BoE)

Our main findings are:

In the absence of any annotated tweets, applying these CD classifiers is beneficial

Out of the two KSs, Freebase topics seem to be much closer to the Twitter topics than
the DBpedia topics.

The two KSs contain complementary information

For the CD classifiers, on average BoW features were more useful than BoE features

For the Twitter classifiers, on average BoE features were more useful than BoW
features

We found the χ2 test to be the best measure for quantifying the distributional
differences between KSs and Twitter

Our future work will focus on building more accurate TC classifiers and
investigating better measures


Corpus Analysis - Size of vocabulary

[Figure: vocabulary size of the TGT, Sc.DB, Sc.FB and Sc.DB-FB datasets]


Understanding the results - Number of unique entities

Examining the number of entities in the source (Sc. DB, Sc. FB, Sc. DB-FB) and
target (TGT) datasets after pre-processing:
the TGT dataset consists of 1.73 ± 0.35 entities/tweet
the Sc.DB dataset consists of 22.24 ± 1.44 entities/article
the Sc.FB dataset consists of 8.14 ± 5.78 entities/article

[Figure: number of unique entities in the TGT, Sc.DB, Sc.FB and Sc.DB-FB datasets]
