Social streams propagate information rapidly, making them an up-to-date channel of communication that can reveal events happening in the world. However, identifying the topics of short messages (e.g. tweets) distributed on these streams poses new challenges for the development of accurate classification algorithms.
To alleviate this problem, we study for the first time a transfer learning setting that makes use of two frequently updated social knowledge sources (KSs), DBpedia and Freebase, for detecting topics in tweets. In this paper we investigate the similarity (and dissimilarity) between these KSs and Twitter at the lexical and conceptual (entity) level. We also evaluate the contribution of these types of features and propose various statistical measures for determining the topics that are highly similar or different in KSs and tweets.
Our findings can be of potential use to machine learning or domain adaptation algorithms aiming to use named entities for topic classification of tweets. These results can also be valuable in the identification of representative sets of annotated articles from the KSs, which can help in building accurate topic classifiers of tweets.
Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets #KECSM 2012 #ISWC2012
1. Motivation Research question State-of-the-art Methodology Results Conclusions and Future Work
Exploring the similarity between Social Knowledge Sources and
Twitter for Cross-Domain Topic Classification of Tweets
Andrea Varga, Amparo E. Cano and Fabio Ciravegna
1 Organisations, Information and Knowledge (OAK) Research Group, University of Sheffield
2 Knowledge Management Institute (KMI), Open University
KECSM 2012/ISWC 2012
Nov 12, 2012
Outline
1 Motivation
2 State-of-the-art
3 Methodology
4 Results
5 Conclusions and Future Work
Why classify tweets into topics?
Topic classification (TC) of tweets can be important for multiple applications:
Information Retrieval
Recommendation
Emergency response, etc.
Topic name                        Example tweet
Disaster&Accident (DisAcc)        happening accident people dying could phone ambulance wakakkaka xd
Entertainment&Culture (EntCult)   google adwords commercial greeeat enjoyed watching greeeeeat day
Politics (Pol)                    quoting military source sk media reports deployed rocket launchers decoys real
Sports (Sports)                   ravens good position games left browns bengals playoffs
What are the challenges in Topic Classification (TC) of Tweets?
Special characteristics of tweets
the restricted size of a post (limited to 140 characters)
the frequent use of misspellings and jargon
the frequent use of abbreviations
the use of non-standard English: reflected in vocabulary and writing style
=> These characteristics pose additional challenges for traditional supervised machine learning approaches to building accurate topic classifiers of tweets
Why are Social Knowledge Sources (KS) relevant to Twitter?
Data bottleneck problem: we investigate an alternative approach, inspired by domain adaptation/transfer learning, that exploits the information in Social Knowledge Sources (DBpedia and Freebase) for TC of tweets
Commonalities between KSs and Twitter
they are constantly edited by web users
they are social and built in a collaborative manner
they cover a large number of topics
More importantly: KSs contain a large amount of annotated data covering a large number of topics
Research questions
1 Are KSs relevant for topic classification of Tweets?
2 Which features make the KSs look more similar to Twitter?
3 How similar or dissimilar are KSs to Twitter? Which similarity measure best quantifies the lexical changes between KSs and Twitter?
State-of-the-art approaches for TC of Tweets
Using DBpedia for Topic Classification of Tweets:
Wikify (R. Mihalcea and A. Csomai, 2007)
Enriching unstructured text with Wikipedia links (D. Milne and I. H. Witten, 2008)
Tagme (P. Ferragina and U. Scaiella, 2010)
Topical Social Sensor (P. K. P. N. Mendes et al., 2010)
Vector space model (O. Munoz-Garcia et al., 2011)
Using Freebase for Topic Classification of Tweets:
Clustering based approach (S. P. Kasiviswanathan et al., 2011)
Our main contribution:
Understanding the similarity between KSs and Twitter
Exploring multiple KSs (DBpedia + Freebase)
Investigating various statistical metrics for quantifying the
similarity between KSs and Twitter
Methodology followed
1 Collecting Data from KSs
2 Building Cross-Domain (CD) Topic Classifier of Tweets
3 Measuring Distributional Changes Between KSs and Twitter
[Figure: pipeline for the three scenarios (Sc. DB, Sc. FB, Sc. DB-FB): retrieve articles / retrieve tweets, concept enrichment, build cross-domain classifier, annotate tweets]
Step 1: Collecting Data from KSs
Twitter corpus collected in Abel et al. (2011): tweets posted between October 2010 and January 2011, annotated with 17 topics
Random selection of 1,000 articles/tweets from DBpedia/Freebase/Twitter for each topic => 9,465 articles from DBpedia; 16,915 articles from Freebase; and 12,412 tweets
Preprocessing: removal of hashtags, mentions and URLs from tweets; taking the top-1000 features for each topic
[Figure: multilabel frequency charts for DBpedia, Freebase and Twitter]
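The preprocessing step could be sketched as follows. This is a minimal illustration, not the authors' code: the regular expressions, the decision to drop the whole hashtag token, and ranking features by weight for the top-k cut are all assumptions, since the slides do not specify them. The function names `clean_tweet` and `top_k_features` are hypothetical.

```python
import re

# Assumed patterns: the slides do not say how URLs, mentions and hashtags were matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text):
    """Strip URLs, @mentions and #hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def top_k_features(weights, k=1000):
    """Keep the k highest-weighted features (ties broken alphabetically).

    `weights` maps a feature to a score; ranking by this score is an assumption.
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:k]]


if __name__ == "__main__":
    print(clean_tweet("RT @bbc: flood warning #weather http://t.co/abc now"))
    print(top_k_features({"a": 1.0, "b": 3.0, "c": 2.0}, k=2))
```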
Business_Finance Disaster_Accident Education Entertainment Environment
Health Human Interest Labor Law_Crime Technology_IT
Religion Social Issues Weather Sports War_Conflict
Politics
Retrieval of articles for a given topic (e.g. Politics):
from DBpedia: executing SPARQL queries to retrieve the categories whose names contain the topic name, e.g.:
Category:Politics_of_the_United_States
Category:National_Democratic_Party_Egypt_politicians
etc.
from Freebase: accessing the Text Service API for articles belonging to the topic;
for underspecified topics/domains: considering articles that contain the topic name in their titles
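One way the DBpedia retrieval might look is sketched below. The slides do not show the actual queries, so the graph patterns are assumptions: categories are taken to be `skos:Concept` resources with an `rdfs:label`, and articles are linked to categories through `dct:subject`. The helper names `category_query` and `article_query` are hypothetical, and the code only builds query strings; it does not contact any endpoint.

```python
def category_query(topic, limit=100):
    """Build a SPARQL query for categories whose label contains `topic`."""
    needle = topic.lower().replace('"', "")
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?category WHERE {{
  ?category a skos:Concept ;
            rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), "{needle}"))
}}
LIMIT {limit}
""".strip()


def article_query(category_uri, limit=1000):
    """Build a SPARQL query for the articles filed under one category."""
    return f"""
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?article WHERE {{
  ?article dct:subject <{category_uri}> .
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(category_query("Politics"))
    print(article_query(
        "http://dbpedia.org/resource/Category:Politics_of_the_United_States"))
```

The resulting strings could then be sent to a SPARQL endpoint with any HTTP or SPARQL client.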
Step 2: Building Cross-Domain (CD) Topic Classifier of Tweets
Considering two different feature sets:
BoW: TF-IDF values of the words present in the examples (articles or tweets)
BoE: TF-IDF values of the words and entity+concept pairs present in the examples (articles or tweets)
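The two feature sets can be illustrated with a small sketch. The exact TF-IDF variant is not given in the slides, so the one below (relative term frequency times log inverse document frequency) is an assumption, as is encoding an entity+concept pair as a single `entity|concept` token. The entity annotation in the usage example is made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {feature: weight} dict per document.

    `docs` is a list of token lists. Weight = (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter(f for doc in docs for f in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            f: (c / len(doc)) * math.log(n_docs / doc_freq[f])
            for f, c in counts.items()
        })
    return weighted


def boe_tokens(words, entities):
    """BoE tokens: the words plus one token per (entity, concept) pair."""
    return list(words) + [f"{entity}|{concept}" for entity, concept in entities]


if __name__ == "__main__":
    bow_docs = [["rocket", "launchers", "deployed"],
                ["ravens", "playoffs", "deployed"]]
    print(tfidf(bow_docs))
    # Hypothetical entity annotation, for illustration only.
    print(boe_tokens(["ravens", "playoffs"], [("Baltimore Ravens", "SportsTeam")]))
```

A feature that occurs in every document, such as "deployed" here, receives weight zero under this variant.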
Step 3: Measuring Distributional Changes Between KSs and Twitter
Building a vector $\vec{d_s}$ for each source dataset (Sc.DB, Sc.FB, Sc.DB-FB) and a vector $\vec{d_t}$ for the target dataset (Twitter), consisting of the TF-IDF weights for either the BoW or BoE feature sets
Statistical measures applied:
χ2 test: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed value for a feature, while E is the expected value calculated on the basis of the joint corpus
Kullback-Leibler symmetric distance: $KL(\vec{d_s} \,\|\, \vec{d_t}) = \sum_{f \in F_S \cup F_T} \left(\vec{d_s}(f) - \vec{d_t}(f)\right) \log \frac{\vec{d_s}(f)}{\vec{d_t}(f)}$
cosine similarity: $cosine(\vec{d_s}, \vec{d_t}) = \frac{\sum_{k=1}^{|F_S \cup F_T|} \vec{d_s}(f_k) \times \vec{d_t}(f_k)}{\sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left(\vec{d_s}(f_k)\right)^2} \times \sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left(\vec{d_t}(f_k)\right)^2}}$
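The three measures could be computed over sparse feature vectors roughly as below. This is a sketch under stated assumptions rather than the authors' implementation: vectors are plain `{feature: weight}` dicts, the χ2 expected values are derived from the pooled source and target totals (one common reading of "the joint corpus"), and a small `eps` is added in the KL distance to avoid taking the logarithm of zero for features missing from one side.

```python
import math


def cosine(ds, dt):
    """Cosine similarity over the union of features of two sparse vectors."""
    feats = set(ds) | set(dt)
    dot = sum(ds.get(f, 0.0) * dt.get(f, 0.0) for f in feats)
    norm_s = math.sqrt(sum(v * v for v in ds.values()))
    norm_t = math.sqrt(sum(v * v for v in dt.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0


def symmetric_kl(ds, dt, eps=1e-9):
    """Symmetric KL distance: sum of (p - q) * log(p / q); eps smoothing is assumed."""
    total = 0.0
    for f in set(ds) | set(dt):
        p = ds.get(f, 0.0) + eps
        q = dt.get(f, 0.0) + eps
        total += (p - q) * math.log(p / q)
    return total


def chi_square(ds, dt):
    """Chi-square statistic with expectations taken from the pooled vectors (assumed)."""
    n_s, n_t = sum(ds.values()), sum(dt.values())
    total = 0.0
    for f in set(ds) | set(dt):
        joint = ds.get(f, 0.0) + dt.get(f, 0.0)
        for observed, n in ((ds.get(f, 0.0), n_s), (dt.get(f, 0.0), n_t)):
            expected = joint * n / (n_s + n_t)
            if expected > 0:
                total += (observed - expected) ** 2 / expected
    return total


if __name__ == "__main__":
    source = {"election": 2.0, "party": 1.0}
    target = {"election": 1.0, "vote": 1.0}
    print(cosine(source, target), symmetric_kl(source, target), chi_square(source, target))
```

Note that cosine is a similarity (higher means closer), whereas χ2 and KL are distances (lower means closer).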
Experimental setting
One-vs-all approach: an individual CD classifier built for each topic, using SVM classifiers and 5-fold cross-validation
Sc.DB, Sc.FB and Sc.DB-FB classifiers: trained on the full KS data, evaluated on 20% of the Twitter data (2,482 tweets)
TGT classifier: trained on 80% of the Twitter data, evaluated on 20% of the Twitter data (2,482 tweets)
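The data handling behind this setting can be sketched as follows. It shows only the 80/20 split and the one-vs-all relabelling; the SVM training itself is omitted. The helper names, the shuffling seed and the way the 20% is rounded are assumptions, not details from the slides.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle `items` and hold out `test_fraction` of them for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def one_vs_all_labels(topic_sets, topic):
    """Binary targets for one topic: 1 if the example carries it, else 0.

    `topic_sets` holds one set of topic labels per example (tweets are multilabel).
    """
    return [1 if topic in topics else 0 for topics in topic_sets]


if __name__ == "__main__":
    train, test = train_test_split(range(12412))
    print(len(train), len(test))
    print(one_vs_all_labels([{"Pol", "War"}, {"Sports"}], "Pol"))
```

With the 12,412 tweets reported earlier, a 20% hold-out gives the 2,482 evaluation tweets stated on the slide.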
Findings: classification performance (F1 measure)
Q1: Which KS better reflects the lexical variation in Twitter?
Q2: Which features make the KSs look more similar to Twitter?
BoW features were found better than BoE for CD classifiers
BoE features were found better than BoW for TGT
F1 per topic and classifier:
Topic     Sc.DB(BoE)  Sc.DB(BoW)  Sc.FB(BoE)  Sc.FB(BoW)  Sc.DB-FB(BoE)  Sc.DB-FB(BoW)  TGT(BoW)  TGT(BoE)
BusFi     21.40       21.20       11.50       15.30       18.60          19.10          47.50     46.50
EntCult   15.30       15.10       15.50       16.20       19.50          20.20          42.20     43.10
Religion  23.40       25.10       14.40       14.70       21.00          20.40          58.40     58.50
Health    28.60       30.30       25.60       24.70       26.90          25.40          51.70     51.80
Pol       22.20       21.00       27.80       26.80       24.20          26.80          45.10     44.90
Law       0.90        2.70        16.80       17.80       14.80          13.30          46.80     46.40
HospRecr  1.40        2.30        17.20       19.50       11.30          13.90          41.60     42.60
SocIssue  1.30        2.00        8.80        9.00        9.70           9.10           44.20     44.00
DisAcc    8.30        9.70        14.50       14.50       21.00          21.10          57.50     59.20
TechIT    1.60        2.40        18.40       18.60       12.40          9.90           57.40     58.00
Env       15.20       14.20       2.20        2.20        8.90           8.40           46.60     48.30
HumInt    1.10        1.40        2.00        1.60        1.50           2.20           33.60     34.20
Weather   3.10        7.00        39.80       39.90       36.70          36.00          81.20     81.50
Labor     1.40        1.30        31.90       31.90       30.10          29.90          79.90     79.40
War       9.30        10.90       23.90       23.60       24.60          25.70          67.60     72.70
Sports    10.30       11.70       26.50       26.20       26.20          26.00          60.10     59.20
Edu       37.40       37.80       42.50       47.20       42.50          45.70          71.90     71.30
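The averaged comparison behind the two statements above can be checked with a short script. The F1 values are copied from the slide; the assumption is that the eight numbers in each row follow the order in which the classifier labels are listed (Sc.DB(BoE), Sc.DB(BoW), ..., TGT(BoW), TGT(BoE)). The helper name `macro_average` is hypothetical.

```python
COLUMNS = ["Sc.DB(BoE)", "Sc.DB(BoW)", "Sc.FB(BoE)", "Sc.FB(BoW)",
           "Sc.DB-FB(BoE)", "Sc.DB-FB(BoW)", "TGT(BoW)", "TGT(BoE)"]

# Per-topic F1 values as printed on the slide (column order is an assumption).
F1 = {
    "BusFi":    [21.40, 21.20, 11.50, 15.30, 18.60, 19.10, 47.50, 46.50],
    "EntCult":  [15.30, 15.10, 15.50, 16.20, 19.50, 20.20, 42.20, 43.10],
    "Religion": [23.40, 25.10, 14.40, 14.70, 21.00, 20.40, 58.40, 58.50],
    "Health":   [28.60, 30.30, 25.60, 24.70, 26.90, 25.40, 51.70, 51.80],
    "Pol":      [22.20, 21.00, 27.80, 26.80, 24.20, 26.80, 45.10, 44.90],
    "Law":      [0.90, 2.70, 16.80, 17.80, 14.80, 13.30, 46.80, 46.40],
    "HospRecr": [1.40, 2.30, 17.20, 19.50, 11.30, 13.90, 41.60, 42.60],
    "SocIssue": [1.30, 2.00, 8.80, 9.00, 9.70, 9.10, 44.20, 44.00],
    "DisAcc":   [8.30, 9.70, 14.50, 14.50, 21.00, 21.10, 57.50, 59.20],
    "TechIT":   [1.60, 2.40, 18.40, 18.60, 12.40, 9.90, 57.40, 58.00],
    "Env":      [15.20, 14.20, 2.20, 2.20, 8.90, 8.40, 46.60, 48.30],
    "HumInt":   [1.10, 1.40, 2.00, 1.60, 1.50, 2.20, 33.60, 34.20],
    "Weather":  [3.10, 7.00, 39.80, 39.90, 36.70, 36.00, 81.20, 81.50],
    "Labor":    [1.40, 1.30, 31.90, 31.90, 30.10, 29.90, 79.90, 79.40],
    "War":      [9.30, 10.90, 23.90, 23.60, 24.60, 25.70, 67.60, 72.70],
    "Sports":   [10.30, 11.70, 26.50, 26.20, 26.20, 26.00, 60.10, 59.20],
    "Edu":      [37.40, 37.80, 42.50, 47.20, 42.50, 45.70, 71.90, 71.30],
}


def macro_average(table, columns):
    """Average each column over all topics (unweighted mean)."""
    n_topics = len(table)
    return {
        name: sum(row[i] for row in table.values()) / n_topics
        for i, name in enumerate(columns)
    }


AVG = macro_average(F1, COLUMNS)

if __name__ == "__main__":
    for name in COLUMNS:
        print(f"{name:15s} {AVG[name]:.2f}")
```

Under that column-order assumption, the averages agree with the slide: BoW is ahead of BoE for each of the three CD classifiers, while BoE is ahead of BoW for TGT.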
Findings: number of annotations needed for the Twitter classifier to outperform Sc. DB-FB
Investigated the benefit of employing the Sc. DB-FB classifier over the Twitter classifier in terms of the number of annotations required
Compared the performance of the Twitter classifier against the three CD classifiers over the full learning curve
=> In the absence of any annotated tweets, applying these CD classifiers is beneficial
Findings: number of annotations needed for the Twitter classifier to outperform the CD classifiers
Q3: How similar or dissimilar are KSs to Twitter posts, and which similarity measure best reflects the lexical changes between KSs and Twitter posts?
Compared χ2, KL divergence and cosine similarity for each topic
χ2 obtained the best correlation with the performance of the CD classifiers, achieving correlation scores >70% in 32 cases
cosine obtained correlation scores >70% in 25 cases
KL obtained correlation scores >70% in 24 cases
=> Of the three measures, the χ2 test is the best for quantifying the distributional differences between KSs and Twitter.
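Relating a similarity measure to classifier performance amounts to correlating two series, for example a measure's value and the CD classifier's F1 across settings for a topic. The slides do not say which correlation coefficient was used; the sketch below assumes Pearson's r, and the numbers in the usage example are made up purely to show the call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # undefined for a constant series; 0.0 is a convention chosen here
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Illustrative values only, not results from the study.
    distance = [0.9, 0.7, 0.4, 0.2]
    f1_score = [10.0, 18.0, 31.0, 40.0]
    print(pearson(distance, f1_score))
```

For a distance measure such as χ2 or KL, a strong negative correlation with F1 would indicate that the measure tracks performance; comparing absolute values keeps distances and similarities on the same footing.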
Conclusions and Future Work
We presented a first study towards understanding the usefulness of KSs in TC of tweets at various granularities: lexical features (BoW) and entity features (BoE)
Our main findings are:
In the absence of any annotated tweets, applying these CD classifiers is beneficial
Of the two KSs, Freebase topics seem to be much closer to the Twitter topics than the DBpedia topics
The two KSs contain complementary information
For the CD classifiers, on average BoW features were more useful than BoE features
For the Twitter classifiers, on average BoE features were more useful than BoW features
We found the χ2 test to be the best measure for quantifying the distributional differences between KSs and Twitter
Our future work will focus on building more accurate topic classifiers and investigating better measures