Motivation      Research question          State-of-the-art     Methodology        Results      Conclusions and Future Work




             Exploring the similarity between Social Knowledge Sources and
                Twitter for Cross-Domain Topic Classification of Tweets



                       Andrea Varga, Amparo E. Cano and Fabio Ciravegna

                       1 Organisations Information and Knowledge (OAK) Research Group, University of Sheffield
                       2 Knowledge Media Institute (KMi), Open University
                                                    KECSM 2012/ISWC 2012


                                                        Nov 12, 2012








Outline



         1   Motivation


         2   State-of-the-art


         3   Methodology


         4   Results


         5   Conclusions and Future Work








Why classify tweets into topics?

             Topic classification (TC) of tweets can be important for multiple
             applications:

                 Information Retrieval

                 Recommendation

                 Emergency responses, etc.

                Topic name                             Example tweets
                Disaster&Accident (DisAcc)             happening accident people dying could phone ambulance wakakkaka xd
                Entertainment&Culture (EntCult)        google adwords commercial greeeat enjoyed watching greeeeeat day
                Politics (Pol)                         quoting military source sk media reports deployed rocket launchers decoys real
                Sports (Sports)                        ravens good position games left browns bengals playoffs




What are the challenges in Topic Classification (TC) of tweets?

                Special characteristics of tweets

                     the restricted size of a post (limited to 140 characters)

                     the frequent use of misspellings and jargon

                     the frequent use of abbreviations

                     the use of non-standard English, reflected in vocabulary and writing style

                       Topic name                          Example tweets
                       Disaster&Accident (DisAcc)          happening accident people dying could phone ambulance wakakkaka xd
                       Entertainment&Culture (EntCult)     google adwords commercial greeeat enjoyed watching greeeeeat day
                       Politics (Pol)                      quoting military source sk media reports deployed rocket launchers decoys real
                       Sports (Sports)                     ravens good position games left browns bengals playoffs


             => These characteristics pose additional challenges for traditional
                supervised machine learning approaches to building
                accurate topic classifiers for tweets
Why are Social Knowledge Sources (KSs) relevant to Twitter?

             Data bottleneck problem: we investigate an alternative approach, inspired
             by domain adaptation/transfer learning, that exploits information from
             Social Knowledge Sources (DBpedia and Freebase) for TC of tweets


             Commonalities between KSs and Twitter

                 they are constantly edited by web users

                 they are social and built in a collaborative manner

                 they cover a large number of topics

             More importantly: KSs contain a large amount of annotated data on a
             large number of topics








Research questions




             1   Are KSs relevant for topic classification of tweets?

             2   Which features make the KSs look more similar to Twitter?

             3   How similar or dissimilar are KSs to Twitter? Which similarity measure
                 better quantifies the lexical differences between KSs and Twitter?








State-of-the-art approaches for TC of Tweets

             Using DBpedia for topic classification of tweets:
                 Wikify (Mihalcea and Csomai, 2007)
                 Enriching unstructured text with Wikipedia links (Milne and Witten, 2008)
                 Tagme (Ferragina and Scaiella, 2010)
                 Topical Social Sensor (Mendes et al., 2010)
                 Vector space model (Munoz-Garcia et al., 2011)

             Using Freebase for topic classification of tweets:
                 Clustering-based approach (Kasiviswanathan et al., 2011)


             Our main contribution:
                 Understanding the similarity between KSs and Twitter

                 Exploring multiple KSs (DBpedia + Freebase)

                 Investigating various statistical metrics for quantifying the
                 similarity between KSs and Twitter




Methodology followed

             1   Collecting Data from KSs

             2   Building Cross-Domain (CD) Topic Classifier of Tweets

             3   Measuring Distributional Changes Between KSs and Twitter

             [Figure: pipeline diagram. Source scenarios Sc.DB, Sc.FB and Sc.DB-FB: retrieve articles, concept enrichment, build cross-domain classifier. Target (Twitter): retrieve tweets, concept enrichment, annotate tweets.]








Step 1: Collecting Data from KSs

                       Twitter corpus collected in Abel et al. (2011): tweets posted between October 2010 and
                       January 2011, annotated with 17 topics
                       Random selection of 1,000 articles/tweets from DBpedia/Freebase/Twitter for each topic =>
                       9,465 articles from DBpedia; 16,915 articles from Freebase; and 12,412 tweets
                       Preprocessing: removal of hashtags, mentions and URLs from tweets; taking the top-1000
                       features for each topic

                       [Figure: three pie charts of multilabel frequency. DBpedia: 88.6%, 8.6%, 1.8%, 0.9% (legend: 1, 8, 2, 3+4+5+6+7+9). Freebase: 99.9%, 0.1% (legend: 1, 2). Twitter: 71%, 22.3%, 5.6%, 1%, 0.1% (legend: 1, 2, 3, 4, 6+5).]
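The preprocessing step named on this slide (removing hashtags, mentions and URLs) could look roughly like the sketch below. The regular expressions and the function name `clean_tweet` are assumptions made for illustration; the slides do not say which rules were used, nor whether a hashtag was dropped entirely or only stripped of its `#`.

```python
import re

# Assumed patterns; the slides do not specify the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")  # assumption: the whole hashtag token is removed


def clean_tweet(text: str) -> str:
    """Strip URLs, @mentions and #hashtags, then collapse whitespace."""
    # URLs go first so that a '#fragment' inside a link is removed with it.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("@bob ravens in good position #NFL http://t.co/xyz playoffs"))
```

The remaining tokens would then feed the feature extraction; how the top-1000 features per topic were ranked is not stated, so that step is left out here.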








Step 1: Collecting Data from KSs

                         Business_Finance   Disaster_Accident   Education       Entertainment         Environment



                              Health          Human Interest     Labor           Law_Crime            Technology_IT



                             Religion          Social Issues    Weather            Sports             War_Conflict



                                                                 Politics



             Retrieval of articles for a given topic (e.g. Politics):
                  from DBpedia: executing SPARQL queries to retrieve the category names
                  that contain the topic name:
                          Category:Politics_of_the_United_States
                          Category:National_Democratic_Party_Egypt_politicians
                          etc.

                  from Freebase: accessing the Text Service API for articles belonging to the
                  topic:
                          for underspecified topics/domains: consider articles containing the topic in their
                          titles
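As a rough illustration of the DBpedia retrieval step, the sketch below only builds a SPARQL query string for categories whose URI contains the topic name. The query shape is an assumption, not the authors' actual query: it relies on DBpedia categories being typed as `skos:Concept`, and the helper name `category_query` and the `limit` parameter are hypothetical.

```python
def category_query(topic: str, limit: int = 100) -> str:
    """Build a SPARQL query for categories whose URI contains `topic`."""
    # Escape characters that would break the SPARQL string literal.
    needle = topic.lower().replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?category WHERE {{
  ?category a skos:Concept .
  FILTER(CONTAINS(LCASE(STR(?category)), "{needle}"))
}}
LIMIT {int(limit)}
""".strip()


print(category_query("Politics"))
```

A second query (not shown) would then collect the articles attached to each returned category, for instance through the `dct:subject` property that DBpedia uses to link articles to categories.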




Step 2: Building Cross-Domain (CD) Topic Classifier of Tweets

             Considering two different feature sets:
                 BOW: tf.idf value of the words present in the examples (articles or tweets)

                 BOE: tf.idf value of the words and entity+concept pairs present in the examples
                 (articles or tweets)

             [Figure: pipeline diagram, as on the "Methodology followed" slide]
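The two feature sets can be sketched as follows. This is a minimal illustration under stated assumptions: raw term frequency with idf = log(N/df) is one common tf.idf variant and may differ from the weighting used in the study, and the example tokens, the entity+concept annotations and the `|` separator are all hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {feature: tf * idf} dict per document (raw tf, idf = log(N / df))."""
    n = len(docs)
    df = Counter(f for doc in docs for f in set(doc))
    return [
        {f: tf * math.log(n / df[f]) for f, tf in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical tokenised examples (articles or tweets).
words = [
    ["ravens", "browns", "playoffs"],
    ["rocket", "launchers", "deployed"],
    ["ravens", "games"],
]
# Hypothetical entity+concept pairs produced by concept enrichment.
pairs = [
    ["Baltimore_Ravens|SportsTeam"],
    ["Rocket_launcher|Weapon"],
    ["Baltimore_Ravens|SportsTeam"],
]

bow = tfidf(words)                                   # words only
boe = tfidf([w + p for w, p in zip(words, pairs)])   # words plus entity+concept pairs
```

In this sketch BoE is simply BoW extended with one extra feature per entity+concept pair, so both feature sets share the same weighting scheme.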








Step 3: Measuring Distributional Changes Between KSs and Twitter

             Building a vector $\vec{d_s}$ for each source dataset (Sc.DB, Sc.FB,
             Sc.DB-FB) and a vector $\vec{d_t}$ for the target dataset (Twitter), consisting of
             the TF-IDF weights for either the BoW or BoE feature set

             Statistical measures applied:

                 $\chi^2$ test: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed value for a feature and
                 E is the expected value calculated on the basis of the joint corpus

                 Kullback-Leibler symmetric distance:
                 $KL(\vec{d_s} \,\|\, \vec{d_t}) = \sum_{f \in F_S \cup F_T} \left(\vec{d_s}(f) - \vec{d_t}(f)\right) \log \frac{\vec{d_s}(f)}{\vec{d_t}(f)}$

                 Cosine similarity:
                 $cosine(\vec{d_s}, \vec{d_t}) = \frac{\sum_{k=1}^{|F_S \cup F_T|} \vec{d_s}(f_{S_k}) \times \vec{d_t}(f_{T_k})}{\sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left(\vec{d_s}(f_{S_k})\right)^2} \times \sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left(\vec{d_t}(f_{T_k})\right)^2}}$
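A minimal sketch of the three measures on sparse vectors stored as dicts (feature to weight) is given below. Several choices are assumptions rather than details taken from the slides: expected values for the chi-square statistic are derived from the pooled totals of the two vectors, the symmetric KL distance uses additive smoothing (`eps`) and normalisation so that the logarithm is defined for features missing from one side, and all function names are hypothetical.

```python
import math


def _align(ds, dt):
    """Project both sparse vectors onto the union of their features."""
    feats = sorted(set(ds) | set(dt))
    return [ds.get(f, 0.0) for f in feats], [dt.get(f, 0.0) for f in feats]


def chi_square(ds, dt):
    """Sum of (O - E)^2 / E over both datasets; E comes from the pooled totals."""
    s, t = _align(ds, dt)
    ns, nt = sum(s), sum(t)
    if ns <= 0 or nt <= 0:
        raise ValueError("both vectors need positive total weight")
    total = 0.0
    for o_s, o_t in zip(s, t):
        joint = o_s + o_t
        if joint == 0:
            continue
        e_s = ns * joint / (ns + nt)  # expected weight in the source
        e_t = nt * joint / (ns + nt)  # expected weight in the target
        total += (o_s - e_s) ** 2 / e_s + (o_t - e_t) ** 2 / e_t
    return total


def symmetric_kl(ds, dt, eps=1e-9):
    """Sum of (p - q) * log(p / q) over smoothed, normalised weights."""
    s, t = _align(ds, dt)
    s = [x + eps for x in s]  # assumed smoothing; avoids log(0) and division by zero
    t = [x + eps for x in t]
    zs, zt = sum(s), sum(t)
    return sum((p / zs - q / zt) * math.log((p / zs) / (q / zt)) for p, q in zip(s, t))


def cosine(ds, dt):
    """Dot product divided by the product of the Euclidean norms."""
    s, t = _align(ds, dt)
    dot = sum(a * b for a, b in zip(s, t))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    return dot / norm if norm else 0.0
```

Note the opposite orientations: a larger chi-square or KL value indicates that the source is further from Twitter, whereas a larger cosine value indicates that it is closer.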








Experimental setting




             1-vs-all approach: an individual CD classifier is built for each topic, using SVM
             classifiers and 5-fold cross-validation


             Sc.DB, Sc.FB and Sc.DB-FB classifiers: trained on the full KS data, evaluated on
             20% of the Twitter data (2,482 tweets)


             TGT classifier: trained on 80% of the Twitter data, evaluated on 20% of the Twitter
             data (2,482 tweets)
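The evaluation protocol can be sketched without any learning library: binary 1-vs-all targets per topic, one 80/20 split of the Twitter data, and the F1 measure used to report the findings. The function names, the fixed seed and the label sets are hypothetical, and the SVM training itself is omitted.

```python
import random


def one_vs_all(labels, topic):
    """Binary targets for one topic: 1 if the example carries it, else 0."""
    return [1 if topic in doc_labels else 0 for doc_labels in labels]


def split_80_20(n_items, seed=0):
    """Shuffle indices and hold out one fifth of them for evaluation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    cut = n_items - n_items // 5
    return idx[:cut], idx[cut:]


def f1(gold, pred):
    """F1 = 2TP / (2TP + FP + FN) for binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


train_idx, test_idx = split_80_20(12412)
print(len(train_idx), len(test_idx))
```

With 12,412 tweets the held-out fifth contains 2,482 tweets, matching the figure on the slide. Repeating the split over five disjoint held-out portions would give the 5-fold setting.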




Findings - Classification performance in F1 measure

             Q1: Which KS better reflects the lexical variation in Twitter?


             Topic      Sc.DB(BoE)  Sc.DB(BoW)  Sc.FB(BoE)  Sc.FB(BoW)  Sc.DB-FB(BoE)  Sc.DB-FB(BoW)
             BusFi      21.40       21.20       11.50       15.30       18.60          19.10          47.50   46.50
             EntCult    15.30       15.10       15.50       16.20       19.50          20.20          42.20   43.10
             Religion   23.40       25.10       14.40       14.70       21.00          20.40          58.40   58.50
             Health     28.60       30.30       25.60       24.70       26.90          25.40          51.70   51.80
             Pol        22.20       21.00       27.80       26.80       24.20          26.80          45.10   44.90
             Law         0.90        2.70       16.80       17.80       14.80          13.30          46.80   46.40
             HospRecr    1.40        2.30       17.20       19.50       11.30          13.90          41.60   42.60
             SocIssue    1.30        2.00        8.80        9.00        9.70           9.10          44.20   44.00
             DisAcc      8.30        9.70       14.50       14.50       21.00          21.10          57.50   59.20
             TechIT      1.60        2.40       18.40       18.60       12.40           9.90          57.40   58.00
             Env        15.20       14.20        2.20        2.20        8.90           8.40          46.60   48.30
             HumInt      1.10        1.40        2.00        1.60        1.50           2.20          33.60   34.20
             Weather     3.10        7.00       39.80       39.90       36.70          36.00          81.20   81.50
             Labor       1.40        1.30       31.90       31.90       30.10          29.90          79.90   79.40
             War         9.30       10.90       23.90       23.60       24.60          25.70          67.60   72.70
             Sports     10.30       11.70       26.50       26.20       26.20          26.00          60.10   59.20
             Edu        37.40       37.80       42.50       47.20       42.50          45.70          71.90   71.30



                                                                                                                           TGT(BoW)



                                                                                                                                       TGT(BoE)




                                                                                                                                                                                                14/22
Motivation      Research question                          State-of-the-art                                              Methodology                         Results   Conclusions and Future Work




Findings -Classification performance in F1 measure

             Q1 : Which KS reflects better the lexical variation in Twitter?


                               21.40         21.20         11.50         15.30         18.60            19.10             47.50       46.50       BusFi

                               15.30         15.10         15.50         16.20         19.50            20.20             42.20       43.10       EntCult

                               23.40         25.10         14.40         14.70         21.00            20.40             58.40       58.50       Religion

                               28.60         30.30         25.60         24.70         26.90            25.40             51.70       51.80       Health

                               22.20         21.00         27.80         26.80         24.20            26.80             45.10       44.90       Pol

                               0.90          2.70          16.80         17.80         14.80            13.30             46.80       46.40       Law

                               1.40          2.30          17.20         19.50         11.30            13.90             41.60       42.60       HospRecr

                               1.30          2.00          8.80          9.00          9.70             9.10              44.20       44.00       SocIssue

                               8.30          9.70          14.50         14.50         21.00            21.10             57.50       59.20       DisAcc

                               1.60          2.40          18.40         18.60         12.40            9.90              57.40       58.00       TechIT

                               15.20         14.20         2.20          2.20          8.90             8.40              46.60       48.30       Env

                               1.10          1.40          2.00          1.60          1.50             2.20              33.60       34.20       HumInt

                               3.10          7.00          39.80         39.90         36.70            36.00             81.20       81.50       Weather

                               1.40          1.30          31.90         31.90         30.10            29.90             79.90       79.40       Labor

                               9.30          10.90         23.90         23.60         24.60            25.70             67.60       72.70       War

                               10.30         11.70         26.50         26.20         26.20            26.00             60.10       59.20       Sports

                               37.40         37.80         42.50         47.20         42.50            45.70             71.90       71.30       Edu
                                Sc.DB(BoE)



                                              Sc.DB(BoW)



                                                            Sc.FB(BoE)



                                                                          SC.FB(BoW)



                                                                                        Sc.DB−FB(BoE)



                                                                                                         SC.DB−FB(BoW)



                                                                                                                           TGT(BoW)



                                                                                                                                       TGT(BoE)




                                                                                                                                                                                                14/22
Motivation      Research question                          State-of-the-art                                              Methodology                         Results   Conclusions and Future Work




Findings -Classification performance in F1 measure

             Q1 : Which KS reflects better the lexical variation in Twitter?


                               21.40         21.20         11.50         15.30         18.60            19.10             47.50       46.50       BusFi

                               15.30         15.10         15.50         16.20         19.50            20.20             42.20       43.10       EntCult

                               23.40         25.10         14.40         14.70         21.00            20.40             58.40       58.50       Religion

                               28.60         30.30         25.60         24.70         26.90            25.40             51.70       51.80       Health

                               22.20         21.00         27.80         26.80         24.20            26.80             45.10       44.90       Pol

                               0.90          2.70          16.80         17.80         14.80            13.30             46.80       46.40       Law

                               1.40          2.30          17.20         19.50         11.30            13.90             41.60       42.60       HospRecr

                               1.30          2.00          8.80          9.00          9.70             9.10              44.20       44.00       SocIssue

                               8.30          9.70          14.50         14.50         21.00            21.10             57.50       59.20       DisAcc

                               1.60          2.40          18.40         18.60         12.40            9.90              57.40       58.00       TechIT

                               15.20         14.20         2.20          2.20          8.90             8.40              46.60       48.30       Env

                               1.10          1.40          2.00          1.60          1.50             2.20              33.60       34.20       HumInt

                               3.10          7.00          39.80         39.90         36.70            36.00             81.20       81.50       Weather

                               1.40          1.30          31.90         31.90         30.10            29.90             79.90       79.40       Labor

                               9.30          10.90         23.90         23.60         24.60            25.70             67.60       72.70       War

                               10.30         11.70         26.50         26.20         26.20            26.00             60.10       59.20       Sports

                               37.40         37.80         42.50         47.20         42.50            45.70             71.90       71.30       Edu
                                Sc.DB(BoE)



                                              Sc.DB(BoW)



                                                            Sc.FB(BoE)



                                                                          SC.FB(BoW)



                                                                                        Sc.DB−FB(BoE)



                                                                                                         SC.DB−FB(BoW)



                                                                                                                           TGT(BoW)



                                                                                                                                       TGT(BoE)




                                                                                                                                                                                                14/22
Motivation      Research question                          State-of-the-art                                              Methodology                         Results   Conclusions and Future Work




Findings -Classification performance in F1 measure

             Q1 : Which KS reflects better the lexical variation in Twitter?


                               21.40         21.20         11.50         15.30         18.60            19.10             47.50       46.50       BusFi

                               15.30         15.10         15.50         16.20         19.50            20.20             42.20       43.10       EntCult

                               23.40         25.10         14.40         14.70         21.00            20.40             58.40       58.50       Religion

                               28.60         30.30         25.60         24.70         26.90            25.40             51.70       51.80       Health

                               22.20         21.00         27.80         26.80         24.20            26.80             45.10       44.90       Pol

                               0.90          2.70          16.80         17.80         14.80            13.30             46.80       46.40       Law

                               1.40          2.30          17.20         19.50         11.30            13.90             41.60       42.60       HospRecr

                               1.30          2.00          8.80          9.00          9.70             9.10              44.20       44.00       SocIssue

                               8.30          9.70          14.50         14.50         21.00            21.10             57.50       59.20       DisAcc

                               1.60          2.40          18.40         18.60         12.40            9.90              57.40       58.00       TechIT

                               15.20         14.20         2.20          2.20          8.90             8.40              46.60       48.30       Env

                               1.10          1.40          2.00          1.60          1.50             2.20              33.60       34.20       HumInt

                               3.10          7.00          39.80         39.90         36.70            36.00             81.20       81.50       Weather

                               1.40          1.30          31.90         31.90         30.10            29.90             79.90       79.40       Labor

                               9.30          10.90         23.90         23.60         24.60            25.70             67.60       72.70       War

                               10.30         11.70         26.50         26.20         26.20            26.00             60.10       59.20       Sports

                               37.40         37.80         42.50         47.20         42.50            45.70             71.90       71.30       Edu
                                Sc.DB(BoE)



                                              Sc.DB(BoW)



                                                            Sc.FB(BoE)



                                                                          SC.FB(BoW)



                                                                                        Sc.DB−FB(BoE)



                                                                                                         SC.DB−FB(BoW)



                                                                                                                           TGT(BoW)



                                                                                                                                       TGT(BoE)




                                                                                                                                                                                                14/22
Motivation      Research question                          State-of-the-art                                              Methodology                         Results   Conclusions and Future Work




Findings - Classification performance in F1 measure

             Q1: Which KS better reflects the lexical variation in Twitter?

                  Sc.DB-FB showed the best performance, followed by Sc.FB and Sc.DB


             F1 per topic (rows) and classifier (columns)

             Topic       Sc.DB   Sc.DB   Sc.FB   Sc.FB   Sc.DB-FB  Sc.DB-FB    TGT     TGT
                         (BoE)   (BoW)   (BoE)   (BoW)      (BoE)     (BoW)  (BoW)   (BoE)
             BusFi       21.40   21.20   11.50   15.30      18.60     19.10  47.50   46.50
             EntCult     15.30   15.10   15.50   16.20      19.50     20.20  42.20   43.10
             Religion    23.40   25.10   14.40   14.70      21.00     20.40  58.40   58.50
             Health      28.60   30.30   25.60   24.70      26.90     25.40  51.70   51.80
             Pol         22.20   21.00   27.80   26.80      24.20     26.80  45.10   44.90
             Law          0.90    2.70   16.80   17.80      14.80     13.30  46.80   46.40
             HospRecr     1.40    2.30   17.20   19.50      11.30     13.90  41.60   42.60
             SocIssue     1.30    2.00    8.80    9.00       9.70      9.10  44.20   44.00
             DisAcc       8.30    9.70   14.50   14.50      21.00     21.10  57.50   59.20
             TechIT       1.60    2.40   18.40   18.60      12.40      9.90  57.40   58.00
             Env         15.20   14.20    2.20    2.20       8.90      8.40  46.60   48.30
             HumInt       1.10    1.40    2.00    1.60       1.50      2.20  33.60   34.20
             Weather      3.10    7.00   39.80   39.90      36.70     36.00  81.20   81.50
             Labor        1.40    1.30   31.90   31.90      30.10     29.90  79.90   79.40
             War          9.30   10.90   23.90   23.60      24.60     25.70  67.60   72.70
             Sports      10.30   11.70   26.50   26.20      26.20     26.00  60.10   59.20
             Edu         37.40   37.80   42.50   47.20      42.50     45.70  71.90   71.30
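
One way to read the table against the Q1 finding is to average each cross-domain column over the 17 topics. The sketch below does this with the F1 values copied from the slide. Macro-averaging is an assumption made here for illustration; the slide does not state how the ranking of the knowledge sources was derived.

```python
# F1 values of the six cross-domain (CD) classifiers, copied from the table above.
COLUMNS = ["Sc.DB(BoE)", "Sc.DB(BoW)", "Sc.FB(BoE)", "Sc.FB(BoW)",
           "Sc.DB-FB(BoE)", "Sc.DB-FB(BoW)"]
ROWS = {
    "BusFi":    [21.40, 21.20, 11.50, 15.30, 18.60, 19.10],
    "EntCult":  [15.30, 15.10, 15.50, 16.20, 19.50, 20.20],
    "Religion": [23.40, 25.10, 14.40, 14.70, 21.00, 20.40],
    "Health":   [28.60, 30.30, 25.60, 24.70, 26.90, 25.40],
    "Pol":      [22.20, 21.00, 27.80, 26.80, 24.20, 26.80],
    "Law":      [0.90, 2.70, 16.80, 17.80, 14.80, 13.30],
    "HospRecr": [1.40, 2.30, 17.20, 19.50, 11.30, 13.90],
    "SocIssue": [1.30, 2.00, 8.80, 9.00, 9.70, 9.10],
    "DisAcc":   [8.30, 9.70, 14.50, 14.50, 21.00, 21.10],
    "TechIT":   [1.60, 2.40, 18.40, 18.60, 12.40, 9.90],
    "Env":      [15.20, 14.20, 2.20, 2.20, 8.90, 8.40],
    "HumInt":   [1.10, 1.40, 2.00, 1.60, 1.50, 2.20],
    "Weather":  [3.10, 7.00, 39.80, 39.90, 36.70, 36.00],
    "Labor":    [1.40, 1.30, 31.90, 31.90, 30.10, 29.90],
    "War":      [9.30, 10.90, 23.90, 23.60, 24.60, 25.70],
    "Sports":   [10.30, 11.70, 26.50, 26.20, 26.20, 26.00],
    "Edu":      [37.40, 37.80, 42.50, 47.20, 42.50, 45.70],
}


def macro_average(rows, columns):
    """Unweighted mean of each column over all topics."""
    n_topics = len(rows)
    return {
        name: sum(values[i] for values in rows.values()) / n_topics
        for i, name in enumerate(columns)
    }


averages = macro_average(ROWS, COLUMNS)
# Print the classifiers from highest to lowest macro-averaged F1.
for name in sorted(averages, key=averages.get, reverse=True):
    print(f"{name}: {averages[name]:.2f}")
```

Under this averaging, and within each feature set, the sources rank as Sc.DB-FB, then Sc.FB, then Sc.DB, which matches the finding stated above.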




Findings - Classification performance in F1 measure

             Q2: Which feature makes the KSs look more similar to Twitter?

                 BoW features performed better than BoE for the CD classifiers
                 BoE features performed better than BoW for the TGT classifier
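
To make the two feature types concrete, here is a minimal sketch of a bag-of-words (BoW) and a bag-of-entities (BoE) representation. The gazetteer `TOY_ENTITIES` and both function names are hypothetical; the slides do not show how entities were extracted, so a simple dictionary lookup stands in for a real entity extractor.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer (surface form -> entity label); not from the slides.
TOY_ENTITIES = {
    "obama": "Person:Barack_Obama",
    "nato": "Organisation:NATO",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """BoW: count every token in the text."""
    return Counter(tokenize(text))


def bag_of_entities(text, gazetteer=TOY_ENTITIES):
    """BoE: count only the tokens that map to a known entity label."""
    return Counter(gazetteer[t] for t in tokenize(text) if t in gazetteer)


example = "Obama meets NATO, NATO responds"
bow = bag_of_words(example)
boe = bag_of_entities(example)
```

BoW keeps every token, while BoE keeps only the entity labels, so BoE vectors are typically much sparser for short texts such as tweets.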









Findings - Examining the number of annotations needed for the Twitter
classifier to outperform Sc.DB-FB

                Investigated the impact of employing the Sc.DB-FB classifier instead of
                the Twitter classifier, in terms of the number of annotations


                Compared the performance of the Twitter classifier against the three CD
                classifiers over the full learning curve




             => In the absence of any annotated tweets, applying these CD
                classifiers is beneficial
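
The learning-curve comparison can be sketched as a search for the first point at which the Twitter classifier overtakes a cross-domain baseline. The function name `annotations_to_outperform` and all numbers below are made up for illustration; the slides do not report the actual curve values.

```python
def annotations_to_outperform(curve, baseline_f1):
    """Return the smallest number of annotated tweets at which the
    Twitter classifier's F1 exceeds baseline_f1, or None if it never does.

    curve is an iterable of (n_annotated_tweets, f1) points.
    """
    for n_annotated, f1 in sorted(curve):
        if f1 > baseline_f1:
            return n_annotated
    return None


# Hypothetical learning curve and CD baseline, for illustration only.
toy_curve = [(0, 0.0), (100, 12.5), (500, 24.0), (1000, 38.0)]
toy_cd_baseline = 20.0
crossover = annotations_to_outperform(toy_curve, toy_cd_baseline)
```

With zero annotated tweets the Twitter classifier cannot be trained at all, so any CD classifier with a non-zero F1 wins at that point of the curve.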




Findings - Examining the number of annotations needed for the Twitter
classifier to outperform the CD classifiers

                Q3: How similar or dissimilar are KSs to Twitter posts, and which
                similarity measure best reflects the lexical changes between KSs
                and Twitter posts?

                     Compared χ², KL-divergence and cosine similarity for each topic

                     χ² obtained the best correlation with the performance of the CD
                     classifiers, achieving scores >70% in 32 cases

                     cosine obtained correlation scores >70% in 25 cases

                     KL obtained correlation scores >70% in 24 cases


             => The χ² test is the best measure for quantifying the distributional
                differences between KSs and Twitter.
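
The three measures can be sketched as follows for two bags of words. This is an illustration under assumptions: the slides do not give the exact formulations, so the code uses a common Pearson-style χ² on additively smoothed relative frequencies, and the token lists are made up.

```python
import math
from collections import Counter


def term_distribution(tokens, vocab, alpha=1.0):
    """Additively smoothed relative frequencies over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return [(counts[t] + alpha) / total for t in vocab]


def chi_square(p, q):
    """Chi-square statistic of p against the reference distribution q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps q non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def cosine_similarity(p, q):
    """Cosine of the angle between the two frequency vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)


# Made-up token lists standing in for one topic in a KS and in Twitter.
ks_tokens = "earthquake rescue relief earthquake victims aid".split()
tweet_tokens = "earthquake omg people dying rescue".split()

vocab = sorted(set(ks_tokens) | set(tweet_tokens))
p_ks = term_distribution(ks_tokens, vocab)
p_tw = term_distribution(tweet_tokens, vocab)

scores = {
    "chi2": chi_square(p_tw, p_ks),
    "kl": kl_divergence(p_tw, p_ks),
    "cosine": cosine_similarity(p_tw, p_ks),
}
print(scores)
```

χ² and KL are dissimilarities (0 for identical distributions), whereas cosine is a similarity (1 for identical distributions), so their signs differ when correlated with classifier performance.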




Conclusions and Future Work

             We presented a first study towards understanding the usefulness of KSs in TC of
             tweets at various granularities: lexical features (BoW) and entity features (BoE)


             Our main findings are:
                  In the absence of any annotated tweets, applying these CD classifiers is beneficial

                  Of the two KSs, Freebase topics seem to be much closer to the Twitter topics than
                  the DBpedia topics.

                  The two KSs contain complementary information

                  For the CD classifiers, on average BoW features were more useful than BoE features

                  For the Twitter classifiers, on average BoE features were more useful than BoW
                  features

                  We found the χ² test to be the best measure for quantifying the distributional
                  differences between KSs and Twitter.


             Our future work will focus on building more accurate TC classifiers and
             investigating better similarity measures


Corpus Analysis - Size of vocabulary




             [Figure: bar chart of vocabulary size for the TGT, Sc.DB, Sc.FB and Sc.DB-FB datasets]


Understanding the results - Number of unique entities

                  Number of entities in the source (Sc. DB, Sc. FB, Sc. DB-FB) and
                  target (TGT) datasets after pre-processing:
                  the TGT dataset contains 1.73 ± 0.35 entities per tweet
                  the Sc. DB dataset contains 22.24 ± 1.44 entities per article
                  the Sc. FB dataset contains 8.14 ± 5.78 entities per article
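
A small sketch of how such per-document figures could be computed. It assumes that the ± values are sample standard deviations and that entities are counted once per document; the slide states neither, and the documents below are invented.

```python
from statistics import mean, stdev


def entities_per_document(docs):
    """Mean and sample standard deviation of unique entities per document."""
    counts = [len(set(entities)) for entities in docs]
    return mean(counts), stdev(counts)


# Invented example: each inner list holds the entities extracted from one tweet.
tweets = [
    ["Haiti", "Red Cross"],
    ["NATO"],
    ["Obama", "White House", "Obama"],  # the repeated entity is counted once
]

avg, sd = entities_per_document(tweets)
print(f"{avg:.2f} ± {sd:.2f} entities/tweet")
```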

             [Figure: bar chart of the number of unique entities for the TGT, Sc.DB, Sc.FB and Sc.DB-FB datasets]

Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets #KECSM 2012 #ISWC2012

  • 1. Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets
      Andrea Varga, Amparo E. Cano and Fabio Ciravegna
      1 Organisations Information and Knowledge (OAK) Research Group, University of Sheffield
      2 Knowledge Management Institute (KMI), Open University
      KECSM 2012/ISWC 2012, Nov 12, 2012
      Sections: Motivation, Research question, State-of-the-art, Methodology, Results, Conclusions and Future Work
  • 2. Outline
      1 Motivation
      2 State-of-the-art
      3 Methodology
      4 Results
      5 Conclusions and Future Work
  • 3. Why classify tweets into topics?
      Topic classification (TC) of tweets can be important for multiple applications:
      Information Retrieval
      Recommendation
      Emergency responses, etc.
      Topic name                       Example tweets
      Disaster&Accident (DisAcc)       happening accident people dying could phone ambulance wakakkaka xd
      Entertainment&Culture (EntCult)  google adwords commercial greeeat enjoyed watching greeeeeat day
      Politics (Pol)                   quoting military source sk media reports deployed rocket launchers decoys real
      Sports (Sports)                  ravens good position games left browns bengals playoffs
  • 8. What are the challenges in Topic Classification (TC) of tweets?
      Special characteristics of tweets:
      the restricted size of a post (limited to 140 characters)
      the frequent use of misspellings and jargon
      the frequent use of abbreviations
      the use of non-standard English, reflected in vocabulary and writing style
      Topic name                       Example tweets
      Disaster&Accident (DisAcc)       happening accident people dying could phone ambulance wakakkaka xd
      Entertainment&Culture (EntCult)  google adwords commercial greeeat enjoyed watching greeeeeat day
      Politics (Pol)                   quoting military source sk media reports deployed rocket launchers decoys real
      Sports (Sports)                  ravens good position games left browns bengals playoffs
      => These characteristics pose additional challenges for traditional supervised machine learning approaches to building accurate topic classifiers for tweets
  • 13. Why are Social Knowledge Sources (KS) relevant to Twitter?
      Data bottleneck problem: we investigate an alternative approach, inspired by domain adaptation/transfer learning, that exploits the information in Social Knowledge Sources (DBpedia and Freebase) for TC of tweets
      Commonalities between KSs and Twitter:
      they are constantly edited by web users
      they are social and built in a collaborative manner
      they cover a large number of topics
      More importantly: KSs contain a large amount of annotated data on a large number of topics
  • 14. Research questions
      1 Are KSs relevant for topic classification of tweets?
      2 Which features make the KSs look more similar to Twitter?
      3 How similar or dissimilar are KSs to Twitter? Which similarity measure best quantifies the lexical changes between KSs and Twitter?
  • 15. State-of-the-art approaches for TC of tweets
      Using DBpedia for topic classification of tweets:
      Wikify (R. Mihalcea and A. Csomai, 2007)
      Enriching unstructured text with Wikipedia links (D. Milne and I. H. Witten, 2008)
      Tagme (P. Ferragina and U. Scaiella, 2010)
      Topical Social Sensor (P. K. P. N. Mendes et al., 2010)
      Vector space model (Oscar Munoz-Garcia et al., 2011)
      Using Freebase for topic classification of tweets:
      Clustering based approach (S. P. Kasiviswanathan et al., 2011)
      Our main contribution:
      Understanding the similarity between KSs and Twitter
      Exploring multiple KSs (DBpedia + Freebase)
      Investigating various statistical metrics for quantifying the similarity between KSs and Twitter
  • 19. Methodology followed
      1 Collecting Data from KSs
      2 Building Cross-Domain (CD) Topic Classifier of Tweets
      3 Measuring Distributional Changes Between KSs and Twitter
      [Diagram boxes: Sc. DB, Sc. FB, Sc. DB-FB; Retrieve articles; Retrieve tweets; Concept enrichment (twice); Build cross-domain classifier; Annotate tweets]
  • 20. Step 1: Collecting Data from KSs
      Twitter corpus collected in Abel et al. (2011): tweets posted between October 2010 and January 2011, annotated with 17 topics
      Random selection of 1,000 articles/tweets from DBpedia/Freebase/Twitter for each topic => 9,465 articles from DBpedia; 16,915 articles from Freebase; and 12,412 tweets
      Preprocessing: removal of hashtags, mentions and URLs from tweets; taking the top-1000 features for each topic
      [Pie charts: DBpedia multilabel frequency; Freebase multilabel frequency; Twitter multilabel frequency (share of items by number of topic labels; the slice values cannot be reliably matched to the charts in this extraction)]
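The preprocessing named on this slide could be sketched roughly as follows. The regular expressions, the lowercasing, and the choice to rank features by raw frequency are assumptions made for illustration; the slides do not say how hashtags, mentions and URLs were matched or how the top-1000 features were ranked.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not the authors' exact rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions and #hashtags, lowercase, collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.lower().split())


def top_k_features(docs, k=1000):
    """Return the k most frequent whitespace tokens over all documents.

    Ranking by raw frequency is an assumption; the slide only says
    "top-1000 features for each topic".
    """
    counts = Counter(tok for doc in docs for tok in doc.split())
    return [tok for tok, _ in counts.most_common(k)]


if __name__ == "__main__":
    print(clean_tweet("Accident on M1 @police #traffic http://t.co/x"))
```

In this sketch the whole hashtag token is dropped rather than only the `#` sign, which is one of two plausible readings of "removal of hashtags".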
  • 21. Step 1: Collecting Data from KSs
      Topics: Business_Finance, Disaster_Accident, Education, Entertainment, Environment, Health, Human Interest, Labor, Law_Crime, Technology_IT, Religion, Social Issues, Weather, Sports, War_Conflict, Politics
      Retrieval of articles for a given topic (e.g. Politics):
      from DBpedia: executing SPARQL queries for retrieving category names containing the topic name, e.g. Category:Politics_of_the_United_States, Category:National_Democratic_Party_Egypt_politicians, etc.
      from Freebase: accessing the Text Service API for articles belonging to the topic; for underspecified topics/domains, consider articles containing the topic in their titles
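The slides do not show the SPARQL query itself. As a hedged illustration of "retrieving category names containing the topic name", the hypothetical helper below builds one possible query string. It assumes that DBpedia categories are typed as `skos:Concept` and that a case-insensitive substring match on the category URI is an acceptable stand-in for the authors' matching rule.

```python
def category_query(topic: str, limit: int = 100) -> str:
    """Build a SPARQL query for categories whose URI contains `topic`.

    Hypothetical helper: the query shape is an assumption, not the
    query used by the authors.
    """
    needle = topic.lower().replace('"', "")  # keep the string literal well-formed
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT DISTINCT ?category WHERE {\n"
        "  ?category a skos:Concept .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?category)), "{needle}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(category_query("Politics"))
```

The function only builds the query text; sending it to an endpoint is left out on purpose.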
  • 24. Step 2: Building Cross-Domain (CD) Topic Classifier of Tweets
      Considering two different feature sets:
      BoW: tf.idf value of the words present in the examples (articles or tweets)
      BoE: tf.idf value of the words and entity+concept pairs present in the examples (articles or tweets)
      [Diagram boxes: Sc. DB, Sc. FB, Sc. DB-FB; Retrieve articles; Retrieve tweets; Concept enrichment (twice); Build cross-domain classifier; Annotate tweets]
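A minimal sketch of the two feature sets, assuming a plain tf.idf variant (relative term frequency times log inverse document frequency) and assuming that an entity+concept pair can be represented as one extra token such as `obama|Politician`. Both the weighting variant and the token format are illustrative assumptions; the slides name neither.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {feature: tf * idf} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each feature occurs.
    df = Counter(feat for doc in docs for feat in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {f: (c / len(doc)) * math.log(n_docs / df[f]) for f, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    # BoW: words only.
    bow_docs = [["obama", "visits", "egypt"], ["ravens", "beat", "browns"]]
    # BoE: the same words plus hypothetical entity|concept tokens.
    boe_docs = [
        bow_docs[0] + ["obama|Politician", "egypt|Country"],
        bow_docs[1] + ["ravens|SportsTeam", "browns|SportsTeam"],
    ]
    print(tfidf(bow_docs)[0])
    print(tfidf(boe_docs)[0])
```

With this variant a feature that occurs in every document gets weight 0, which is one reason real systems often use a smoothed idf instead.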
  • 25. Step 3: Measuring Distributional Changes Between KSs and Twitter
      Building a vector $\vec{d_s}$ for each source dataset (Sc.DB, Sc.FB, Sc.DB-FB) and a vector $\vec{d_t}$ for the target dataset (Twitter), consisting of the TF-IDF weights for either the BoW or the BoE feature set
      Statistical measures applied ($F_S$ and $F_T$ are the source and target feature sets):
      χ² test: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed value for a feature, while $E$ is the expected value calculated on the basis of the joint corpus
      Kullback-Leibler symmetric distance: $KL(\vec{d_s} \,\|\, \vec{d_t}) = \sum_{f \in F_S \cup F_T} \left( \vec{d_s}(f) - \vec{d_t}(f) \right) \log \frac{\vec{d_s}(f)}{\vec{d_t}(f)}$
      cosine similarity: $cosine(\vec{d_s}, \vec{d_t}) = \frac{\sum_{k=1}^{|F_S \cup F_T|} \vec{d_s}(f_{S_k}) \times \vec{d_t}(f_{T_k})}{\sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left( \vec{d_s}(f_{S_k}) \right)^2} \times \sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left( \vec{d_t}(f_{T_k}) \right)^2}}$
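The three measures on this slide can be sketched over sparse feature vectors stored as dictionaries. Several details below are assumptions rather than the authors' implementation: the χ² expected values use the usual two-corpus formulation (the joint weight of a feature split in proportion to corpus totals), and the symmetric KL adds a small `eps` and normalises both vectors so that the logarithm is defined for features missing from one side.

```python
import math


def _features(ds, dt):
    """Union of the source and target feature sets."""
    return set(ds) | set(dt)


def chi_square(ds, dt):
    """Sum of (O - E)^2 / E over both corpora; E comes from the joint corpus."""
    total_s, total_t = sum(ds.values()), sum(dt.values())
    if total_s + total_t == 0:
        return 0.0
    stat = 0.0
    for f in _features(ds, dt):
        o_s, o_t = ds.get(f, 0.0), dt.get(f, 0.0)
        joint = o_s + o_t
        e_s = joint * total_s / (total_s + total_t)
        e_t = joint * total_t / (total_s + total_t)
        if e_s > 0:
            stat += (o_s - e_s) ** 2 / e_s
        if e_t > 0:
            stat += (o_t - e_t) ** 2 / e_t
    return stat


def symmetric_kl(ds, dt, eps=1e-9):
    """Sum over features of (p - q) * log(p / q), with eps smoothing (assumption)."""
    feats = _features(ds, dt)
    z_s = sum(ds.get(f, 0.0) + eps for f in feats)
    z_t = sum(dt.get(f, 0.0) + eps for f in feats)
    total = 0.0
    for f in feats:
        p = (ds.get(f, 0.0) + eps) / z_s
        q = (dt.get(f, 0.0) + eps) / z_t
        total += (p - q) * math.log(p / q)
    return total


def cosine(ds, dt):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(ds.get(f, 0.0) * dt.get(f, 0.0) for f in _features(ds, dt))
    norm_s = math.sqrt(sum(v * v for v in ds.values()))
    norm_t = math.sqrt(sum(v * v for v in dt.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0
```

Note that χ² and KL are distances (0 for identical vectors), while cosine is a similarity (1 for identical direction), so their signs have to be handled consistently when they are compared against classifier performance.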
  • 26. Experimental setting
      1-vs-all approach: an individual CD classifier built for each topic; SVM classifiers; 5-fold cross-validation
      Sc.DB, Sc.FB, Sc.DB-FB classifiers: trained on the full KS data, evaluated on 20% of the Twitter data (2,482 tweets)
      TGT classifier: trained on 80% of the Twitter data, evaluated on 20% of the Twitter data (2,482 tweets)
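The data handling behind this setting can be sketched without committing to a particular SVM library. The two helpers below are hypothetical: one turns multi-label topic annotations into binary targets for a single 1-vs-all classifier, the other produces five train/test splits in which each test fold holds about 20% of the items (20% of 12,412 tweets is roughly the 2,482 quoted on the slide). The shuffling seed and the fold construction are assumptions.

```python
import random


def one_vs_all_labels(doc_topics, topic):
    """Binary targets for one topic: 1 if the document carries it, else 0."""
    return [1 if topic in topics else 0 for topics in doc_topics]


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists; each test fold holds about 1/k of the items."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    topics = [{"Pol"}, {"Sports", "Pol"}, {"Edu"}]
    print(one_vs_all_labels(topics, "Pol"))
    for train, test in k_fold_indices(10):
        print(len(train), len(test))
```

In the cross-domain scenarios only the test folds of the Twitter data would be used, since those classifiers are trained on the KS articles.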
  • 34. Findings - Classification performance in F1 measure
      Q1: Which KS better reflects the lexical variation in Twitter?
      Sc.DB-FB showed the best performance, followed by Sc.FB and Sc.DB
      F1 per topic (values read from the chart labels; column order follows the chart legend):
      Topic     Sc.DB(BoE)  Sc.DB(BoW)  Sc.FB(BoE)  Sc.FB(BoW)  Sc.DB-FB(BoE)  Sc.DB-FB(BoW)  TGT(BoW)  TGT(BoE)
      BusFi     21.40       21.20       11.50       15.30       18.60          19.10          47.50     46.50
      EntCult   15.30       15.10       15.50       16.20       19.50          20.20          42.20     43.10
      Religion  23.40       25.10       14.40       14.70       21.00          20.40          58.40     58.50
      Health    28.60       30.30       25.60       24.70       26.90          25.40          51.70     51.80
      Pol       22.20       21.00       27.80       26.80       24.20          26.80          45.10     44.90
      Law       0.90        2.70        16.80       17.80       14.80          13.30          46.80     46.40
      HospRecr  1.40        2.30        17.20       19.50       11.30          13.90          41.60     42.60
      SocIssue  1.30        2.00        8.80        9.00        9.70           9.10           44.20     44.00
      DisAcc    8.30        9.70        14.50       14.50       21.00          21.10          57.50     59.20
      TechIT    1.60        2.40        18.40       18.60       12.40          9.90           57.40     58.00
      Env       15.20       14.20       2.20        2.20        8.90           8.40           46.60     48.30
      HumInt    1.10        1.40        2.00        1.60        1.50           2.20           33.60     34.20
      Weather   3.10        7.00        39.80       39.90       36.70          36.00          81.20     81.50
      Labor     1.40        1.30        31.90       31.90       30.10          29.90          79.90     79.40
      War       9.30        10.90       23.90       23.60       24.60          25.70          67.60     72.70
      Sports    10.30       11.70       26.50       26.20       26.20          26.00          60.10     59.20
      Edu       37.40       37.80       42.50       47.20       42.50          45.70          71.90     71.30
  • 35. Findings - Classification performance in F1 measure
      Q2: What feature makes the KSs look more similar to Twitter?
      BoW features were found better than BoE for the CD classifiers
      BoE features were found better than BoW for TGT
      [Chart: per-topic F1 values for the Sc.DB, Sc.FB, Sc.DB-FB and TGT classifiers with BoE and BoW features, identical to the values shown for Q1]
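For readers less familiar with the metric used on these slides, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single binary (1-vs-all) topic classifier; it returns a fraction, whereas the slides appear to report percentages. How the authors aggregated over folds is not stated and is not modelled here.

```python
def f1_score(gold, predicted):
    """F1 for binary labels (1 = topic present, 0 = topic absent)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold labels `[1, 1, 0, 0]` and predictions `[1, 0, 1, 0]` there is one true positive, one false positive and one false negative, so precision and recall are both 0.5 and F1 is 0.5.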
  • 37. Findings - Examining the number of annotations needed for the Twitter classifier to outperform Sc.DB-FB
      Investigated the impact of employing the Sc.DB-FB classifier over the Twitter classifier in terms of the number of annotations
      The performance of the Twitter classifier against the three CD classifiers over the full learning curve
      => In the absence of any annotated tweets, applying these CD classifiers is beneficial
  • 39. Findings - Examining the number of annotations needed for the Twitter classifier to outperform the CD classifiers
      Q3: How similar or dissimilar are KSs to Twitter posts, and which similarity measure better reflects the lexical changes between KSs and Twitter posts?
      Compared χ², KL-divergence and cosine for each topic:
      χ² obtained the best correlation with the performance of the CD classifiers, achieving scores >70% in 32 cases
      cosine obtained correlation scores >70% in 25 cases
      KL obtained correlation scores >70% in 24 cases
      => The χ² test is the best measure for quantifying the distributional differences between KSs and Twitter.
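The slides do not say which correlation coefficient was used. Assuming Pearson's r purely for illustration, relating a similarity measure to classifier performance could look like the sketch below; the input lists in the demo are made-up toy values, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


if __name__ == "__main__":
    # Toy numbers only: a distance-like score and an F1 score per setting.
    distances = [0.9, 0.7, 0.4, 0.2]
    f1_scores = [10.0, 15.0, 22.0, 30.0]
    print(round(pearson(distances, f1_scores), 3))
```

For a distance such as χ² or KL a strong relationship would show up as a negative coefficient, so a threshold like ">70%" presumably refers to the magnitude of the correlation; that reading is an assumption.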
  • 47. Conclusions and Future Work. We presented a first study towards understanding the usefulness of KSs in TC of tweets at various granularities: lexical features (BoW) and entity features (BoE). Our main findings are: (1) in the absence of any annotated tweets, applying these CD classifiers is beneficial; (2) of the two KSs, Freebase topics seem to be much closer to the Twitter topics than the DBpedia topics; (3) the two KSs contain complementary information; (4) for the CD classifiers, BoW features were on average more useful than BoE features; (5) for the Twitter classifiers, BoE features were on average more useful than BoW features; (6) we found the χ² test to be the best measure for quantifying the distributional differences between KSs and Twitter. Our future work will focus on building more accurate TC classifiers and investigating better measures.
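The χ² comparison between a knowledge source and Twitter mentioned in the conclusions could be sketched roughly as follows. This is only an illustration under stated assumptions: it treats each corpus as a flat list of tokens and computes the standard chi-square homogeneity statistic over a 2 x |V| term-count table. The function name `chi_square_statistic` and the toy inputs are hypothetical, and the slides do not confirm that the authors used this exact formulation.

```python
from collections import Counter


def chi_square_statistic(source_tokens, target_tokens):
    """Chi-square homogeneity statistic between two token lists.

    Assumption: each corpus is a flat list of tokens (words or entities).
    A larger value means the two term distributions differ more.
    """
    src = Counter(source_tokens)
    tgt = Counter(target_tokens)
    n_src = sum(src.values())
    n_tgt = sum(tgt.values())
    if n_src == 0 or n_tgt == 0:
        raise ValueError("both corpora must contain at least one token")

    total = n_src + n_tgt
    stat = 0.0
    for term in set(src) | set(tgt):
        term_total = src[term] + tgt[term]
        # Expected counts if both corpora shared one term distribution.
        exp_src = n_src * term_total / total
        exp_tgt = n_tgt * term_total / total
        stat += (src[term] - exp_src) ** 2 / exp_src
        stat += (tgt[term] - exp_tgt) ** 2 / exp_tgt
    return stat


if __name__ == "__main__":
    # Toy, made-up token lists; not data from the study.
    ks_tokens = ["earthquake", "rescue", "flood", "rescue"]
    tweet_tokens = ["flood", "rescue", "omg", "flood"]
    print(chi_square_statistic(ks_tokens, tweet_tokens))
```

The same function can be applied to BoW tokens or to BoE entity identifiers, since it only relies on counts. Because the raw statistic grows with corpus size, it is only comparable across corpus pairs of similar size unless it is normalised.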
  • 48. Corpus Analysis - Size of vocabulary. [Bar chart: vocabulary size for the TGT, Sc.DB, Sc.FB and Sc.DB-FB datasets; the category labels and bar values are not recoverable from the extracted text.]
  • 49. Understanding the results - Number of unique entities. Examining the number of entities in the source (Sc.DB, Sc.FB, Sc.DB-FB) and target (TGT) datasets after pre-processing: the TGT dataset contains 1.73 ± 0.35 entities/tweet; the Sc.DB dataset contains 22.24 ± 1.44 entities/article; the Sc.FB dataset contains 8.14 ± 5.78 entities/article. [Bar chart: number of unique entities for the TGT, Sc.DB, Sc.FB and Sc.DB-FB datasets; the category labels and bar values are not recoverable from the extracted text.]
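Per-document entity figures of the "mean ± spread" kind reported above could be computed along these lines. This is a minimal sketch: it assumes each document is already represented as a list of extracted entity identifiers and that the spread is a population standard deviation. The slide does not say which spread statistic or level of aggregation was used, and the function name and sample data are made up for illustration.

```python
from statistics import mean, pstdev


def entities_per_document(entity_lists):
    """Return (mean, population std dev) of entity counts per document.

    Assumption: entity_lists holds one list of entity identifiers per
    tweet or article, produced by some earlier entity-extraction step.
    """
    if not entity_lists:
        raise ValueError("need at least one document")
    counts = [len(entities) for entities in entity_lists]
    return mean(counts), pstdev(counts)


if __name__ == "__main__":
    # Toy, made-up documents; not data from the study.
    tweets = [["Barack_Obama"], ["NASA", "Mars"], []]
    avg, spread = entities_per_document(tweets)
    print(f"{avg:.2f} ± {spread:.2f} entities/tweet")
```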