Comparing user generated content
published in different social media sources
Óscar Muñoz-García, Carlos Navarro

@NLP can u tag #user_generated_content ?! via lrec-conf.org

26 May 2012
Introduction




 The growth of social media has populated the Web with valuable
      UGC that can be exploited for many interesting purposes
             E.g. explaining or predicting real world outcomes through opinion
              mining

 Advertising companies use social media content for market research
    By mining users’ interests for focusing advertisement actions
    By obtaining the opinion of customers about brands


 NLP lets us automate social media content analysis
    However, the text quality of UGC varies with the content
     source (e.g., blogs vs. Twitter)
    Such differences challenge existing NLP techniques



Introduction



     We show how the language used in UGC differs across social media sources
        By analysing the distribution of PoS categories in each source
     We evaluate the performance of three NLP techniques
        Language Identification
        Sentiment Analysis
        Topic Identification
     Social media sources analysed
             Blogs (e.g., Wordpress and Blogger posts)
             Forums
             Microblogs (e.g., Twitter)
             Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)
             Review Sites (e.g., Ciao and Dooyoo)
             Audio-visual content publishing sites (e.g., Youtube and Vimeo)
             News publishing sites (i.e., mainstream media)
             Other sites



Distribution of PoS categories




 Content analysed
   Corpus of 10,000 posts extracted from heterogeneous social media sources
      •  written in Spanish
      •  related to the telecommunications domain
 The distribution was obtained with an automatic tagger
   Tools used:
      •  PoS tagging: TreeTagger [Schmid, 1994] with a Spanish parameterisation
      •  Annotation pipeline: GATE [Cunningham et al., 2011]

 Categories identified
   Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …
   Secondary: common noun, proper noun, negation adverb, personal pronoun, …
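
A minimal sketch of how such a distribution can be computed once posts have been tagged. The (token, category) pairs below are a hypothetical stand-in for the tagger output; the function name and the sample post are assumptions made for illustration, not part of the original pipeline.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return the share of each PoS category as a fraction of all tokens."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical pre-tagged post; a real pipeline would obtain these pairs from the tagger.
tagged_post = [
    ("la", "determiner"), ("cobertura", "noun"), ("no", "adverb"),
    ("funciona", "verb"), ("bien", "adverb"),
]
distribution = pos_distribution(tagged_post)
```

Computing the distribution per source and comparing the resulting dictionaries gives tables like the ones on the following slides.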

       Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in
       Language Processing, Manchester, UK.

       Hamish Cunningham, Diana Maynard, Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield, Department of
       Computer Science, April.
Distribution of PoS categories


      Microblogs: determiners and prepositions are used to a lesser extent
        Length limitation (140 characters)
        Posts need to be written more concisely → grammatical categories that carry
          little meaning tend to be used less
                      News   Blogs   Video   Reviews   Microblogs   Forums   Other   Social networks
Nouns                  31%     30%     29%       23%          34%      22%     27%               33%
Adjectives              9%      8%      6%        8%           9%       7%      8%                6%
Adverbs                 2%      3%      3%        5%           4%       4%      4%                3%
Determiners            11%     10%      8%        8%           6%       8%      9%                7%
Conjunctions            6%      8%      7%       10%           6%      10%      9%                7%
Pronouns                2%      3%      5%        6%           5%       6%      4%                4%
Prepositions           15%     15%     12%       13%           8%      12%     13%               11%
Punctuation marks      11%      8%     13%        9%           8%       9%     10%               11%
Verbs                  12%     14%     17%       18%          19%      21%     16%               16%
Other particles         1%      1%      1%        1%           1%       1%      1%                1%

Distribution of PoS categories


      News and blogs present similar distributions
        Because of similar writing styles
        No limitations on the size of posts


Distribution of PoS categories




 Nouns
   Common and proper nouns present similar distributions for all sources
   The PoS tagger fails when proper nouns are written in lower case
                •   Especially in forums and reviews, where specific products are discussed
                •   Solution: use gazetteers
                            Improves entity detection
                            Domain dependent
            Foreign words are used less in news than in other sources because of the style rules
             of Spanish mainstream media
                •   Avoid foreign words, as far as possible, whenever a Spanish word exists
 Adjectives
   Adjectives of quantity are the most used (47%) in all the channels
                •   Cardinals (30%) are used more than ordinals (2%)
            Multiplicative, partitive and indefinite quantity adjectives are used more frequently
             in forums and review sites
                •   Due to quantitative evaluations and comparisons of products


Distribution of PoS categories




 Adverbs
   The distribution of adverbs of negation correlates with the size of
    the posts
                •   More used in channels with shorter texts
                •   Detecting negations is essential when performing sentiment analysis
 Conjunctions
   The share of coordinating conjunctions is higher in news and blogs
                •   More used in channels with longer texts
                •   Coordinating conjunctions are used to identify opinion chunks, as if they were
                    punctuation marks
 Pronouns
   The share of personal pronouns is higher in microblogs, reviews, forums
     and audio-visual content publishing sites
                •   Due to conversations between users vs. the narrative style of news and blogs
                •   Pronouns make it difficult to identify entities within opinions
                            Entities are not explicitly mentioned



Distribution of PoS categories




 Punctuation marks
   Full stops are used less in news
                •   Sentences are longer than in other sources
        Commas are used less in microblogs and audio-visual content sites
        Ellipses are used more in microblogs
                •   To denote unfinished sentences
                •   Automatically truncated messages
            Secondary punctuation marks are used less in microblogs
                •   These characters are hard to type on mobile devices
                •   Content length limitation
 Verbs
   More used in microblogs and forums
                •   Intentions and actions are expressed more often
            Past tenses are used less in microblogs
                •   Immediate experiences
            Infinitives are used more in microblogs

Performance of Language Identification




 Content analysed
            3,368 tweets
            2,768 posts extracted from other social media sources (not
             Twitter)
            Written in Spanish, Portuguese and English


 Technique used
            Implementation of an existing text categorization algorithm
                •   Analysis of the frequency of character n-grams within documents
                    [Cavnar and Trenkle, 1994]

       Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis
       and Information Retrieval (pp. 161-175).
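
The n-gram frequency approach can be sketched as follows: each language gets a ranked profile of its most frequent character n-grams, and a document is assigned to the language whose profile is closest under a rank-difference ("out-of-place") distance. This is a simplified illustration, not the implementation evaluated here; the function names, the `_` padding character, the parameter defaults and the toy training texts are assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    missing_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else missing_penalty
        for gram, rank in doc_profile.items()
    )


def identify_language(text, lang_profiles):
    """Return the language whose profile has the smallest distance to the text."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training texts (hypothetical); real profiles need far more text per language.
TRAINING = {
    "es": "el teléfono no funciona y la cobertura es muy mala",
    "pt": "o telefone não funciona e a cobertura é muito ruim",
    "en": "the phone does not work and the coverage is very bad",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
```

Short posts yield small, noisy document profiles, which is one plausible reason for tweets being harder to classify than longer texts.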



Performance of Language Identification


 Language identification method




Performance of Language Identification


 Evaluation Results
   Overall accuracy
                •    Twitter: 93.02%
                •    Other sources: 96.76%
            Kappa
                •    Twitter: 0.844
                •    Other sources: 0.916
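
For readers unfamiliar with the two measures, the sketch below computes overall accuracy and a kappa statistic from a confusion matrix. The slides do not say which kappa variant was used; Cohen's kappa (agreement between the classifier and the gold annotation, corrected for chance) is assumed here, and the example matrix is invented.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of items with gold label i that were predicted as label j."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, given the marginal label frequencies.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Invented two-class example: 85 of 100 items fall on the diagonal.
accuracy, kappa = accuracy_and_kappa([[45, 5], [10, 40]])
```

Because kappa discounts agreement that would occur by chance, it can be much lower than accuracy when one class dominates.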



 Normalizing tweets does not improve performance
   Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
                1.    Delete references to users at the beginning of the tweet
                2.    Delete “RT @user:” sequences
                3.    Delete hash tags found at the end of the tweet
                4.    Delete “#” at the beginning of hash tags
                5.    Delete URLs
                6.    Delete “…” followed by a URL
       Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural
       Language Processing (ICON-2010).
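
The six normalization steps can be approximated with regular expressions, as in the sketch below. The patterns and the order in which they are applied are assumptions (for instance, the ellipsis-plus-URL rule runs before the generic URL rule so that the ellipsis is not left behind); they are not taken from the cited work.

```python
import re

URL = r"https?://\S+"


def normalize_tweet(text):
    """Apply the six syntactic normalization steps to a tweet (approximate regex version)."""
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)            # step 2: "RT @user:" sequences
    text = re.sub(r"^(?:@\w+[:,]?\s+)+", "", text.strip())  # step 1: leading user references
    text = re.sub(r"(?:…|\.\.\.)\s*" + URL, "", text)       # step 6: ellipsis followed by a URL
    text = re.sub(URL, "", text)                            # step 5: remaining URLs
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)            # step 3: hash tags at the end
    text = re.sub(r"#(\w+)", r"\1", text)                   # step 4: "#" of in-text hash tags
    return re.sub(r"\s+", " ", text).strip()                # tidy leftover whitespace


example = "RT @ana: @luis el #movil nuevo va genial… http://t.co/abc #telefonia #oferta"
normalized = normalize_tweet(example)  # "el movil nuevo va genial"
```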
Performance of Sentiment Analysis

     Content analysed
       1,859 tweets and 1,847 posts extracted from other social media sources (not
         Twitter), written in Spanish
     Technique used
       Matching of linguistic expressions based on a lexicon
           •  Each expression is a sequence of (lemma, PoS) pairs
                            E.g. “Your brand is cool!” matches {(Σ,Noun), (‘be’,Verb), (‘cool’,Adjective)}
            Kinds of expressions
               •   For detecting subjectivity (20 expressions)
                            Usually include specific verbs
                •   For detecting the sentiment of opinions (1,480 expressions)
                            Negative expressions add a value in {-2,-1} to the overall sentiment
                            Positive expressions add a value in {1,2} to the overall sentiment
                •   For reversing sentiment (22 expressions)
                            Include negations
                            Multiply the detected sentiment by -1
                •   For augmenting or reducing sentiment (32 expressions)
                            Usually include adverbs
                            Multiply the detected sentiment by 1.5 or 0.75
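
A rough sketch of how this kind of lexicon matching and scoring could work. The lexicon entries are invented, and the scoping rule (a reversing or scaling expression only affects the polarity expression that directly follows it) is an assumption; the slides do not describe how modifiers are scoped.

```python
WILDCARD = None  # stands for the Σ placeholder: any lemma with the given PoS


def matches_at(tokens, start, expression):
    """True if the (lemma, PoS) expression matches the tokens beginning at index start."""
    window = tokens[start:start + len(expression)]
    if len(window) < len(expression):
        return False
    return all(
        pos == tok_pos and (lemma is WILDCARD or lemma == tok_lemma)
        for (lemma, pos), (tok_lemma, tok_pos) in zip(expression, window)
    )


# Hypothetical lexicon entries: (expression, kind, value).
LEXICON = [
    ([("cool", "Adjective")], "polarity", 1),
    ([("awful", "Adjective")], "polarity", -2),
    ([("not", "Adverb")], "reverse", -1),
    ([("very", "Adverb")], "scale", 1.5),
    ([("slightly", "Adverb")], "scale", 0.75),
]


def sentiment(tokens):
    """Add up polarity values, applying any reversing/scaling expressions that directly precede them."""
    total, factor, i = 0.0, 1.0, 0
    while i < len(tokens):
        hit = next((entry for entry in LEXICON if matches_at(tokens, i, entry[0])), None)
        if hit is None:
            factor, i = 1.0, i + 1  # assumption: modifiers only affect what directly follows
            continue
        expression, kind, value = hit
        if kind == "polarity":
            total += factor * value
            factor = 1.0
        else:
            factor *= value
        i += len(expression)
    return total


cool = [("your", "Determiner"), ("brand", "Noun"), ("be", "Verb"), ("cool", "Adjective")]
not_very_cool = cool[:3] + [("not", "Adverb"), ("very", "Adverb"), ("cool", "Adjective")]
```

With these entries, `sentiment(cool)` gives 1.0 and `sentiment(not_very_cool)` gives -1.5, and the pattern from the slide, `[(WILDCARD, "Noun"), ("be", "Verb"), ("cool", "Adjective")]`, matches `cool` at index 1.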
Performance of Sentiment Analysis


 Evaluation Results
   Overall accuracy
                •    Twitter: 66.92%
                •    Other sources: 80.17%
            Kappa
                •    Twitter: 0.198
                •    Other sources: 0.31


 Normalizing tweets does not improve performance
   Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
                1.    Delete references to users at the beginning of the tweet
                2.    Delete “RT @user:” sequences
                3.    Delete hash tags found at the end of the tweet
                4.    Delete “#” at the beginning of hash tags
                5.    Delete URLs
                6.    Delete “…” followed by a URL
       Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural
       Language Processing (ICON-2010).



Performance of topic identification


     Description of the method [Muñoz-García et al., 2011]


   Input
   PoS Filtering
                • “torino”, “art”, “media”, “user”, “cloud”
   Topic Recognition
                • http://dbpedia.org/resource/Turin
                • http://dbpedia.org/resource/Art
                • http://dbpedia.org/resource/User_(computing)
   Language Filtering
                • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...

             Óscar Muñoz-García, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hernández, and Carlos Navarro. 2011. Identifying Topics in Social
             Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages
             81–86, Torino, Italy. Eurescom the European Institute for Research and Strategic Studies in Telecommunications GmbH.


Performance of topic identification




 PoS filtering example

   Input
                • But a hardware problem is more likely, especially if you use the phone a lot
                  while eating. The Blackberry's tiny trackball could be suffering the same
                  accumulation of gunk and grime that can plague a computer mouse that still
                  uses a rubber ball on the underside to roll around the desk.
   PoS filtering
                • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk,
                  rubber ball, gunk



Performance of topic identification


     Topic Recognition (Sem4Tags [García-Silva et al, 2010])

   PoS filtering
                • Blackberry, phone, trackball, computer, problem, grime, hardware,
                  mouse, desk, rubber ball, gunk
   Context Selection
                • Blackberry, {phone, hardware, trackball, mouse}
                • Computer, {hardware, mouse, problem, desk}
                • …
   Disambiguation
                • http://dbpedia.org/resource/BlackBerry
                • http://dbpedia.org/resource/Computer




           Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int.
           Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October


Performance of topic identification


 Context Selection
        For each keyword, select a set of up to 4 related keywords that help to
         disambiguate its meaning
        4 is the number of words above which the context does not add more resolving
         power to disambiguation [Kaplan, 1955]
        We compute semantic relatedness (active context) taking into account the
         co-occurrence of words in web pages [Gracia and Mena, 2009]
                                     Keyword                 Relatedness      Keyword       Relatedness
                                      phone                     0.347         hardware         0.347
                                      trackball                 0.311          mouse           0.311
                                     computer                   0.288            desk          0.287
                                      problem                   0.246         rubber ball      0.246
                                        grime                   0.190           gunk           0.168


                  Active context selection for blackberry keyword
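
Once relatedness scores are available, selecting the active context reduces to keeping the highest-scoring keywords. The sketch below uses the scores from the table above for the "blackberry" keyword; how those scores are computed from web co-occurrence is not shown, and the function name is an assumption.

```python
def active_context(relatedness, size=4):
    """Return up to `size` keywords with the highest relatedness to the target keyword."""
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _ in ranked[:size]]


# Relatedness of each candidate keyword to "blackberry", as listed in the table.
BLACKBERRY_RELATEDNESS = {
    "phone": 0.347, "hardware": 0.347, "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287, "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}
context = active_context(BLACKBERRY_RELATEDNESS)
```

The result is phone, hardware, trackball and mouse, which is the context shown for Blackberry in the topic recognition example.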
      A. Kaplan. 1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39-46

      Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In
      Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09,

Performance of topic identification




  Disambiguation Criteria
              OPTION 1: Most frequent sense for the ambiguous word
                 •        Determined by Wikipedia editors (the first link in a disambiguation page)
              OPTION 2: Vector space model
                     1.   A vector is built containing the keyword and its context
                     2.   A vector containing the top N terms is created from each candidate sense using
                          TF-IDF (Term Frequency and Inverse Document Frequency)
                     3.   Cosine similarity is used to determine which vectorised sense is most similar to
                          the vector associated with the keyword

  DBpedia resource                                 Definition                                    Similarity
  BlackBerry                                       is a line of mobile e-mail and smartphone     0.224
  Blackberry                                       is an edible fruit                            0.15
  BlackBerry_(song)                                is a song by the Black Crowes                 0.0
  BlackBerry_Township,_Itasca_County,_Minnesota    is a township in … Itasca County              0.0
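
The vector space option can be illustrated with a small TF-IDF and cosine similarity sketch. The term lists for each candidate sense are invented stand-ins for the text of the corresponding Wikipedia articles, and the weighting details (raw term frequency, logarithmic IDF, binary query vector) are assumptions, so the scores will not reproduce the values in the table.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: {name: list of terms}. Returns {name: {term: TF-IDF weight}}."""
    n_docs = len(docs)
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    return {
        name: {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in Counter(terms).items()
        }
        for name, terms in docs.items()
    }


def top_terms(vector, n):
    """Keep only the n highest-weighted terms of a vector."""
    return dict(sorted(vector.items(), key=lambda item: item[1], reverse=True)[:n])


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(keyword, context, senses, n=10):
    """Pick the candidate sense whose top-n TF-IDF terms are most similar to keyword + context."""
    query = {term: 1.0 for term in [keyword, *context]}
    vectors = {name: top_terms(vec, n) for name, vec in tfidf(senses).items()}
    return max(vectors, key=lambda name: cosine(query, vectors[name]))


# Hypothetical term lists for two candidate senses.
SENSES = {
    "BlackBerry": ["mobile", "email", "smartphone", "phone", "hardware", "trackball"],
    "Blackberry": ["edible", "fruit", "plant", "berry"],
}
best = disambiguate("blackberry", ["phone", "hardware", "trackball", "mouse"], SENSES)
```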



Performance of topic identification




 Evaluation settings
    Evaluated a random sample of 1,816 posts (18.16%)
    47 human evaluators
    Each post and its identified topics were shown to 3 different evaluators
    Evaluation options:
                 1.     The topic is not related to the post
                 2.     The topic is somewhat related to the post
                 3.     The topic is closely related to the post
                 4.     The evaluator does not have enough information to make a decision
               Fleiss’ kappa test
                 •      Strength of agreement for 2 evaluators = 0.826 (very good)
                 •      Strength of agreement for 3 evaluators = 0.493 (moderate)
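
For reference, Fleiss' kappa can be computed from a table that counts, for every evaluated item, how many evaluators chose each option. The sketch below follows the standard formula; the ratings are invented and do not come from the evaluation reported here.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of evaluators who assigned item i to category j.

    Every item must be rated by the same number of evaluators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Proportion of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Agreement among the evaluators on each individual item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    return (observed - expected) / (1 - expected)


# Invented example: 2 items, 3 evaluators, 2 categories, with a 2-vs-1 split on each item.
split_agreement = fleiss_kappa([[2, 1], [1, 2]])
```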




Performance of topic identification




 Evaluation Results




             Precision depends on the channel
                 •    From 59.19% for social networks
                               More misspellings
                               More common nouns
                 •    To 88.89% for review sites
                               Concrete products and brands
                               Proper nouns tend to have a Wikipedia entry
             The best context selection criterion also depends on the channel
                 •    Active context selection is better for microblogs and review sites
                 •    Using all the post keywords as context is better for blogs
                 •    No context selection is better for the rest of the cases (almost all the channels)
                               Naïve default sense selection is effective

Conclusions




 We have found differences among social media sources in every
      experiment
             The distribution of PoS categories varies across sources
                 •    Since PoS tagging is a preliminary step for many NLP techniques, the
                      performance of such techniques may be affected
                               E.g. using nouns as context for term disambiguation
                                      More nouns → More context
                               E.g. using adjectives and adverbs for sentiment analysis
          Language identification is less accurate for content extracted from
           Twitter
          Sentiment analysis is less accurate for content extracted from Twitter
          Precision of topic identification also depends on the source
                 •    No single context selection technique performs
                      best for all the sources



Thank you!
 oscar.munoz@havasmedia.com

 
Análisis de Sentimientos en un Corpus de Redes Sociales
Análisis de Sentimientos en un Corpus de Redes SocialesAnálisis de Sentimientos en un Corpus de Redes Sociales
Análisis de Sentimientos en un Corpus de Redes Sociales
 
Social TV, más allá de la audiencia. Participación y relaciones
Social TV, más allá de la audiencia. Participación y relacionesSocial TV, más allá de la audiencia. Participación y relaciones
Social TV, más allá de la audiencia. Participación y relaciones
 

Último

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Comparing user generated content published in different social media sources

  • 1. Comparing user generated content published in different social media sources
       Óscar Muñoz-García, Carlos Navarro
       @NLP can u tag #user_generated_content ?! via lrec-conf.org
       26 May 2012
  • 2. Introduction
       • The growth of social media has populated the Web with valuable user-generated content (UGC) that can be exploited for many purposes
           • E.g., explaining or predicting real-world outcomes through opinion mining
       • Advertising companies use social media content for market research
           • By mining users’ interests to focus advertising actions
           • By obtaining customers’ opinions about brands
       • NLP lets us automate social media content analysis
           • However, the text quality of UGC differs depending on the content source (e.g., blogs vs. Twitter)
           • Such differences challenge existing NLP techniques
  • 3. Introduction
       • We show how the language used in UGC differs across social media sources
           • By analysing the distribution of part-of-speech (PoS) categories in different sources
       • We evaluate the performance of three NLP techniques
           • Language identification
           • Sentiment analysis
           • Topic identification
       • Social media sources analysed
           • Blogs (e.g., WordPress and Blogger posts)
           • Forums
           • Microblogs (e.g., Twitter)
           • Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)
           • Review sites (e.g., Ciao and Dooyoo)
           • Audio-visual content publishing sites (e.g., YouTube and Vimeo)
           • News publishing sites (i.e., mainstream media)
           • Other sites
  • 4. Distribution of PoS categories
  • 5. Distribution of PoS categories
       • Content analysed
           • A corpus of 10,000 posts extracted from heterogeneous social media sources
               • Written in Spanish
               • Related to the telecommunications domain
       • The distribution was obtained with an automatic tagger
           • Tools used:
               • PoS tagging: TreeTagger [Schmid, 1994] with a Spanish parameterisation
               • Annotation pipeline: GATE [Cunningham et al., 2011]
       • Categories identified
           • Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …
           • Secondary: common noun, proper noun, negation adverb, personal pronoun, …
       Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
       Hamish Cunningham, Diana Maynard, Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield, Department of Computer Science, April.
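As a rough illustration of how such a distribution can be computed once posts have been tagged, the sketch below counts categories over (token, category) pairs. The function name `pos_distribution`, the category labels and the toy post are assumptions made for illustration; they are not TreeTagger or GATE output.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return {category: share of tokens}; the shares sum to 1.0."""
    counts = Counter(category for _, category in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical, hand-tagged toy post (not real tagger output).
sample_post = [
    ("mi", "determiner"),
    ("móvil", "noun"),
    ("no", "adverb"),
    ("funciona", "verb"),
]

distribution = pos_distribution(sample_post)
for category, share in sorted(distribution.items()):
    print(f"{category}: {share:.0%}")
```

Running the same function per source (news, blogs, microblogs, ...) would give one column of shares per source, which is the shape of the table reported in the slides.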
  • 6. Distribution of PoS categories
       • Microblogs: determiners and prepositions are used to a lesser extent
           • Post length is limited (140 characters)
           • Posts need to be written more concisely → grammatical categories that carry little meaning tend to be used less

                             News  Blogs  Video  Reviews  Microblogs  Forums  Other  Social networks
       Nouns                  31%    30%    29%      23%         34%     22%    27%              33%
       Adjectives              9%     8%     6%       8%          9%      7%     8%               6%
       Adverbs                 2%     3%     3%       5%          4%      4%     4%               3%
       Determiners            11%    10%     8%       8%          6%      8%     9%               7%
       Conjunctions            6%     8%     7%      10%          6%     10%     9%               7%
       Pronouns                2%     3%     5%       6%          5%      6%     4%               4%
       Prepositions           15%    15%    12%      13%          8%     12%    13%              11%
       Punctuation marks      11%     8%    13%       9%          8%      9%    10%              11%
       Verbs                  12%    14%    17%      18%         19%     21%    16%              16%
       Other particles         1%     1%     1%       1%          1%      1%     1%               1%
  • 7. Distribution of PoS categories
       • News and blogs present similar distributions
           • Because of similar writing styles
           • No limitations on the size of posts
  • 8. Distribution of PoS categories
       • Nouns
           • Common and proper nouns present similar distributions for all sources
           • The PoS tagger fails when proper nouns are written in lower case
               • Especially in forums and reviews, where discussions about specific products arise
               • Solution: use gazetteers
                   • Improves entity detection
                   • Domain dependent
           • Foreign words are used less in news than in other sources because of the style rules of Spanish mainstream media
               • Avoid foreign words, as far as possible, whenever a Spanish word exists
       • Adjectives
           • Adjectives of quantity are the most used (47%) in all the channels
               • Cardinals (30%) are used more than ordinals (2%)
           • Multiplicative, partitive and indefinite quantity adjectives are used more frequently in forums and review sites
               • Due to quantitative evaluations and comparisons of products
  • 9. Distribution of PoS categories
       • Adverbs
           • The distribution of adverbs of negation correlates with the size of the posts
               • More used in channels with shorter texts
               • Detecting negations is essential when performing sentiment analysis
       • Conjunctions
           • The share of coordinating conjunctions is higher in news and blogs
               • More used in channels with longer texts
               • Coordinating conjunctions are used to identify opinion chunks, as if they were punctuation marks
       • Pronouns
           • The share of personal pronouns is higher in microblogs, reviews, forums and audio-visual content publishing sites
               • Due to conversations between users vs. the narrative style of news and blogs
               • Pronouns make it difficult to identify entities within opinions
                   • Entities are not explicitly mentioned
  • 10. Distribution of PoS categories
       • Punctuation marks
           • Full stops are used less in news
               • Sentences are longer than in other sources
           • Commas are used less in microblogs and audio-visual content sites
           • Ellipses are used more in microblogs
               • To denote unfinished sentences
               • Automatically truncated messages
           • Secondary punctuation marks are used less in microblogs
               • Difficulty of typing these characters on mobile devices
               • Content length limitation
       • Verbs
           • More used in microblogs and forums
               • Intentions and actions are expressed more often
           • Past tenses are used less in microblogs
               • Immediate experiences
           • Infinitives are used more in microblogs
  • 11. Performance of language identification
  • 12. Performance of Language Identification
       • Content analysed
           • 3,368 tweets
           • 2,768 posts extracted from other social media sources (not Twitter)
           • Written in Spanish, Portuguese and English
       • Technique used
           • Implementation of an existing text categorization algorithm
               • Analysis of the frequency of character n-grams within documents [Cavnar and Trenkle, 1994]
       Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161-175).
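The slides do not show the implementation, so the following is only a minimal sketch of the general idea behind character n-gram language identification: build a ranked n-gram profile per language and pick the language whose profile is closest to the document's profile under a rank-based ("out-of-place") distance. The parameter values (`max_n`, `top_k`), the underscore padding and the tiny training sentences are assumptions for illustration, not the settings used in the evaluation.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank character n-grams (lengths 1..max_n) from most to least frequent."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Hypothetical toy training data; a real system would use large corpora per language.
SAMPLES = {
    "es": "el teléfono móvil no funciona y la batería dura muy poco",
    "en": "the mobile phone does not work and the battery runs out very quickly",
    "pt": "o telefone celular não funciona e a bateria dura muito pouco",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(identify_language("mi teléfono no funciona", PROFILES))
```

With profiles this small the result on unseen text is unreliable; short and noisy inputs such as tweets yield sparse profiles, which is consistent with the lower accuracy reported for Twitter.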
  • 13. Performance of Language Identification
       • Language identification method (figure)
  • 14. Performance of Language Identification
       • Evaluation results
           • Overall accuracy
               • Twitter: 93.02%
               • Other sources: 96.76%
           • Kappa
               • Twitter: 0.844
               • Other sources: 0.916
       • Normalizing tweets does not improve performance
           • Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
               1. Delete references to users at the beginning of the tweet
               2. Delete “RT @user:” sequences
               3. Delete hashtags found at the end of the tweet
               4. Delete “#” at the beginning of hashtags
               5. Delete URLs
               6. Delete “…” followed by a URL
       Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).
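The six normalization steps can be approximated with regular expressions. This is a sketch under assumptions: the patterns and the function name `normalize_tweet` are illustrative, and the steps are applied in a different order than listed (retweet markers before leading mentions, and "… + URL" before plain URLs) so that earlier deletions do not hide later matches.

```python
import re

URL = r"https?://\S+"


def normalize_tweet(text):
    """Apply the six syntactic normalization steps to a single tweet."""
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)        # step 2: "RT @user:" sequences
    text = re.sub(r"^\s*(?:@\w+[:,]?\s+)+", "", text)   # step 1: leading user references
    text = re.sub(r"(?:…|\.\.\.)\s*" + URL, "", text)   # step 6: ellipsis followed by a URL
    text = re.sub(URL, "", text)                        # step 5: remaining URLs
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)        # step 3: hashtags at the end
    text = re.sub(r"#(\w+)", r"\1", text)               # step 4: keep the word, drop "#"
    return re.sub(r"\s+", " ", text).strip()            # tidy leftover whitespace


print(normalize_tweet("RT @ana: @luis el #iphone no funciona… http://t.co/abc #fail"))
# prints: el iphone no funciona
```

Hashtags inside the sentence are kept as words because they usually carry content, while trailing hashtags are treated as labels and dropped.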
  • 15. Performance of sentiment analysis
  • 16. Performance of Sentiment Analysis
       • Content analysed
           • 1,859 tweets and 1,847 posts extracted from other social media sources (not Twitter), written in Spanish
       • Technique used
           • Matching of linguistic expressions based on a lexicon
               • Each expression is a sequence of (lemma, PoS) pairs
                   • E.g. “Your brand is cool!” matches {(Σ, Noun), (‘be’, Verb), (‘cool’, Adjective)}
           • Kinds of expressions
               • For detecting subjectivity (20 expressions)
                   • Usually include specific verbs
               • For detecting the sentiment of opinions (1,480 expressions)
                   • Negative expressions add a value in {-2, -1} to the overall sentiment
                   • Positive expressions add a value in {1, 2} to the overall sentiment
               • For reversing sentiment (22 expressions)
                   • Include negations
                   • Multiply the detected sentiment by -1
               • For augmenting or reducing sentiment (32 expressions)
                   • Usually include adverbs
                   • Multiply the detected sentiment by 1.5 or 0.75
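A simplified sketch of the scoring arithmetic described above. The lexicon entries are hypothetical, expressions are reduced to single (lemma, PoS) pairs instead of sequences, and the sketch assumes that reversers and modifiers apply to the next polarity expression only; the slides do not specify the scope.

```python
# Hypothetical mini-lexicon; the real one holds 1,480 polarity expressions.
POLARITY = {
    ("cool", "Adjective"): 2,
    ("good", "Adjective"): 1,
    ("bad", "Adjective"): -1,
    ("awful", "Adjective"): -2,
}
REVERSERS = {("not", "Adverb")}                    # multiply by -1
MODIFIERS = {("very", "Adverb"): 1.5,              # augment
             ("slightly", "Adverb"): 0.75}         # reduce


def sentiment_score(tagged_lemmas):
    """Sum polarity values, applying pending reversers/modifiers to the next match."""
    total = 0.0
    factor = 1.0
    for pair in tagged_lemmas:
        if pair in REVERSERS:
            factor *= -1
        elif pair in MODIFIERS:
            factor *= MODIFIERS[pair]
        elif pair in POLARITY:
            total += factor * POLARITY[pair]
            factor = 1.0  # modifiers are consumed by the expression they precede
    return total


# "Your brand is cool!" -> 2.0
print(sentiment_score([("brand", "Noun"), ("be", "Verb"), ("cool", "Adjective")]))
# "It is not very good" -> -1 * 1.5 * 1 = -1.5
print(sentiment_score([("be", "Verb"), ("not", "Adverb"),
                       ("very", "Adverb"), ("good", "Adjective")]))
```

The second example shows why negation detection matters: missing the reverser would flip the sign of the result.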
  • 17. Performance of Sentiment Analysis
       • Evaluation results
           • Overall accuracy
               • Twitter: 66.92%
               • Other sources: 80.17%
           • Kappa
               • Twitter: 0.198
               • Other sources: 0.31
       • Normalizing tweets does not improve performance
           • Syntactic normalization of Twitter messages [Kaufmann and Kalita, 2010]
               1. Delete references to users at the beginning of the tweet
               2. Delete “RT @user:” sequences
               3. Delete hashtags found at the end of the tweet
               4. Delete “#” at the beginning of hashtags
               5. Delete URLs
               6. Delete “…” followed by a URL
       Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).
  • 18. Performance of topic identification
  • 19. Performance of topic identification
       • Description of the method [Muñoz-García et al., 2011]
           • Input
           • PoS Filtering: “torino”, “art”, “media”, “user”, “cloud”
           • Topic Recognition:
               • http://dbpedia.org/resource/Turin
               • http://dbpedia.org/resource/Art
               • http://dbpedia.org/resource/User_(computing)
           • Language Filtering: “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
       Óscar Muñoz-García, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hernández, and Carlos Navarro. 2011. Identifying Topics in Social Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages 81–86, Torino, Italy. Eurescom the European Institute for Research and Strategic Studies in Telecommunications GmbH.
  • 20. Performance of topic identification
       • PoS filtering example
           • Input: But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.
           • PoS filtering: Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
  • 21. Performance of topic identification
       • Topic Recognition (Sem4Tags [García-Silva et al., 2010])
           • PoS filtering: Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
           • Context Selection:
               • Blackberry, {phone, hardware, trackball, mouse}
               • Computer, {hardware, mouse, problem, desk}
               • …
           • Disambiguation:
               • http://dbpedia.org/resource/BlackBerry
               • http://dbpedia.org/resource/Computer
       Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int. Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October
  • 22. Performance of topic identification
       • Context Selection
           • For each keyword, a set of up to 4 related keywords is selected to help disambiguate its meaning
           • 4 is the number of words above which the context adds no further resolving power to disambiguation [Kaplan, 1955]
           • We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia et al, 2009]

       Active context selection for the “blackberry” keyword:

       Keyword        Relatedness
       phone                0.347
       hardware             0.347
       trackball            0.311
       mouse                0.311
       computer             0.288
       desk                 0.287
       problem              0.246
       rubber ball          0.246
       grime                0.190
       gunk                 0.168

       A. Kaplan. 1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39-46
       Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09,
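Given relatedness scores, selecting the active context amounts to keeping the top-scoring keywords. The sketch below uses the values from the table above; the function name `select_active_context` is an assumption, and computing the relatedness scores themselves (from web co-occurrence) is outside its scope.

```python
def select_active_context(relatedness, size=4):
    """Return the `size` keywords most related to the target keyword."""
    # sorted() is stable, so keywords with equal scores keep their input order.
    ranked = sorted(relatedness.items(), key=lambda kv: -kv[1])
    return [keyword for keyword, _ in ranked[:size]]


# Relatedness to the keyword "blackberry", as reported in the slide.
blackberry_relatedness = {
    "phone": 0.347, "hardware": 0.347,
    "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287,
    "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}

print(select_active_context(blackberry_relatedness))
# prints: ['phone', 'hardware', 'trackball', 'mouse']
```

The result matches the context shown for Blackberry in the Sem4Tags example: {phone, hardware, trackball, mouse}.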
  • 23. Performance of topic identification
       • Disambiguation Criteria
           • OPTION 1: Most frequent sense for the ambiguous word
               • Determined by Wikipedia editors (the first link in a disambiguation page)
           • OPTION 2: Vector space model
               1. A vector is built containing the keyword and its context
               2. A vector containing the top N terms is created from each candidate sense using TF-IDF (Term Frequency and Inverse Document Frequency)
               3. Cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword

       DBpedia resource                                 Definition                                   Similarity
       BlackBerry                                       Is a line of mobile e-mail and smartphone         0.224
       Blackberry                                       is an edible fruit                                0.15
       BlackBerry_(song)                                is a song by the Black Crowes                     0.0
       BlackBerry_Township,_Itasca_County,_Minnesota    Is a township in … Itasca County                  0.0
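Option 2 can be sketched in plain Python as follows. The sense descriptions are short hypothetical token lists rather than real DBpedia abstracts, the smoothed IDF formula is one common variant chosen here as an assumption, and the function names are illustrative, so the scores will not reproduce the similarities in the table.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: tf-idf weight} vector."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)  # smoothed IDF (assumption)
            for term, count in term_freq.items()
        }
    return vectors


def top_terms(vector, n):
    """Keep only the n highest-weighted terms of a sparse vector."""
    return dict(sorted(vector.items(), key=lambda kv: -kv[1])[:n])


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(keyword_context, senses, top_n=10):
    """Return (best sense, {sense: similarity}) for a keyword-plus-context token list."""
    docs = dict(senses)
    docs["__query__"] = keyword_context
    vectors = tfidf_vectors(docs)
    query = vectors.pop("__query__")
    scores = {name: cosine(query, top_terms(vec, top_n)) for name, vec in vectors.items()}
    return max(scores, key=scores.get), scores


# Hypothetical sense descriptions (stand-ins for text taken from each candidate resource).
senses = {
    "BlackBerry": ["mobile", "email", "smartphone", "phone", "hardware"],
    "Blackberry": ["edible", "fruit", "plant"],
    "BlackBerry_(song)": ["song", "band", "album"],
}
query = ["blackberry", "phone", "hardware", "trackball", "mouse"]

best, scores = disambiguate(query, senses)
print(best)
```

Senses that share no term with the keyword context get a similarity of 0.0, as for the song and the township in the table.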
  • 24. Performance of topic identification
       • Evaluation settings
           • Evaluated a random sample of 1,816 posts (18.16%)
           • 47 human evaluators
           • Each post and its identified topics were shown to 3 different evaluators
           • Evaluation options:
               1. The topic is not related to the post
               2. The topic is somehow related to the post
               3. The topic is closely related to the post
               4. The evaluator does not have enough information to make a decision
           • Fleiss’ kappa test
               • Strength of agreement for 2 evaluators = 0.826 (very good)
               • Strength of agreement for 3 evaluators = 0.493 (moderate)
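For reference, Fleiss' kappa can be computed from a table that counts, for each evaluated item, how many evaluators chose each option. The sketch below implements the standard formula; the example ratings are invented for illustration and are not the evaluation data.

```python
def fleiss_kappa(table):
    """table[i][j] = number of evaluators who assigned item i to category j."""
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from the overall category proportions.
    n_categories = len(table[0])
    p_categories = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)

    if p_expected == 1.0:  # all ratings fall in a single category
        return 1.0
    return (p_bar - p_expected) / (1 - p_expected)


# Invented example: 4 posts, 3 evaluators, 4 evaluation options per post.
ratings = [
    [0, 0, 3, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 0],
    [0, 2, 0, 1],
]
print(round(fleiss_kappa(ratings), 3))
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.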
  • 25. Performance of topic identification
       • Evaluation results
           • Precision depends on the channel
               • From 59.19% for social networks
                   • More misspellings
                   • More common nouns
               • To 88.89% for review sites
                   • Concrete products and brands
                   • Proper nouns tend to have a Wikipedia entry
           • The best context selection criterion also depends on the channel
               • Active context selection is better for microblogs and review sites
               • Considering all the post keywords as context is better for blogs
               • Using no context selection is better for the rest of the cases (almost all the channels)
           • Naïve default sense selection is effective
  • 26. Conclusions
  • 27. Conclusions
       • We have found differences among social media sources in every experiment
       • The distribution of PoS categories varies across sources
           • Since PoS tagging is a preliminary step for many NLP techniques, the performance of such techniques may be affected
               • E.g. using nouns as context for term disambiguation: more nouns → more context
               • E.g. using adjectives and adverbs for sentiment analysis
       • Language identification is less accurate for content extracted from Twitter
       • Sentiment analysis is less accurate for content extracted from Twitter
       • Precision of topic identification also depends on the source
           • With respect to context selection, no single technique performs best for all the sources