Language-Independent Twitter Sentiment Analysis
Sascha Narr, Michael Hülfenhaus, Sahin Albayrak


Sascha Narr
Competence Center Information Retrieval & Machine Learning


KDML 2012, LWA, Dortmund, Germany
Overview



►1. Sentiment analysis on social media
►2. Creation of a multilingual evaluation dataset of tweets
►3. A language-independent sentiment labeling heuristic for semi-supervised learning
►4. Experiments on the multilingual dataset




           18. September 2012   Language-Independent Twitter Sentiment Analysis   2
1. Sentiment Analysis on Social Media


►   Why Sentiment Analysis?
       People’s opinions and sentiments about products and events are invaluable in large numbers
       Uses include market research, product feedback and more
       Sentiment analysis makes it possible to collect such data automatically

►   Why Twitter?
       400 million tweets are posted each day [1]
       Shorter text lengths encourage people to “just write” what they think
       Tweets are often informal and contain lots of opinions


                      [1]: http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/

1. Methods for Sentiment Classification

► Sentiment classification goals:
      Subjectivity: “Does the tweet contain an opinion?”
      Polarity: “Is the expressed opinion positive or negative?”
► Classifiers used:

      Naive Bayes, Maximum Entropy, Support Vector Machines
► Features used:

      n-grams, WordNet semantics, part-of-speech information

►   Tweet texts have unique properties:
       Informal, contain slang, emoticons, misspellings



1. Multilingual Sentiment Analysis

►Less than 40% of tweets are in English [1]
►Natural language processing methods are often designed specifically for one language

►   Increase the coverage of sentiment analysis by using a language-independent approach:
       No extra effort for additional languages
       Is the approach really effective for all languages?



                                  [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter


2. Creation of a Multilingual Evaluation Dataset


►   We created a hand-annotated sentiment evaluation dataset of over 12,000 tweets
       4 languages: English, German, French, Portuguese
►   Used the Amazon Mechanical Turk platform for annotation
►   Each tweet was annotated by 3 different workers:
       Labels: “positive”, “neutral”, “negative”
       Validation tweets were added to help ensure the quality of the annotations




2. Our Multilingual Evaluation Dataset

►   Observed a low inter-annotator agreement in our dataset
       Sentiment classification is a hard task, even for humans
       Tweets that humans disagree on are also harder to classify automatically
►   The dataset is publicly available for research purposes




              Table 1: Tweet counts for the complete annotated dataset




3. A Language-Independent Heuristic

► To train a sentiment classifier, a large amount of labeled training data is needed
      Such data can be obtained without human effort using a previously proposed heuristic
► The heuristic uses emoticons in tweets as noisy labels

►   Heuristic: If a tweet contains only positive emoticons, label its whole text as positive; if it contains only negative emoticons, label its whole text as negative.

►   Examples of emoticons we used:
           Positive:       :) :-) =) ;) :] :D ^-^ ^_^
           Negative:       :( :-( :(( -.- >:-( D: :/
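
The labeling heuristic above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the emoticon lists hold only the examples shown on this slide (with the circumflex faces written in plain ASCII), and the whitespace tokenization, the exact-token matching, and the removal of emoticons from the labeled text are all assumptions. The name `label_tweet` is hypothetical.

```python
# Emoticon examples taken from the slide; the complete lists are not given there.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def label_tweet(text):
    """Return (label, text_without_emoticons), or None if the tweet is unusable.

    A tweet is usable only if it contains emoticons of exactly one polarity.
    Assumption: emoticons are matched as whole whitespace-separated tokens,
    which avoids false hits such as ":/" inside "http://".
    """
    tokens = text.split()
    has_pos = any(tok in POSITIVE for tok in tokens)
    has_neg = any(tok in NEGATIVE for tok in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all, or both polarities are present.
        return None
    label = "positive" if has_pos else "negative"
    # Assumption: drop the emoticons so a classifier cannot simply memorize them.
    cleaned = " ".join(t for t in tokens if t not in POSITIVE and t not in NEGATIVE)
    return label, cleaned
```

Tweets for which `label_tweet` returns `None` would simply be skipped when building the training set.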


3. Heuristic for Semi-Supervised Learning

► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
► The number of tweets with emoticons differs among languages
     Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets




            Table 2: Number of tweets containing emoticons for each language




4. Experiments – Sentiment Classification

►   Data:
       Training: from ~800M random tweets of mixed languages:
           Filter for languages: English, German, French, Portuguese
           Use the emoticon heuristic to select and label training data
       Evaluation: 12,597 hand-annotated tweets (4 languages)

►   Setup:
       Classification: sentiment polarity only
       Classifier: Naive Bayes
       Features: 1-grams alone, and 1-grams combined with 2-grams
       Trained 4 single-language classifiers (en, de, fr, pt) and 1 classifier on the combined en+de+fr+pt data
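
The classifier setup can be illustrated with a small Naive Bayes over word n-grams. The slides name only the classifier family and the feature sets, so everything else here is an assumption: a multinomial model, add-one smoothing, lowercasing, whitespace tokenization, and skipping of n-grams unseen in training. The names `ngrams` and `NaiveBayes` are hypothetical, and the toy training data is invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=1):
    """Word n-grams of length 1..max_n (max_n=1: 1-grams; max_n=2: 1- and 2-grams)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, max_n=1):
        self.max_n = max_n
        self.class_counts = Counter()               # documents per class
        self.feature_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total = sum(self.feature_counts[label].values())
            for feat in ngrams(text, self.max_n):
                if feat not in self.vocab:
                    continue  # ignore n-grams never seen in training
                count = self.feature_counts[label][feat]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
texts = ["i love this", "what a great day", "i hate this", "what a terrible day"]
labels = ["positive", "positive", "negative", "negative"]
clf = NaiveBayes(max_n=2).fit(texts, labels)
```

In this sketch, a per-language classifier would be fitted on the heuristically labeled tweets of one language, and the combined classifier on the concatenation of all four training sets.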


4. Experiments: Evaluation Dataset

► 2 variations of our evaluation set were used in the experiments:
      agree-3: tweets for which all 3 annotators agreed on the sentiment
      agree-2: tweets for which at least 2 annotators agreed on the sentiment
► Baseline: always guess “positive” (there are more positive tweets than negative ones)
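
The two evaluation subsets and the baseline can be derived from the three worker labels per tweet roughly as follows. This is a sketch under assumptions: the data layout (a tweet id paired with a list of three labels) and the function names are hypothetical, and dropping tweets whose agreed label is “neutral” is an assumption based on the experiments classifying polarity only.

```python
from collections import Counter


def agreed_label(annotations, min_agree):
    """Return the label chosen by at least min_agree annotators, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Keep tweets with enough agreement on a polar label.

    Assumption: tweets whose agreed label is "neutral" are left out,
    because only positive vs. negative polarity is classified.
    """
    result = []
    for tweet_id, annotations in annotated:
        label = agreed_label(annotations, min_agree)
        if label in ("positive", "negative"):
            result.append((tweet_id, label))
    return result


def baseline_accuracy(eval_set):
    """Accuracy of always guessing "positive"."""
    hits = sum(1 for _, label in eval_set if label == "positive")
    return hits / len(eval_set)


# Invented annotations, for illustration only.
annotated = [
    ("t1", ["positive", "positive", "positive"]),
    ("t2", ["positive", "negative", "positive"]),
    ("t3", ["negative", "negative", "negative"]),
    ("t4", ["positive", "neutral", "negative"]),
    ("t5", ["neutral", "neutral", "neutral"]),
]
agree3 = build_eval_set(annotated, min_agree=3)
agree2 = build_eval_set(annotated, min_agree=2)
```

By construction every agree-3 tweet is also in agree-2, so agree-2 is the larger set and includes the tweets on which the annotators partly disagreed.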




               Table 3: Tweet counts for the evaluation datasets



4. Results – English Classifier

► Best results: English classifier using 1-grams, on the agree-3 set
      81.3% accuracy (trained on 500k tweets)
► Performance on the agree-2 set is consistently lower than on the agree-3 set

[Figure: results for the English (en) classifier]




4. Results – All Languages
[Figure: results per language, with panels for en, de, fr and pt]




4. Evaluation – All Languages Compared
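► Strong differences between languages
► Differences do not correlate with the number of emoticons in each language
► The emoticon heuristic is a better fit for some languages; this may depend on the style of expressing sentiment in each language
► Example: “muito engraçado kkkkkkkk” (Portuguese for “very funny”, with “kkkkkkkk” as written laughter instead of an emoticon)

[Figure: results compared across languages, with panels for en, de, fr and pt]

Table 3: Tweet counts containing emoticons for each language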



4. Evaluation – Multi-language Classifier
► Tested on the combined 4-language evaluation set
► Highest performance: 71.5% accuracy
      Slightly lower than using 4 individual classifiers (73.9% accuracy)
► The usefulness of a combined classifier can outweigh the performance degradation

[Figure: results for the combined en+de+fr+pt classifier]




Conclusions

►   We presented and evaluated a language-independent sentiment classification approach on 4 languages
        A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
        Performance is good across languages, but varies for each
        Classifiers need a very large number of tweets for training
        Mixed-language classifiers are viable

►   Future work:
        Currently we only classify sentiment polarity
        Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge

Language-Independent Twitter Sentiment Analysis




         Thanks for your attention!

                            Questions?



Contact


Sascha Narr, Dipl.-Inform.
Competence Center Information Retrieval & Machine Learning

DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst-Reuter-Platz 7
10587 Berlin

sascha.narr@dai-labor.de
Phone +49 (0) 30 / 314 – 74 138
Fax +49 (0) 30 / 314 – 74 003
www.dai-labor.de

Language-Independent Twitter Sentiment Analysis

  • 1. Language-Independent Twitter Sentiment Analysis Sascha Narr, Michael Hülfenhaus, Sahin Albayrak Sascha Narr Competence Center Information Retrieval & Machine Learning KDML 2012, LWA, Dortmund, Germany
  • 2. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 2
  • 3. Overview ►1. Sentiment analysis on social media ►2. Creation of a multilingual evaluation dataset of tweets ►3. A language-independent sentiment labeling heuristic for semi-supervised learning ►4. Experiments on the multilingual dataset 18. September 2012 Language-Independent Twitter Sentiment Analysis 3
  • 4. 1. Sentiment Analysis on Social Media ► Why Sentiment Analysis?  People’s opinions and sentiments about products and events in large numbers are invaluable:  Market research, product feedback and more  Sentiment Analysis allows to automatically collect such data ► Why Twitter?  400 Million tweets posted each day[1]  Shorter text lengths encourage people to “just write” what they think  Tweets are often informal and contain lots of opinions [1]: http://news.cnet.com/8301-1023 3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/ 18. September 2012 Language-Independent Twitter Sentiment Analysis 4
  • 5. 1. Methods for Sentiment Classification
    ► Sentiment classification goals:
      ▪ Subjectivity: “Does the tweet contain an opinion?”
      ▪ Polarity: “Is the expressed opinion positive or negative?”
    ► Classifiers used:
      ▪ Naive Bayes, Maximum Entropy, Support Vector Machines
    ► Features used:
      ▪ n-grams, WordNet semantics, part-of-speech information
    ► Tweet texts have unique properties:
      ▪ Informal; contain slang, emoticons, misspellings
  • 6. 1. Multilingual Sentiment Analysis
    ► Fewer than 40% of tweets are in English [1]
    ► Natural language processing methods are often designed specifically for one language
    ► Increase the coverage of sentiment analysis by using a language-independent approach:
      ▪ No extra effort for additional languages
      ▪ Is the approach really effective for all languages?
    [1]: http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter
  • 8. 2. Creation of a Multilingual Evaluation Dataset
    ► We created a hand-annotated sentiment evaluation dataset of over 12,000 tweets
      ▪ 4 languages: English, German, French, Portuguese
    ► Used the Amazon Mechanical Turk platform for annotation
    ► Each tweet was annotated by 3 different workers:
      ▪ Labels: “positive”, “neutral”, “negative”
      ▪ Added validation tweets to help ensure the quality of the annotations
  • 9. 2. Our Multilingual Evaluation Dataset
    ► Observed low inter-annotator agreement in our dataset
      ▪ Sentiment classification is a hard task, even for humans
      ▪ Tweets that humans disagree on are also harder to classify
    ► The dataset is publicly available for research purposes
    Table 1: Tweet counts for the complete annotated dataset
  • 11. 3. A Language-Independent Heuristic
    ► Training a sentiment classifier requires a large amount of labeled training data
      ▪ This data can be obtained without human effort using a previously proposed heuristic
    ► The heuristic uses emoticons in tweets as noisy labels
    ► Heuristic: If a tweet contains only positive emoticons, label its whole text as positive (and vice versa for negative).
    ► Examples of emoticons we used:
      ▪ Positive: :) :-) =) ;) :] :D ^-^ ^_^
      ▪ Negative: :( :-( :(( -.- >:-( D: :/
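The labeling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the emoticon lists contain only the examples shown on the slide, the function name `noisy_label` is hypothetical, and matching on whitespace-separated tokens is an assumption made here so that ":/" inside a URL such as "http://" is not mistaken for a negative emoticon.

```python
from typing import Optional

# Only the example emoticons listed on the slide; the complete lists
# used in the original work are not given in this transcript.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def noisy_label(tweet: str) -> Optional[str]:
    """Return 'positive' or 'negative', or None if the heuristic does not apply."""
    tokens = tweet.split()  # assumption: emoticons appear as separate tokens
    has_pos = any(tok in POSITIVE for tok in tokens)
    has_neg = any(tok in NEGATIVE for tok in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticons, or mixed emoticons: tweet is not used


print(noisy_label("great day :)"))        # positive
print(noisy_label("missed the bus :("))   # negative
print(noisy_label("not sure :) :("))      # None
```

Tweets for which the function returns None would simply be left out of the training data under this sketch.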
  • 12. 3. Heuristic for Semi-Supervised Learning
    ► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
    ► The number of tweets with emoticons differs among languages
      ▪ Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets
    Table 2: Number of tweets containing emoticons for each language
  • 14. 4. Experiments – Sentiment Classification
    ► Data:
      ▪ Training: from ~800M random tweets of mixed languages:
          ▪ Filter for languages: English, German, French, Portuguese
          ▪ Use the emoticon heuristic to select and label training data
      ▪ Evaluation: 12,597 hand-annotated tweets (4 languages)
    ► Setup:
      ▪ Classification: sentiment polarity only
      ▪ Classifier: Naive Bayes
      ▪ Features: 1-grams, and 1-grams combined with 2-grams
      ▪ Trained 4 classifiers for en, de, fr, pt, plus 1 classifier for combined en+de+fr+pt
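As a rough illustration of this setup, the sketch below trains a multinomial Naive Bayes polarity classifier on n-gram features, once with 1-grams and once with 1-grams plus 2-grams. Several details are assumptions that the slides do not specify: the lowercased whitespace tokenization, the add-one (Laplace) smoothing, the names `ngrams` and `NaiveBayes`, and the four invented training sentences, which stand in for emoticon-labeled tweets.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=1):
    """Lowercased whitespace tokens turned into all 1..max_n-grams."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumption)."""

    def __init__(self, max_n=1):
        self.max_n = max_n
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in ngrams(text, self.max_n):
                if feat in self.vocab:  # skip features never seen in training
                    count = self.feature_counts[label][feat]
                    score += math.log((count + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data standing in for emoticon-labeled training tweets.
train_texts = ["i love this phone", "what a great day",
               "i hate waiting", "this is awful"]
train_labels = ["positive", "positive", "negative", "negative"]

unigram_model = NaiveBayes(max_n=1).fit(train_texts, train_labels)
bigram_model = NaiveBayes(max_n=2).fit(train_texts, train_labels)
print(unigram_model.predict("i hate this"))
print(bigram_model.predict("what a great day"))
```

Under this sketch, a per-language classifier is trained by passing only that language's tweets to `fit`, and the combined en+de+fr+pt classifier by passing all of them together.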
  • 15. 4. Experiments: Evaluation Dataset
    ► 2 variations of our evaluation set for the experiments:
      ▪ agree-3: tweets for which all 3 annotators agreed on a sentiment
      ▪ agree-2: tweets for which at least 2 annotators agreed on a sentiment
    ► Baseline: always guess “positive” (there are more positive tweets than negative)
    Table 3: Tweet counts for the evaluation datasets
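The two evaluation subsets and the baseline can be sketched as follows. The function names and the five annotated examples are invented for illustration. Dropping tweets whose agreed label is “neutral” is an assumption made here because the experiments classify polarity only; the slides do not state how neutral tweets were handled.

```python
from collections import Counter


def agreed_label(annotations, min_agree):
    """Return the label chosen by at least `min_agree` annotators, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Keep tweets whose agreed label is a polarity label (assumption: neutral is dropped)."""
    kept = []
    for text, annotations in annotated:
        label = agreed_label(annotations, min_agree)
        if label in ("positive", "negative"):
            kept.append((text, label))
    return kept


def baseline_accuracy(eval_set):
    """Accuracy of the baseline that always guesses 'positive'."""
    return sum(label == "positive" for _, label in eval_set) / len(eval_set)


# Invented annotations: three worker labels per tweet.
annotated = [
    ("tweet a", ["positive", "positive", "positive"]),
    ("tweet b", ["positive", "positive", "neutral"]),
    ("tweet c", ["negative", "negative", "negative"]),
    ("tweet d", ["negative", "neutral", "positive"]),
    ("tweet e", ["neutral", "neutral", "neutral"]),
]

agree3 = build_eval_set(annotated, min_agree=3)  # all 3 annotators agree
agree2 = build_eval_set(annotated, min_agree=2)  # at least 2 annotators agree
print(len(agree3), baseline_accuracy(agree3))
print(len(agree2), baseline_accuracy(agree2))
```

By construction agree-3 is a subset of agree-2, so agree-2 additionally contains the tweets that one annotator labeled differently.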
  • 16. 4. Results – English Classifier
    ► Best results: English classifier using 1-grams, on the agree-3 set
      ▪ 81.3% accuracy (trained on 500k tweets)
    ► Performance on the agree-2 set is consistently lower than on the agree-3 set
    [Figure: en]
  • 17. 4. Results – All Languages
    [Figure panels: en, de, fr, pt]
  • 18. 4. Evaluation – All Languages Compared
    ► Strong differences between languages
    ► Differences do not correlate with the number of emoticons in each language
    ► The emoticon heuristic is a better fit for some languages; this may depend on the style in which sentiment is expressed in each language
    ► “muito engraçado kkkkkkkk” (Portuguese: “very funny”, followed by written laughter)
    [Figure panels: en, de, fr, pt]
    Table 3: Tweet counts containing emoticons for each language
  • 19. 4. Evaluation – Multi-language Classifier
    ► Tested on the combined 4-language evaluation set
    ► Highest performance: 71.5% accuracy
      ▪ Slightly lower than using 4 individual classifiers (73.9% accuracy)
    ► The usefulness of a combined classifier can outweigh the performance degradation
    [Figure: en+de+fr+pt]
  • 20. Conclusions
    ► We presented and evaluated a language-independent sentiment classification approach on 4 languages
      ▪ A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
      ▪ Good performance across languages, though it varies for each
      ▪ Classifiers need a very large number of tweets for training
      ▪ Mixed-language classifiers are viable
    ► Future work:
      ▪ Currently we only classify sentiment polarity
      ▪ Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge
  • 21. Language-Independent Twitter Sentiment Analysis
    Thanks for your attention! Questions?
  • 22. Contact
    Sascha Narr, Dipl.-Inform.
    Competence Center Information Retrieval & Machine Learning
    sascha.narr@dai-labor.de
    Phone +49 (0) 30 / 314 – 74 138
    Fax +49 (0) 30 / 314 – 74 003
    www.dai-labor.de
    DAI-Labor, Technische Universität Berlin
    Fakultät IV – Elektrotechnik & Informatik
    Sekretariat TEL 14, Ernst Reuter Platz 7, 10587 Berlin