Building Large Arabic
Multi-domain Resources
for Sentiment Analysis
Hady ElSahar and Samhaa R. El-Beltagy
Center for Informatics Science, Nile University
CICLing 2015 – April 19, 2015
hadyelsahar@gmail.com
Agenda
• Problem Statement
• Building Multi-Domain Datasets for Sentiment Analysis
• Building Multi-Domain lexicons
• Experiments and Evaluation
• Mining experiments results
Problem Statement
Current resources for sentiment analysis suffer from many deficiencies:
• Small size
• Domain specificity
• Not publicly available
• Insufficient coverage of different Arabic dialects and non-standard terms
Problem Statement
Sentiment datasets: related work
Author | Dataset name | Size | Multi-domain | Publicly available
Rushdi-Saleh et al. | OCA | 500 | No | Yes
Abdul-Mageed & Diab | AWATIF | < 10K | Yes | No
Aly, M. & Atiya, A. | LABR | 63K | No | Yes
Eshrag Refaee et al. | Twitter Corpus | 8,868 | N/A | Yes
Problem Statement
Sentiment lexicons: related work
Author | Size | MSA / Dialect | Multi-domain | Publicly available
El-Beltagy et al. | 4K | MSA + Dialect | N/A | Yes
Abdul-Mageed & Diab (SANA) | 225K | MSA + Dialect | Yes | No
Badaro et al. | 150K | MSA only | N/A | Yes
Proposed solution
• Building large Arabic datasets and lexicons for sentiment analysis:
    • Large size
    • Multi-domain
    • Covering Arabic dialects
    • Well documented and tested for sentiment classification
    • Publicly available for everyone to use
Building Datasets
Building datasets from review content on the Internet
• Arabic review content on the Internet is scarce:
    • There are fewer Arabic-based e-commerce and review websites
    • Arabic speakers often write their reviews in English
English ***, do you speak it!
Building Datasets
Scraping Arabic review content from the Internet
Domains of the review websites scraped:
• Hotel reviews
• Restaurant reviews
• Product reviews
• Movie reviews
Building Datasets
• Normalize the different rating systems into three classes (positive, negative, and neutral) using heuristics.
• Automatic labeling of reviews.
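The normalization step above could be sketched roughly as follows. The slides do not state the actual heuristics, so the thresholds here (bottom 40% of a scale is negative, top 40% is positive, the middle is neutral) and the function name `normalize_rating` are assumptions made purely for illustration.

```python
def normalize_rating(rating, scale_min=1.0, scale_max=5.0):
    """Map a site-specific rating onto positive / neutral / negative.

    Assumed heuristic (not taken from the slides): rescale the rating
    to [0, 1], then treat <= 0.4 as negative, >= 0.6 as positive and
    everything in between as neutral.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= rating <= scale_max:
        raise ValueError("rating lies outside the declared scale")
    # Position of the rating within its own scale, from 0.0 to 1.0.
    position = (rating - scale_min) / (scale_max - scale_min)
    if position >= 0.6:
        return "positive"
    if position <= 0.4:
        return "negative"
    return "neutral"
```

Rescaling first lets one rule cover sites that use different scales, for example 1 to 5 stars versus 0 to 10 points.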
Building Datasets
• Remove redundant and spam reviews
• Remove contradicting reviews (similar text, different polarity)
• Remove duplicate reviews
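A minimal sketch of the last two cleaning steps is given below. It assumes that "similar text" can be approximated by an exact match after whitespace normalization, which is a simplification of whatever similarity test the authors used; spam filtering is not covered. The name `clean_reviews` is hypothetical.

```python
from collections import defaultdict


def clean_reviews(reviews):
    """Drop contradicting and duplicate reviews.

    `reviews` is a list of (text, label) pairs. Two reviews count as the
    same text when they match after collapsing whitespace (an assumption;
    the slides only say "similar text").
    """
    def key_of(text):
        return " ".join(text.split())

    # First pass: collect every label seen for each normalized text.
    labels_by_text = defaultdict(set)
    for text, label in reviews:
        labels_by_text[key_of(text)].add(label)

    # Second pass: keep the first copy of each non-contradicting text.
    seen = set()
    cleaned = []
    for text, label in reviews:
        key = key_of(text)
        if len(labels_by_text[key]) > 1:
            continue  # same text with different polarities
        if key in seen:
            continue  # duplicate of a review that was already kept
        seen.add(key)
        cleaned.append((text, label))
    return cleaned
```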
Datasets Statistics
Sizes of the extracted datasets
Statistic | Hotels | Restaurants | Movies | Products | ALL
#Reviews | 15579 | 11310 | 1524 | 14279 | 42692
#Unique Reviews | 15562 | 10940 | 1522 | 5092 | 33116
#Users | 13407 | 1639 | 416 | 7465 | 24653
#Items | 8100 | 4654 | 933 | 5906 | 19593
Datasets Statistics
Number of reviews for each class
Datasets Statistics
Number of tokens per review for each of the datasets
Building multi-domain lexicons
• Hand-crafting sentiment lexicons is a tedious task
• The proposed approach uses the feature selection and ranking provided by Support Vector Machines (SVMs)
• An SVM with an L1 regularization penalty yields sparse coefficient vectors
Coefficient vector of a trained support vector machine:
Feature | Gloss | Coefficient
لا يستحق | doesn't deserve | -0.532
سئ | bad | -0.52
فشل | failure | -0.4
… | … | …
حصل | happen | 0
مشهد | scene | 0
… | … | …
افضل | better | 0.270
رائع | wonderful | 0.272
ممتع | enjoyable | 0.357
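For reference, one common way to write an L1-regularized linear SVM is shown below. The slides do not give the exact objective, so the hinge loss and the symbols are assumptions: w is the weight (coefficient) vector over the d features, b the bias, C the trade-off parameter, and (x_i, y_i) the n training reviews with labels y_i in {-1, +1}. The L1 norm tends to push many weights exactly to zero, which is the sparsity the approach relies on.

```latex
\min_{w,\,b}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr),
\qquad \lVert w \rVert_1 = \sum_{j=1}^{d} \lvert w_j \rvert
```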
Building multi-domain lexicons
Pipeline: Datasets → Training L1-norm SVM → Selecting top features → Manual verification → Multi-domain lexicons
• Train an SVM classifier on each of the generated datasets using a unigram + bigram model
• Omit features whose coefficients are zero
• Label features with positive coefficients as positive lexicon entries
• Label features with negative coefficients as negative lexicon entries
• Manually filter and verify the resulting lexicon (much easier than building one from scratch)
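The coefficient-to-lexicon step can be sketched in a few lines. The example reuses the English glosses and coefficient values from the coefficient-vector slide. The function name `build_lexicon` is hypothetical. In practice the weights would come from a trained L1-regularized linear SVM (for instance, scikit-learn's `LinearSVC(penalty="l1", dual=False)` exposes them as `coef_`), but the slides do not say which implementation was used.

```python
def build_lexicon(feature_names, coefficients):
    """Split features into candidate positive and negative lexicon entries.

    Features with a zero coefficient are omitted. The result is only a
    candidate list: the manual filtering step still has to follow.
    """
    if len(feature_names) != len(coefficients):
        raise ValueError("one coefficient is required per feature")
    positive, negative = {}, {}
    for name, weight in zip(feature_names, coefficients):
        if weight > 0:
            positive[name] = weight
        elif weight < 0:
            negative[name] = weight
        # weight == 0: the L1 penalty discarded this feature
    return positive, negative


# Glosses and coefficients taken from the coefficient-vector slide.
features = ["doesn't deserve", "bad", "failure", "happen", "scene",
            "better", "wonderful", "enjoyable"]
weights = [-0.532, -0.52, -0.4, 0.0, 0.0, 0.270, 0.272, 0.357]
positive, negative = build_lexicon(features, weights)
```

Keeping the coefficient alongside each entry preserves a polarity weight that the weighted-count features can use later.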
Building multi-domain lexicons from the datasets
Size of the built multi-domain lexicons before and after manual filtering
Count | Hotels | Restaurants | Movies | Products | LABR / Books | ALL
# Features with non-zero coefficients | 556 | 1413 | 526 | 661 | 3552 | 6708
# Entries after manual filtering | 218 | 734 | 87 | 369 | 874 | 1913
Building multi-domain lexicons from the datasets
Selected examples from the generated lexicons:
Hotels | Restaurants | Movies
لن أعود (not coming back) | بارد (cold) | يستحق المشاهدة (worth watching)
المياه ضعيفة (low water pressure) | يشبع (enough portions) | برافو (bravo)
Experiments and Benchmarking Datasets
Benchmarking the datasets for the task of sentiment analysis:
• Verify the viability of using the datasets for sentiment analysis
• Test the effectiveness of the generated lexicons
• Publish the results of all experiments for further analysis
• Provide an easy benchmarking framework for future sentiment classifiers
Experiments and Benchmarking Datasets
Dataset setups:
• 2-class sentiment classification (positive or negative)
• 3-class sentiment classification (positive, negative, or mixed/neutral)
• Balanced / unbalanced setups
• 20%-80% splits (testing the generated lexicons on unseen data)
• Cross-validation
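The balanced setup and the 20%-80% split could look roughly like the sketch below. Two assumptions are made that the slides do not confirm: balancing is done by undersampling every class to the size of the smallest one, and the split means 80% training and 20% testing. Both function names are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_undersampling(examples, seed=0):
    """Return a class-balanced subset of (text, label) pairs.

    Assumption: every class is randomly undersampled to the size of the
    smallest class.
    """
    if not examples:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


def split_80_20(examples, seed=0):
    """Shuffle and split into (train, test); assumes 80% train, 20% test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]
```

Building the lexicon from the training part only is what makes the test part "unseen data" for the lexicon-based features.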
Experiments and Benchmarking Datasets
Feature-building methods:
• Standard feature-building methods:
    • Count, TF-IDF, Delta-TFIDF
• Features built from the generated lexicons:
    • Term existence, term count, weighted count
    • Domain-specific lexicon, domain-general lexicon
• Merging lexicon-based features with other features
Classifiers: linear SVM, logistic regression, Bernoulli Naive Bayes (BNB), k-nearest neighbors (KNN), and SGD
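The three lexicon-based feature types can be illustrated with a short sketch. It assumes a lexicon stored as a mapping from term to polarity weight and matches both unigrams and bigrams, since the lexicons were learned from a unigram + bigram model. The function name `lexicon_features` and the toy lexicon in the usage note are hypothetical.

```python
from collections import Counter


def lexicon_features(tokens, lexicon):
    """Build term-existence, term-count and weighted-count features.

    `tokens` is the tokenized review (a list of strings) and `lexicon`
    maps each term (unigram or space-joined bigram) to a polarity weight.
    Returns the term order plus three vectors aligned with it.
    """
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(list(tokens) + bigrams)
    terms = sorted(lexicon)
    existence = [1 if counts[term] > 0 else 0 for term in terms]
    term_count = [counts[term] for term in terms]
    weighted_count = [counts[term] * lexicon[term] for term in terms]
    return terms, existence, term_count, weighted_count
```

For example, with the toy lexicon `{"wonderful": 0.272, "bad": -0.52}` and the tokens `["wonderful", "scene", "wonderful"]`, the term count for "wonderful" is 2 and its existence feature is 1. Each vector has one entry per lexicon term, which is why these feature spaces are so much smaller than the count-based ones.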
Experiments and Benchmarking Datasets
• 3075 experiments, resulting from combining all classifiers, features, and dataset setups.
• Results are publicly available for further analysis and as benchmarks
Mining experiments results
Mining the experiment results to answer questions such as:
• What are the top-performing classifier and feature combinations?
• Can we rely only on lexicons for sentiment analysis?
• What is the effect of combining lexicon-based features with other features?
• Are shorter documents easier to classify?
• Are documents richer in subjective words easier to classify?
Can we rely only on lexicon-based features for sentiment classification?
Can features generated from lexicons provide adequate accuracy relative to other feature-generation methods?
Mining experiments results
2-class classification
Features | Number of features | Average accuracy
Lex-domain | ~500 | 0.768
Lex-all | 1913 | 0.782
Count | ~50K | 0.783
Mining experiments results
3-class classification
Features | Number of features | Average accuracy
Lex-domain | ~500 | 0.549
Lex-all | 1913 | 0.554
Count | ~50K | 0.570
What is the effect of merging lexicon-based features with other features?
Can features generated from lexicons provide adequate accuracy relative to other feature-generation methods?
Mining experiments results
2-class classification
Features | Aggregated lexicon | Average accuracy | Enhancement
Count | None | 0.783 | -
Count | Lex-domain | 0.790 | +1%
Count | Lex-all | 0.796 | +1.6%
TF-IDF | None | 0.7 | -
TF-IDF | Lex-domain | 0.791 | +9.1%
TF-IDF | Lex-all | 0.8 | +10%
Delta-TFIDF | None | 0.692 | -
Delta-TFIDF | Lex-domain | 0.789 | +9.7%
Delta-TFIDF | Lex-all | 0.798 | +10.6%
Are shorter documents easier to classify?
Or longer ones? How about longer ones rich with subjective terms?
Mining experiments results
Storyline: Patch Adams was desperate and attempted suicide many times, until he was sent to a mental hospital…
……..
Then he started unintentionally helping others by socializing with them until they got better
Mining experiments results
• Document length: number of tokens per document (log scale)
• Subjectivity score: sum of the polarities of the words that appear in the document (using the generated lexicons)
• Error rate: number of misclassified documents in each group (grouped by document length and subjectivity score)
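The three quantities above can be combined as in the sketch below. Several details are assumptions rather than the authors' exact procedure: length bins are floor(log2(number of tokens)), the subjectivity score is taken literally as a signed sum of lexicon polarities and rounded to the nearest integer for binning, and the error is reported as a fraction of each group rather than a raw count. The function names are hypothetical.

```python
from collections import defaultdict


def subjectivity_score(tokens, lexicon):
    """Sum of the lexicon polarities of the tokens in one document."""
    return sum(lexicon.get(token, 0.0) for token in tokens)


def error_rate_by_group(documents, lexicon):
    """Error rate per (length bin, subjectivity bin) group.

    `documents` is an iterable of (tokens, true_label, predicted_label).
    Returns a dict mapping each group key to the fraction of its
    documents that were misclassified.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for tokens, truth, predicted in documents:
        if not tokens:
            continue  # an empty document has no log-scale length bin
        # bit_length() - 1 equals floor(log2(n)) for any n >= 1.
        length_bin = len(tokens).bit_length() - 1
        score_bin = round(subjectivity_score(tokens, lexicon))
        key = (length_bin, score_bin)
        totals[key] += 1
        if truth != predicted:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}
```

The resulting dictionary has one value per cell, which is the shape needed for a heat map of error rate over length and subjectivity.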
Mining experiments results
The error rate for various document-length and subjectivity-score groups (the darker, the worse)
Conclusion
• Built large multi-domain datasets for sentiment analysis (33K reviews)
• Proposed an approach for semi-automatically learning multi-domain lexicons (~2K entries)
• Everything is publicly available:
    • Datasets (raw + processed)
    • Lexicons
    • Web scrapers (to rerun for more recent reviews)
    • Experiment code and results
Questions ?
Slides : bit.ly/cicling2015_elsahar_slides
Datasets : bit.ly/cicling2015_elsahar_resources
