This document provides an overview of the components and architecture for a project to acquire unstructured data from various sources for sentiment analysis. The main objectives are to streamline data acquisition, create corpora for contextual opinions and sentiments, and detect trends based on reviews and comments. The proposed architecture uses Python, Django, Scrapy, MySQL/MongoDB/Hbase for data storage, and R Project and Hadoop for text mining and massive storage. It describes how crawlers and APIs will be used to gather data from social media and other sources for preprocessing, analysis and output of results.
2. Objectives
•Streamline and facilitate the process of unstructured data acquisition
•Create and manage corpora for contextual opinions and sentiments
•Detect trends based on contextual reviews, comments, discussions…
•Run and train models for sentiment or opinion analysis
•Provide figures, results and graphs as outputs
3. Software components
•Python
–Program language
•Django : Web application container
•Scrapy : Web crawler
•Libraries : Twitter, …
•MySQL / MongoDB / HBase
–For the time being, no definitive choice has been made, but the final solution could be a mix of different databases depending on the nature of the use.
•R Project
–R Project will be used whenever specific text mining libraries are missing in Python or it becomes easier to use R instead of Python. In that case, the R scripts will be encapsulated in Python programs.
•Hadoop
–For massive storage we will use Hadoop. The architecture is not yet defined.
–It is used for raw data storage.
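The R Project item above says R scripts will be encapsulated in Python programs. One possible way to do that is to call the `Rscript` command-line tool through `subprocess`. This is only a sketch: the function names, the `--vanilla` choice and the timeout are assumptions, not something the slides specify.

```python
import subprocess
from typing import List, Sequence


def build_rscript_command(script_path: str, args: Sequence[str] = ()) -> List[str]:
    """Build the argument list used to run an R script through Rscript."""
    # --vanilla keeps the run independent of the user's saved R workspace.
    return ["Rscript", "--vanilla", script_path, *args]


def run_r_script(script_path: str, args: Sequence[str] = (), timeout: float = 300.0) -> str:
    """Run an R script and return whatever it printed to standard output."""
    completed = subprocess.run(
        build_rscript_command(script_path, args),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return completed.stdout
```

A Python program would then call, for example, `run_r_script("text_mining.R", ["corpus.csv"])`, where both file names are hypothetical, and parse the returned text.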
5. Architecture components
1
Data sources : Access will be managed via APIs or crawls. Sources are all those related to social media -> blogs, forums, advisors, social web… In general, all media where sentiments / opinions are expressed.
2
Web interface to interact with the system -> to manage inputs, configurations, outputs…
3
There will be a mix of Scrapy (the crawler) and Python scripts for using APIs. Basically, the engine will be used to gather data from all sources and store it for further processing (pre-processing and analysis).
4
There will be a mix of Scrapy (the crawler) and Python scripts for using APIs. Basically, the engine will be used to gather data from all sources and store it for further processing.
5
The target database solution is not yet selected. The objective is to store all relevant content, whether it is raw data, configuration items or output results.
6. Characteristics of Sentiment Analysis
Sentiment = Holder + Polarity + Target + Auxiliary
–Holder: who expresses the sentiment
–Target: what/whom the sentiment is expressed to
–Polarity: the nature of the sentiment (e.g., positive or negative)
“The games in iPhone 4s are pretty funny!”
Feature/Aspect: games; Target: iPhone 4s; Polarity: Positive
Holder = the user/reviewer
Auxiliary
•Strength : Differentiate the intensity
•Confidence : Measure the reliability of the sentiment
•Summary : Explain the reason inducing the sentiment
•Time
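The decomposition Sentiment = Holder + Polarity + Target + Auxiliary can be represented as a small record type. The sketch below is one possible representation; the class and field names, the value types and the free-text polarity are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Auxiliary:
    """Optional extra information attached to a sentiment."""
    strength: Optional[float] = None    # intensity of the sentiment
    confidence: Optional[float] = None  # reliability of the sentiment
    summary: Optional[str] = None       # reason inducing the sentiment
    time: Optional[datetime] = None     # when the sentiment was expressed


@dataclass
class Sentiment:
    holder: str    # who expresses the sentiment
    polarity: str  # e.g. "positive" or "negative"
    target: str    # what/whom the sentiment is expressed to
    auxiliary: Auxiliary = field(default_factory=Auxiliary)


# The example sentence from this slide, encoded by hand.
example = Sentiment(
    holder="the user/reviewer",
    polarity="positive",
    target="the games in iPhone 4s",
)
```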
7. Basic Tasks
•Holder detection –Find who expresses the sentiment
•Target recognition –Find whom/what the sentiment is expressed towards
•Sentiment (Polarity) classification –Positive, negative, neutral
•Opinion summarization
•Opinion spam detection
8. Subjectivity versus Sentiment
•Sentiment analysis is also known as opinion mining.
•Attempts to identify the opinion/sentiment that a person may hold towards an object
•It is a finer-grained analysis compared to subjectivity analysis
9. Lexicon Based Sentiment Classification
Basic idea
•Use the dominant polarity of the opinion words in the sentence to determine its polarity :
•If positive/negative opinion prevails, the opinion sentence is regarded as positive/negative
•Lexicon + Counting
•Lexicon + Grammar Rule + Inference Method
Example Lexicon :
http://www.wjh.harvard.edu/~inquirer
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
http://sentiwordnet.isti.cnr.it/
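The "Lexicon + Counting" variant can be sketched in a few lines: count positive and negative opinion words and let the dominant polarity decide. The two word sets below are a tiny hypothetical stand-in for a real lexicon such as the ones listed above, and the tokenizer is deliberately naive.

```python
import re

# Hypothetical miniature lexicon; a real system would load one of the listed resources.
POSITIVE = {"good", "great", "funny", "love", "excellent"}
NEGATIVE = {"bad", "poor", "boring", "hate", "terrible"}


def classify_sentence(sentence: str) -> str:
    """Return the dominant polarity of the opinion words in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"  # no opinion words, or a tie
```

Pure counting ignores negation ("not good") and intensifiers, which is the gap the "Lexicon + Grammar Rule + Inference Method" variant is meant to address.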
10. Sentiment Analysis Tasks
Level | Task description
Document
•Task: sentiment classification of reviews
•Classes: positive, negative, and neutral
•Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinion from a single opinion holder.
Sentence
•Task 1: identifying subjective/opinionated sentences
•Classes: objective and subjective (opinionated)
•Task 2: sentiment classification of sentences
•Classes: positive, negative and neutral.
•Assumption: a sentence contains only one opinion; not true in many cases.
•Then we can also consider clauses or phrases.
Feature
•Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
•Task 2: Determine whether the opinions on the features are positive, negative or neutral.
•Task 3: Group feature synonyms.
•Produce a feature-based opinion summary of multiple reviews.
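The feature-level tasks end in a feature-based opinion summary. Assuming Tasks 1 and 2 have already produced (feature, polarity) pairs, grouping synonyms and aggregating could look like the sketch below; the synonym table and the input format are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical synonym table mapping surface forms to a canonical feature name (Task 3).
FEATURE_SYNONYMS = {"picture": "photo", "image": "photo", "battery life": "battery"}


def summarize(opinions: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count polarities per canonical feature across many reviews."""
    summary = defaultdict(Counter)
    for feature, polarity in opinions:
        key = feature.lower()
        canonical = FEATURE_SYNONYMS.get(key, key)
        summary[canonical][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}
```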
11. Some tools
Lexicon-based tools
•Use sentiment and subjectivity lexicons
•Rule-based classifier
•A sentence is subjective if it has at least two words in the lexicon
•A sentence is objective otherwise
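The rule stated above (subjective if at least two lexicon words, objective otherwise) translates directly into code. The lexicon here is a small hypothetical sample, and the sketch counts word occurrences rather than distinct words, which is one possible reading of the rule.

```python
import re

# Hypothetical sample of a subjectivity lexicon.
SUBJECTIVITY_LEXICON = {"believe", "love", "hate", "pretty", "funny", "terrible", "great"}


def is_subjective(sentence: str, lexicon=SUBJECTIVITY_LEXICON, threshold: int = 2) -> bool:
    """Apply the rule: subjective if at least `threshold` lexicon words occur."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(token in lexicon for token in tokens)
    return hits >= threshold
```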
Corpus-based tools
•Use corpora annotated for subjectivity and/or sentiment
•Train machine learning algorithms:
•Naïve Bayes
•Decision trees
•SVM
•…
•Learn to automatically annotate new text
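As an illustration of the corpus-based approach, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written with the standard library only. The four training sentences are invented for the example; a real corpus would be far larger, and an established machine learning library would normally replace this hand-written class.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {word for counts in self.word_counts.values() for word in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented annotated corpus, for illustration only.
train_texts = [
    "I love this phone",
    "great games and great screen",
    "terrible battery",
    "I hate the slow camera",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)
```

Once trained, `model.predict(...)` annotates new text with one of the labels seen in the corpus.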
13. Sentiment Analysis: Holder detection
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns
International officers believe that the EU will prevail.
International officers said US officials want the EU to prevail.
•View source identification as an information extraction task and tackle the problem using sequence tagging and pattern matching techniques simultaneously
•Linear-chain CRF model to identify opinion sources
•Patterns incorporated as features
15. Sentiment Analysis: Twitter
1.Tweet normalization –A simple rule-based model –“gooood” to “good”, “luve” to “love”
2.POS tagging –OpenNLP POS tagger
3.Word stemming –A word stem mapping table (about 20,000 entries)
4.Syntactic parsing –A Maximum Spanning Tree dependency parser
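Steps 1 and 3 of this pipeline are simple enough to sketch. The version below assumes normalization means collapsing runs of three or more identical letters to two and then applying a slang lookup, and that stemming is a plain table lookup. Both tables are tiny hypothetical stand-ins (the slide mentions about 20,000 entries for the stem table).

```python
import re

# Hypothetical lookup tables, for illustration only.
SLANG_TABLE = {"luve": "love"}
STEM_TABLE = {"games": "game", "loved": "love"}


def normalize_token(token: str) -> str:
    """Rule-based normalization of a single tweet token."""
    token = token.lower()
    # Collapse any letter repeated three or more times down to two: "gooood" -> "good".
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)
    return SLANG_TABLE.get(token, token)


def preprocess(tweet: str) -> list:
    """Normalize each token, then map it to its stem when the table knows it."""
    tokens = re.findall(r"[A-Za-z']+", tweet)
    normalized = [normalize_token(token) for token in tokens]
    return [STEM_TABLE.get(token, token) for token in normalized]
```

The collapsing rule is a heuristic: it repairs "gooood" but would turn "sooo" into "soo", so a dictionary check would be needed to pick between one and two letters.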
16. Crawling scenario : Definition
Scenario x
Instance 1
Instance 2
Instance n
Selected URLs
Configuration parameters
Name
Key words
…
•Scenario : 1 -> n : Category.
•Theme: n -> n : Scenario
•Scenario : 1 -> n : instance
•The scenario defines the type of crawl we want to run. It is tied to the notion of instance, which is considered a specific configuration of a scenario.
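The relationships above can be captured in a small data model. The sketch below uses plain dataclasses; the field names are guesses based on the diagram labels (name, keywords, selected URLs, configuration parameters), and themes and categories are kept as simple name lists rather than full objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """A specific configuration of a scenario."""
    name: str
    urls: List[str] = field(default_factory=list)              # selected URLs
    keywords: List[str] = field(default_factory=list)
    parameters: Dict[str, str] = field(default_factory=dict)   # configuration parameters


@dataclass
class Scenario:
    """The type of crawl to run."""
    name: str
    themes: List[str] = field(default_factory=list)        # Theme n -> n Scenario
    categories: List[str] = field(default_factory=list)    # Scenario 1 -> n Category
    instances: List[Instance] = field(default_factory=list)  # Scenario 1 -> n Instance

    def add_instance(self, instance: Instance) -> Instance:
        self.instances.append(instance)
        return instance


# Hypothetical usage: one scenario with one configured instance.
scenario = Scenario(name="forum-crawl", categories=["smartphones"])
scenario.add_instance(
    Instance(
        name="instance-1",
        urls=["https://example.com/forum"],
        keywords=["iPhone 4s"],
        parameters={"depth": "2"},
    )
)
```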
URL management module
Configuration parameter management module
We should look at the GUI currently being developed for Nutch and draw on it for managing parameters and URLs.
Theme
Category