SlideShare una empresa de Scribd logo
1 de 51
QUESTIONS ABOUT
QUESTIONS: AN
EMPIRICAL ANALYSIS OF
INFORMATION NEEDS
ON TWITTER
ABSTRACT
 We take the initiative to extract and analyze information
needs from billions of online conversations collected from
Twitter.
 We can accurately detect real questions in tweets
 We then present a comprehensive analysis of the large-scale
collection of information needs we extracted.
INTRODUCTION
 13% of a random sample of tweets were questions.
1. Broadcasting
2. Targeting the question to particular friends.
 Information needs through social platforms present a higher
coverage of topics related to human interest, entertainment,
and technology, compared to search engine queries.
CONTRIBUTION
 We present the first very large scale and longitudinal study of
information needs in Twitter.
 We prove that information needs detected on Twitter have a
considerable power of predicting the trends of search engine
queries.
 Through the in-depth analysis of various types of time series,
we find many interesting patterns related to the entropy of
language and bursts of information needs.
EXPERIMENT SETUP
 The collection covers a period of 358 days, from July 10th,
2011 to June 31st 2012.
 4,580,153,001 tweets
 We focus on tweets that contain at least one question mark.
 81 .5% of information needs asked through social platforms
were explicitly phrased as questions and included a question
mark.
 In our collection of tweets, 10.45 % of tweets contain explicit
appearance of question mark(s).
DETECTING INFORMATION NEEDS
 A text classification problem .
1. Give a formal definition of this problem and generate a set
of labeled tweets as training/testing examples.
2. Introduce a classifier trained with these examples, using the
state-of-the-art machine learning algorithms and a
comprehensive collection of features .
DEfiNITION AND RUBRICS
 “real questions” :
 A tweet conveys an information need, or is a real question, if
it expects an informational answer from either the general
audience or particular recipients.
1. it requests for a piece of factual knowledge, or a
confirmation of a piece of factual knowledge
2. it requests for an opinion, idea, preference,
recommendation, or personal plan of the recipient(s), as
well as a confirmation of such information .
HUMAN ANNOTATION





two human annotators
sampled 5,000 tweets
3,119 tweets are labeled as real tweets
1 ,595 are labeled as conveying an information need and
1 ,524 are labeled not conveying an information need

 The inter-rater reliability measured by Cohen’s kappa score is
0.8350
TEXT CLASSIFICATION
Feature Extraction
 Lexical features / the semantic knowledge base WordNet /
syntactical features
 four dif ferent types of feature from each tweet:
lexical ngrams, synonyms and hypernyms of words(obtained from the
WordNet), ngrams of the part-of-speech (POS) tags, and light
metadata and statistical features such as the length of the tweet and
coverage of vocabulary,etc..
TEXT CLASSIFICATION
 Lexical Features
 We included unigrams, bigrams, as well as trigrams .
 For example tweets beginning with the 5Ws(who, when, what,
where, and why) are more likely to be real questions.

 44,121lexicalfeatures.
TEXT CLASSIFICATION
 WordNet Features
 synonyms
 hypernyms
 By doing this, our algorithm can also handle words that
haven’t been seen in the training data.
 23,277 WordNet features are extracted
TEXT CLASSIFICATION
 Part-of-Speech Features
 Capture light syntactic information .
1. given a tweet with n words, w 1 ;w 2 ; … ;w n , we extract
grams from the part -of-speech sequence of the tweet, is t
1 ;t 2 ;…;t n ,
2. Extract unigrams, bigrams and trigrams from this part of-speech sequence as additional features of the tweet .
 3,902 POS features are extracted in total
TEXT CLASSIFICATION
 Meta Features
 6 meta data features and simple statistical features of the
tweet
 such as the length of the tweets, the number of words, the
coverage of vocabulary, the number of capitalized words,
whether or not the tweet contains a URL, and whether or not it
mentions other users.
TEXT CLASSIFICATION
Feature Selection
 Reduce the dimensionality of the data
 Bi-Normal Separation

 tpr = tp= ( tp + fn )
 fpr = fp= ( fp + tn )

 F is the Normal cumulative distribution function
TRAINING CLASSIFIER
1. train four independent classifiers using the Support Vector
Machine
2. combine the four classifiers that represent four types of
features into one stronger classier using boosting, Adaptive
Boosting
 Adaboost is an ef fective algorithm that trains a strong
classifier based on several groups of weak classifiers.
 AdaBoost方法是一种迭代算法,在每一轮中加入一个新的弱分类
器,直到达到某个预定的足够小的错误率。每一个训练样本都被赋
予一个权重,表明它被某个分类器选入训练集的概率。如果某个样
本点已经被准确地分类,那么在构造下一个训练集中,它被选中的
概率就被降低;相反,如果某个样本点没有被准确地分类,那么它
的权重就得到提高。
TRAINING CLASSIFIER
 After several iterations, when the combination of weak
classifiers starts to achieve a higher performance, the
diversity inside the combination is getting lower.
 add a parameter to control for the diversity of the weak
learners in each iteration .
 The diversity that a new classifier could add in iteration t is
defined as follows:
 The diversity of a classier represents how much new
information it could provide to a group of classifiers that have
already been trained in Adaboost.
EVALUATION OF THE CLASSIFIER
EVALUATION OF THE CLASSIFIER
EVALUATION OF THE CLASSIFIER
 The accuracy of the classifier improved from 85.6% to 86.6%
 The small margin suggests that the lexical features are strong
enough in detecting information needs, while other types of
features add little to the success.
ANALYZING INFORMATION NEEDS
 136,841,672 tweets conveying information need between July
10th 2011 to June 31st 2012.
 This is roughly a proportion of 3% of all tweets, and 28.6% of
tweets with question marks.
GENERAL TREND
 first 5 months
 we normalize the time series so that the two curves are easier
to be aligned on the plot
GENERAL TREND
KEYWORDS
 what people ask .
KEYWORDS
BURSTINESS
 we adopt a straightforward solution to detect similar burst
events in the time series of information needs and the
background.
ENTROPY
PREDICTIVE POWER
CONCLUSION
 we present the first large-scale analysis of information needs,
or questions, in Twitter.
 We proposed an automatic classification algorithm that
distinguishes real questions from tweets with question marks
 We then present a comprehensive analysis of the large-scale
collection of information needs we extracted.
ON PARTICIPATIONIN
GROUP CHATS ON
TWITTER
ABSTRACT
 To predict whether a user that attended her first session in a
particular Twitter chat group will return to the group, we build
5F Model that captures five different factors: individual
initiative, group characteristics, perceived receptivity,
linguistic affinity, and geographical proximity .
 The research question we investigate is what factors ensure continued individual participation in a Twitter chat.
5F MODEL
 Individual Initiative
1. user tweetcount denotes the number of tweets the user
contributes to the session.
2. userurl denotes the number of urls the user contributes to
the chat session.
3. usermentions is the total number of times the user
mentions another (by using @).
4. userretweets is the number of retweets by the newcomer
user and captures the amount of information she found to
be worth sharing with her followers.
5F MODEL
 Group Characteristics
5F MODEL
 Perceived Receptivity
5F MODEL
 Linguistic Af finity
 We make use of Linguistic Inquir y and Word Count ( LIWC) to
compare linguistic markers between a user and a group .
 LIWC is a text analysis software that calculates the degree to
which people use dif ferent categories of words across a wide
array of texts
 We consider the set of tweets a user ui shares in her first
session as a text document and compute the value of each
linguistic marker to obtain her LIWC-vector for that particular
session.
GEOGRAPHIC PROXIMIT Y
 the distance d (in meters) between two users ui and uj
DATA SET
 Group Chats Studied
 June 2010-July 2012.
 By identifying the hours of high activity,we capture the
sessions for each chat
SALIENT STATISTICS
 Distribution of the number of users in and outside chat
sessions:
SALIENT STATISTICS
 Distribution of the number of distinct chats users
 participate in:
SALIENT STATISTICS
 Degree distribution of education chat users:
SALIENT STATISTICS
 Geographical distribution of education chat users
STATISTICAL ANALYSIS
USER SURVEY
 an online survey of 26 questions
USER SURVEY
 Usage, Advantages and Disadvantages
USER SURVEY
CONCLUSIONS
 We developed 5F Model that predicts whether a person
attending her first chat session in a particular Twitter chat
group will return to the group.
 We performed statistical data analysis for thirty educational
Twitter chats involving 71411 users and 730944 tweets over a
period of two years .
 We also complemented the results of statistical analysis with
a survey study.

Más contenido relacionado

La actualidad más candente

Python report on twitter sentiment analysis
Python report on twitter sentiment analysisPython report on twitter sentiment analysis
Python report on twitter sentiment analysisAntaraBhattacharya12
 
Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis reportSavio Aberneithie
 
IRE2014-Sentiment Analysis
IRE2014-Sentiment AnalysisIRE2014-Sentiment Analysis
IRE2014-Sentiment AnalysisGangasagar Patil
 
Twitter sentiment analysis ppt
Twitter sentiment analysis pptTwitter sentiment analysis ppt
Twitter sentiment analysis pptSonuCreation
 
Eswc2013 audience short
Eswc2013 audience shortEswc2013 audience short
Eswc2013 audience shortClaudia Wagner
 
sentiment analysis text extraction from social media
sentiment  analysis text extraction from social media sentiment  analysis text extraction from social media
sentiment analysis text extraction from social media Ravindra Chaudhary
 
Sentiment analysis in twitter using python
Sentiment analysis in twitter using pythonSentiment analysis in twitter using python
Sentiment analysis in twitter using pythonCloudTechnologies
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis ReportAbanoub Amgad
 
A review of sentiment analysis approaches in big
A review of sentiment analysis approaches in bigA review of sentiment analysis approaches in big
A review of sentiment analysis approaches in bigNurfadhlina Mohd Sharef
 
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and TweetsSentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and Tweets🧑‍💻 Manuel Coppotelli
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGijnlc
 
Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area IJECEIAES
 
Evolving Swings (topics) from Social Streams using Probability Model
Evolving Swings (topics) from Social Streams using Probability ModelEvolving Swings (topics) from Social Streams using Probability Model
Evolving Swings (topics) from Social Streams using Probability ModelIJERA Editor
 
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse Analysis Naveen Kumar
 
Ontology based sentiment analysis
Ontology based sentiment analysisOntology based sentiment analysis
Ontology based sentiment analysisprathako
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion miningAnkush Mehta
 

La actualidad más candente (20)

Python report on twitter sentiment analysis
Python report on twitter sentiment analysisPython report on twitter sentiment analysis
Python report on twitter sentiment analysis
 
Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis report
 
Twitter sentiment analysis ppt
Twitter sentiment analysis pptTwitter sentiment analysis ppt
Twitter sentiment analysis ppt
 
IRE2014-Sentiment Analysis
IRE2014-Sentiment AnalysisIRE2014-Sentiment Analysis
IRE2014-Sentiment Analysis
 
Twitter sentiment analysis ppt
Twitter sentiment analysis pptTwitter sentiment analysis ppt
Twitter sentiment analysis ppt
 
Eswc2013 audience short
Eswc2013 audience shortEswc2013 audience short
Eswc2013 audience short
 
sentiment analysis text extraction from social media
sentiment  analysis text extraction from social media sentiment  analysis text extraction from social media
sentiment analysis text extraction from social media
 
Sentiment analysis in twitter using python
Sentiment analysis in twitter using pythonSentiment analysis in twitter using python
Sentiment analysis in twitter using python
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis Report
 
A review of sentiment analysis approaches in big
A review of sentiment analysis approaches in bigA review of sentiment analysis approaches in big
A review of sentiment analysis approaches in big
 
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and TweetsSentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
 
Opinion Mining – Twitter
Opinion Mining – TwitterOpinion Mining – Twitter
Opinion Mining – Twitter
 
Twitter Analytics
Twitter AnalyticsTwitter Analytics
Twitter Analytics
 
Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area Sensing Trending Topics in Twitter for Greater Jakarta Area
Sensing Trending Topics in Twitter for Greater Jakarta Area
 
Evolving Swings (topics) from Social Streams using Probability Model
Evolving Swings (topics) from Social Streams using Probability ModelEvolving Swings (topics) from Social Streams using Probability Model
Evolving Swings (topics) from Social Streams using Probability Model
 
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
 
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new datase...
 
Ontology based sentiment analysis
Ontology based sentiment analysisOntology based sentiment analysis
Ontology based sentiment analysis
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion mining
 

Destacado

Generating event storylines from microblogs
Generating event storylines from microblogsGenerating event storylines from microblogs
Generating event storylines from microblogsmoresmile
 
Presentation-for-broker
Presentation-for-brokerPresentation-for-broker
Presentation-for-brokerDenis Aristov
 
Is it time for a career switch
Is it time for a career switchIs it time for a career switch
Is it time for a career switchmoresmile
 
Using content and interactions for discovering communities in
Using content and interactions for discovering communities inUsing content and interactions for discovering communities in
Using content and interactions for discovering communities inmoresmile
 
Magnet community identification on social networks
Magnet community identification on social networksMagnet community identification on social networks
Magnet community identification on social networksmoresmile
 
Finding bursty topics from microblogs
Finding bursty topics from microblogsFinding bursty topics from microblogs
Finding bursty topics from microblogsmoresmile
 
Topical keyphrase extraction from twitter
Topical keyphrase extraction from twitterTopical keyphrase extraction from twitter
Topical keyphrase extraction from twittermoresmile
 
Презентация проекта MegaStrahovka.ru
Презентация проекта MegaStrahovka.ru Презентация проекта MegaStrahovka.ru
Презентация проекта MegaStrahovka.ru Denis Aristov
 
Презентация CarScan24.ru
Презентация CarScan24.ruПрезентация CarScan24.ru
Презентация CarScan24.ruDenis Aristov
 
When relevance is not enough
When relevance is not enoughWhen relevance is not enough
When relevance is not enoughmoresmile
 
Modul pengukuran. aliran fluida.
Modul   pengukuran. aliran fluida.Modul   pengukuran. aliran fluida.
Modul pengukuran. aliran fluida.bacukids
 
Exploring social influence via posterior effect of word of-mouth
Exploring social influence via posterior effect of word of-mouthExploring social influence via posterior effect of word of-mouth
Exploring social influence via posterior effect of word of-mouthmoresmile
 
Accounting principles
Accounting principlesAccounting principles
Accounting principlescantaboop
 
Event summarization using tweets
Event summarization using tweetsEvent summarization using tweets
Event summarization using tweetsmoresmile
 
Doppler effect experiment and applications
Doppler effect experiment and applicationsDoppler effect experiment and applications
Doppler effect experiment and applicationsmarina fayez
 

Destacado (18)

HSI_Intro_Short
HSI_Intro_ShortHSI_Intro_Short
HSI_Intro_Short
 
Generating event storylines from microblogs
Generating event storylines from microblogsGenerating event storylines from microblogs
Generating event storylines from microblogs
 
Presentation-for-broker
Presentation-for-brokerPresentation-for-broker
Presentation-for-broker
 
Doppler
DopplerDoppler
Doppler
 
Is it time for a career switch
Is it time for a career switchIs it time for a career switch
Is it time for a career switch
 
Using content and interactions for discovering communities in
Using content and interactions for discovering communities inUsing content and interactions for discovering communities in
Using content and interactions for discovering communities in
 
Magnet community identification on social networks
Magnet community identification on social networksMagnet community identification on social networks
Magnet community identification on social networks
 
Finding bursty topics from microblogs
Finding bursty topics from microblogsFinding bursty topics from microblogs
Finding bursty topics from microblogs
 
Topical keyphrase extraction from twitter
Topical keyphrase extraction from twitterTopical keyphrase extraction from twitter
Topical keyphrase extraction from twitter
 
Презентация проекта MegaStrahovka.ru
Презентация проекта MegaStrahovka.ru Презентация проекта MegaStrahovka.ru
Презентация проекта MegaStrahovka.ru
 
Презентация CarScan24.ru
Презентация CarScan24.ruПрезентация CarScan24.ru
Презентация CarScan24.ru
 
Presentation2
Presentation2Presentation2
Presentation2
 
When relevance is not enough
When relevance is not enoughWhen relevance is not enough
When relevance is not enough
 
Modul pengukuran. aliran fluida.
Modul   pengukuran. aliran fluida.Modul   pengukuran. aliran fluida.
Modul pengukuran. aliran fluida.
 
Exploring social influence via posterior effect of word of-mouth
Exploring social influence via posterior effect of word of-mouthExploring social influence via posterior effect of word of-mouth
Exploring social influence via posterior effect of word of-mouth
 
Accounting principles
Accounting principlesAccounting principles
Accounting principles
 
Event summarization using tweets
Event summarization using tweetsEvent summarization using tweets
Event summarization using tweets
 
Doppler effect experiment and applications
Doppler effect experiment and applicationsDoppler effect experiment and applications
Doppler effect experiment and applications
 

Similar a Questions about questions

Final Poster for Engineering Showcase
Final Poster for Engineering ShowcaseFinal Poster for Engineering Showcase
Final Poster for Engineering ShowcaseTucker Truesdale
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionieeepondy
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET -  	  Twitter Sentiment Analysis using Machine LearningIRJET -  	  Twitter Sentiment Analysis using Machine Learning
IRJET - Twitter Sentiment Analysis using Machine LearningIRJET Journal
 
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
IRJET - Implementation of Twitter Sentimental Analysis According to Hash TagIRJET Journal
 
Lexicon Based Emotion Analysis on Twitter Data
Lexicon Based Emotion Analysis on Twitter DataLexicon Based Emotion Analysis on Twitter Data
Lexicon Based Emotion Analysis on Twitter Dataijtsrd
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition1crore projects
 
Text mining on Twitter information based on R platform
Text mining on Twitter information based on R platformText mining on Twitter information based on R platform
Text mining on Twitter information based on R platformFayan TAO
 
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...Shakas Technologies
 
A Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live TweetsA Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live Tweetsijtsrd
 
Sentiment Analysis of Twitter tweets using supervised classification technique
Sentiment Analysis of Twitter tweets using supervised classification technique Sentiment Analysis of Twitter tweets using supervised classification technique
Sentiment Analysis of Twitter tweets using supervised classification technique IJERA Editor
 
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2VecIRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2VecIRJET Journal
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...csandit
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...cscpconf
 
IRJET- Categorization of Geo-Located Tweets for Data Analysis
IRJET- Categorization of Geo-Located Tweets for Data AnalysisIRJET- Categorization of Geo-Located Tweets for Data Analysis
IRJET- Categorization of Geo-Located Tweets for Data AnalysisIRJET Journal
 

Similar a Questions about questions (20)

Final Poster for Engineering Showcase
Final Poster for Engineering ShowcaseFinal Poster for Engineering Showcase
Final Poster for Engineering Showcase
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognition
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?
 
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET -  	  Twitter Sentiment Analysis using Machine LearningIRJET -  	  Twitter Sentiment Analysis using Machine Learning
IRJET - Twitter Sentiment Analysis using Machine Learning
 
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 
Lexicon Based Emotion Analysis on Twitter Data
Lexicon Based Emotion Analysis on Twitter DataLexicon Based Emotion Analysis on Twitter Data
Lexicon Based Emotion Analysis on Twitter Data
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition
 
Text mining on Twitter information based on R platform
Text mining on Twitter information based on R platformText mining on Twitter information based on R platform
Text mining on Twitter information based on R platform
 
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
 
A Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live TweetsA Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live Tweets
 
Sentiment Analysis of Twitter tweets using supervised classification technique
Sentiment Analysis of Twitter tweets using supervised classification technique Sentiment Analysis of Twitter tweets using supervised classification technique
Sentiment Analysis of Twitter tweets using supervised classification technique
 
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2VecIRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
 
F017433947
F017433947F017433947
F017433947
 
unit-5.pdf
unit-5.pdfunit-5.pdf
unit-5.pdf
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
 
P1803018289
P1803018289P1803018289
P1803018289
 
IRJET- Categorization of Geo-Located Tweets for Data Analysis
IRJET- Categorization of Geo-Located Tweets for Data AnalysisIRJET- Categorization of Geo-Located Tweets for Data Analysis
IRJET- Categorization of Geo-Located Tweets for Data Analysis
 

Último

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Questions about questions

  • 1. QUESTIONS ABOUT QUESTIONS: AN EMPIRICAL ANALYSIS OF INFORMATION NEEDS ON TWITTER
  • 2. ABSTRACT  We take the initiative to extract and analyze information needs from billions of online conversations collected from Twitter.  We can accurately detect real questions in tweets  We then present a comprehensive analysis of the large-scale collection of information needs we extracted.
  • 3. INTRODUCTION  13% of a random sample of tweets were questions. 1. Broadcasting 2. Targeting the question to particular friends.  Information needs through social platforms present a higher coverage of topics related to human interest, entertainment, and technology, compared to search engine queries.
  • 4.
  • 5. CONTRIBUTION  We present the first very large scale and longitudinal study of information needs in Twitter.  We prove that information needs detected on Twitter have a considerable power of predicting the trends of search engine queries.  Through the in-depth analysis of various types of time series, we find many interesting patterns related to the entropy of language and bursts of information needs.
  • 6. EXPERIMENT SETUP  The collection covers a period of 358 days, from July 10th, 2011 to June 31st 2012.  4,580,153,001 tweets  We focus on tweets that contain at least one question mark.  81 .5% of information needs asked through social platforms were explicitly phrased as questions and included a question mark.  In our collection of tweets, 10.45 % of tweets contain explicit appearance of question mark(s).
  • 7. DETECTING INFORMATION NEEDS  A text classification problem . 1. Give a formal definition of this problem and generate a set of labeled tweets as training/testing examples. 2. Introduce a classifier trained with these examples, using the state-of-the-art machine learning algorithms and a comprehensive collection of features .
  • 8. DEfiNITION AND RUBRICS  “real questions” :  A tweet conveys an information need, or is a real question, if it expects an informational answer from either the general audience or particular recipients. 1. it requests for a piece of factual knowledge, or a confirmation of a piece of factual knowledge 2. it requests for an opinion, idea, preference, recommendation, or personal plan of the recipient(s), as well as a confirmation of such information .
  • 9. HUMAN ANNOTATION     two human annotators sampled 5,000 tweets 3,119 tweets are labeled as real tweets 1 ,595 are labeled as conveying an information need and 1 ,524 are labeled not conveying an information need  The inter-rater reliability measured by Cohen’s kappa score is 0.8350
  • 10. TEXT CLASSIFICATION Feature Extraction  Lexical features / the semantic knowledge base WordNet / syntactical features  four dif ferent types of feature from each tweet: lexical ngrams, synonyms and hypernyms of words(obtained from the WordNet), ngrams of the part-of-speech (POS) tags, and light metadata and statistical features such as the length of the tweet and coverage of vocabulary,etc..
  • 11. TEXT CLASSIFICATION  Lexical Features  We included unigrams, bigrams, as well as trigrams .  For example tweets beginning with the 5Ws(who, when, what, where, and why) are more likely to be real questions.  44,121lexicalfeatures.
  • 12. TEXT CLASSIFICATION  WordNet Features  synonyms  hypernyms  By doing this, our algorithm can also handle words that haven’t been seen in the training data.  23,277 WordNet features are extracted
  • 13. TEXT CLASSIFICATION  Part-of-Speech Features  Capture light syntactic information . 1. given a tweet with n words, w 1 ;w 2 ; … ;w n , we extract grams from the part -of-speech sequence of the tweet, is t 1 ;t 2 ;…;t n , 2. Extract unigrams, bigrams and trigrams from this part of-speech sequence as additional features of the tweet .  3,902 POS features are extracted in total
  • 14. TEXT CLASSIFICATION  Meta Features  6 meta data features and simple statistical features of the tweet  such as the length of the tweets, the number of words, the coverage of vocabulary, the number of capitalized words, whether or not the tweet contains a URL, and whether or not it mentions other users.
  • 15. TEXT CLASSIFICATION Feature Selection  Reduce the dimensionality of the data  Bi-Normal Separation  tpr = tp= ( tp + fn )  fpr = fp= ( fp + tn )  F is the Normal cumulative distribution function
  • 16. TRAINING CLASSIFIER 1. train four independent classifiers using the Support Vector Machine 2. combine the four classifiers that represent four types of features into one stronger classier using boosting, Adaptive Boosting  Adaboost is an ef fective algorithm that trains a strong classifier based on several groups of weak classifiers.  AdaBoost方法是一种迭代算法,在每一轮中加入一个新的弱分类 器,直到达到某个预定的足够小的错误率。每一个训练样本都被赋 予一个权重,表明它被某个分类器选入训练集的概率。如果某个样 本点已经被准确地分类,那么在构造下一个训练集中,它被选中的 概率就被降低;相反,如果某个样本点没有被准确地分类,那么它 的权重就得到提高。
  • 17. TRAINING CLASSIFIER  After several iterations, when the combination of weak classifiers starts to achieve a higher performance, the diversity inside the combination is getting lower.  add a parameter to control for the diversity of the weak learners in each iteration .  The diversity that a new classifier could add in iteration t is defined as follows:
  • 18.  The diversity of a classier represents how much new information it could provide to a group of classifiers that have already been trained in Adaboost.
  • 19. EVALUATION OF THE CLASSIFIER
  • 20. EVALUATION OF THE CLASSIFIER
  • 21. EVALUATION OF THE CLASSIFIER  The accuracy of the classifier improved from 85.6% to 86.6%  The small margin suggests that the lexical features are strong enough in detecting information needs, while other types of features add little to the success.
  • 22. ANALYZING INFORMATION NEEDS  136,841,672 tweets conveying information need between July 10th 2011 to June 31st 2012.  This is roughly a proportion of 3% of all tweets, and 28.6% of tweets with question marks.
  • 23. GENERAL TREND  first 5 months  we normalize the time series so that the two curves are easier to be aligned on the plot
  • 27. BURSTINESS  we adopt a straightforward solution to detect similar burst events in the time series of information needs and the background.
  • 29.
  • 31.
  • 32. CONCLUSION  we present the first large-scale analysis of information needs, or questions, in Twitter.  We proposed an automatic classification algorithm that distinguishes real questions from tweets with question marks  We then present a comprehensive analysis of the large-scale collection of information needs we extracted.
  • 34. ABSTRACT  To predict whether a user that attended her first session in a particular Twitter chat group will return to the group, we build 5F Model that captures five different factors: individual initiative, group characteristics, perceived receptivity, linguistic affinity, and geographical proximity .
  • 35.  The research question we investigate is what factors ensure continued individual participation in a Twitter chat.
  • 36. 5F MODEL  Individual Initiative 1. user tweetcount denotes the number of tweets the user contributes to the session. 2. userurl denotes the number of urls the user contributes to the chat session. 3. usermentions is the total number of times the user mentions another (by using @). 4. userretweets is the number of retweets by the newcomer user and captures the amount of information she found to be worth sharing with her followers.
  • 37. 5F MODEL  Group Characteristics
  • 38. 5F MODEL  Perceived Receptivity
  • 39. 5F MODEL  Linguistic Af finity  We make use of Linguistic Inquir y and Word Count ( LIWC) to compare linguistic markers between a user and a group .  LIWC is a text analysis software that calculates the degree to which people use dif ferent categories of words across a wide array of texts  We consider the set of tweets a user ui shares in her first session as a text document and compute the value of each linguistic marker to obtain her LIWC-vector for that particular session.
  • 40. GEOGRAPHIC PROXIMIT Y  the distance d (in meters) between two users ui and uj
  • 41. DATA SET  Group Chats Studied  June 2010-July 2012.  By identifying the hours of high activity,we capture the sessions for each chat
  • 42.
  • 43. SALIENT STATISTICS  Distribution of the number of users in and outside chat sessions:
  • 44. SALIENT STATISTICS  Distribution of the number of distinct chats users  participate in:
  • 45. SALIENT STATISTICS  Degree distribution of education chat users:
  • 46. SALIENT STATISTICS  Geographical distribution of education chat users
  • 48. USER SURVEY  an online survey of 26 questions
  • 49. USER SURVEY  Usage, Advantages and Disadvantages
  • 51. CONCLUSIONS  We developed 5F Model that predicts whether a person attending her first chat session in a particular Twitter chat group will return to the group.  We performed statistical data analysis for thirty educational Twitter chats involving 71411 users and 730944 tweets over a period of two years .  We also complemented the results of statistical analysis with a survey study.