The document describes a system called qCrowd that identifies willing and able strangers on social media to answer questions. It analyzes features like past response rates, communication levels, and activity levels to predict response likelihood. An algorithm recommends subsets of people to maximize response rates. Experiments show the approach improves rates over baselines. Future work includes applying it to other platforms and information collection applications.
1. Recommending Targeted Strangers for
Answering Questions on Social Media
Jalal Mahmud, Michelle Zhou, Nimrod Megiddo, Jeffrey Nichols, Clemens Drews
IBM Research – Almaden
San Jose, CA
2. The Buzz of the Crowd
Hundreds of millions of people express
themselves on social media daily
– Location-based information
– Status update
– Sentiment about products or services
The buzz of the crowd creates a unique
opportunity for building a new type of
crowd-powered information collection system.
Such systems will actively identify and engage
the right people at the right time on social
media to elicit desired information. E.g.
- current wait time at a restaurant
- airport security wait time
- information in an emergency situation.
Our initial experiment, in which we manually
selected strangers based on their ability to
answer questions, achieved a 42% response rate
[Nichols et al. 2012]
•400+ million tweets daily
•3.2 billion Facebook likes
and comments daily
3. Our System - qCrowd
Monitors the Twitter stream to
identify relevant posts
Evaluates the authors of
identified posts and
recommends a sub-set of
people to engage
Generates questions and
sends them to each selected
person
Analyzes received responses
and synthesizes the answers
together
How to identify strangers who are willing, able and ready to provide requested
information?
- Ability to provide information is domain dependent (e.g., being at a location, or having
knowledge about a product/service), hence we use a set of rules to determine ability.
4. Key Contributions
Features
- Set of features that are likely to impact one’s willingness and readiness to
respond
Prediction of Response Likelihood
- A statistical model to infer the contribution of each feature to one’s willingness
and readiness, which are used to predict one’s likelihood to respond.
Recommendation Algorithm
- A recommendation algorithm that automatically selects a set of targeted
strangers to maximize the overall response rate of an information request.
Effectiveness
- Demonstrated effectiveness in real-world scenarios and insights for building a new
class of crowd-powered intelligent information collection systems.
5. Outline
Background - Buzz of the Crowd
Our System - qCrowd
Key Contributions
Active Engagement and Data Collection
Baselines
Features
Statistical Model & Recommendation Algorithm
Evaluation
Summary and Future Work
6. Active Engagement and Data Collection
TSA-tracker Question Datasets:
Our first two data sets were obtained
in the process of collecting location-based
information (airport security check
wait times) via Twitter.
- @bbx If you went through security at JFK,
can you reply with your wait time? Info will
be used to help other travelers.
Product Question Dataset:
Collected by asking people on Twitter who described their product/service experience.
- @johnny Trying to learn about tablets...sounds like you have Galaxy Tab 10.1. How fast
is it?
Domain          # of Questions   # of Responses   Response Rate
TSA-tracker-1        589              245              42%
TSA-tracker-2        409              134              33%
Product             1540              474              31%
7. Baseline: Asking Random Strangers
Sent questions to random people on Twitter.
@needy Doing a research about your local public safety.
Would you be willing to answer a related question?
@john Doing a survey about your local school system.
Would you be willing to answer a related question?
@dolly Collecting local weather data for a research.
Would you tell us what your local weather was last week?
Domain          # of Questions   # of Responses   Response Rate
Weather              187                7              3.7%
Public Safety        178                6              3.4%
Education            101                3              3.0%
It is ineffective to ask random strangers on social media without
considering their willingness, ability, or readiness to answer
8. Baseline: Crowd As Human Operator
Crowd-sourcing a human operator’s task to test a crowd’s ability to
identify the right targets.
We conducted two surveys on CrowdFlower – a crowd-sourcing
platform.
- Willingness Survey: Asked each participant to predict if a displayed Twitter
user would be willing to respond to a given question, assuming that the user
has the ability to answer.
- Readiness Survey: Asked each participant to predict how soon the person
would respond assuming that s/he is willing to respond.
The participants were also required to provide an explanation of
their predictions.
We wanted to know what criteria a crowd would use to identify
targeted strangers.
9. Willingness Survey
Randomly picked 200 users from each of our datasets:
- 100 participants from CrowdFlower.
- Each participant was given 2 randomly selected users for judgment.
- Participants were asked to predict if the displayed Twitter user would respond.
- Compared with the Twitter user's actual response (responded/not-responded).
Correctness:
- 29% correct when only the tweets of a user were displayed.
- 38% correct when the complete Twitter profile was displayed.
- The task of selecting users for question asking is also difficult for the crowd.
Top Predictors:
- Past responsiveness and interaction behavior (57.6%)
"The user seems extremely social, both asking questions and replying to others."
- Profile information (10.45%)
"Because him being a social media guy and his tagline saying 'we should hang out'."
- Personality (7.4%)
"I think he won't respond. Doesn't seem to be very friendly."
- Retweeting behavior (6%)
"No. Most of the tweets are retweets instead of anything personal."
- General tweeting activity (10.45%)
"This user tweets a lot, seems very chatty."
10. Readiness Survey
Participants judged how soon a person would respond to an information
request, assuming that the person would respond.
- Used a multiple-choice question with varied time windows as choices.
- Randomly selected 100 people from our collected datasets.
- Recruited 50 participants on CrowdFlower.
- Each of them was given two randomly chosen people and their
Twitter handles.
Computed prediction correctness
- Compared with ground truth.
- For example, if a participant predicted that person X would respond within an hour, but the
response was not received within that time, the prediction was counted as incorrect.
Correctness:
- 58% correct in making prediction.
Top Predictors:
- Activeness and steadiness of Twitter usage: 25%.
- Promptness of response: 30%.
11. Key Features for Selection of Strangers
Responsiveness Feature   Computation
Mean Response Time       Avg(T), where T denotes previous response times
Median Response Time     Med(T)
Mode Response Time       Mod(T)
Max Response Time        Max(T)
Min Response Time        Min(T)
Past Response Rate       NR/ND, where NR is the number of the user's responses and ND is
                         the number of direct questions the user was asked on Twitter
Proactiveness            NR/NI, where NI is the number of indirect questions the user was
                         asked on Twitter
We hypothesize that one’s willingness to respond to questions is
related to one’s past response behavior.
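The responsiveness features above can be sketched in Python. This is an illustrative transcription of the table's formulas, not the authors' code; the function and argument names are ours, and we split NR into separate counts for direct and indirect questions.

```python
from statistics import mean, median, mode

def responsiveness_features(response_times, n_responses_direct, n_direct,
                            n_responses_indirect, n_indirect):
    """Responsiveness features from a user's history (illustrative names).

    response_times: elapsed times (e.g., in minutes) of past responses;
    n_direct / n_indirect: direct and indirect questions the user received.
    """
    return {
        "mean_response_time": mean(response_times),
        "median_response_time": median(response_times),
        "mode_response_time": mode(response_times),
        "max_response_time": max(response_times),
        "min_response_time": min(response_times),
        # Past Response Rate = NR/ND; Proactiveness = NR/NI
        "past_response_rate": n_responses_direct / n_direct if n_direct else 0.0,
        "proactiveness": n_responses_indirect / n_indirect if n_indirect else 0.0,
    }
```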
12. Key Features for Selection of Strangers
Profile Features
We use the following profile-based features:
CountSocialWords - Count of the following phrases in the description field of the user profile:
{"social", "social media", "social network", "social networking", "friend", "tweet",
"twitter", "tweeting", "tweets", "tell", "telling", "talk", "talking", "communication",
"communicator"}.
- Adopted from the LIWC "social process" category and by observing words related to
modern social network activity.
- The intuition is that a user who has such words in her profile is more active and
engaging than others, and hence more likely to respond.
Activity Features
MsgCount - Number of status messages
DailyMsgCount - Number of status messages per day
Retweet Features
RetweetRatio - Ratio of the total number of retweets to the total number of tweets
DailyRetweetCount - Ratio of the total number of retweets to the total number of days
since the account was created.
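A minimal sketch of the profile, activity, and retweet features. We use only a subset of the phrase list for brevity, and count phrases as substrings, so overlapping phrases (e.g., "social" inside "social media") are counted separately; the slides do not specify this detail.

```python
# Subset of the phrase list above, for illustration only.
SOCIAL_PHRASES = ["social", "social media", "friend", "tweet", "talk"]

def count_social_words(description):
    """CountSocialWords: substring counts of the phrases in the user's
    profile description (overlapping phrases counted separately)."""
    text = description.lower()
    return sum(text.count(phrase) for phrase in SOCIAL_PHRASES)

def activity_and_retweet_features(n_tweets, n_retweets, account_age_days):
    """MsgCount, DailyMsgCount, RetweetRatio, DailyRetweetCount as above."""
    return {
        "msg_count": n_tweets,
        "daily_msg_count": n_tweets / account_age_days,
        "retweet_ratio": n_retweets / n_tweets,
        "daily_retweet_count": n_retweets / account_age_days,
    }
```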
13. Key Features for Selection of Strangers
Personality Features
Personality traits such as Friendliness and Extraversion are intuitively related
with one’s willingness to respond to questions.
Previous researchers have shown that word usage in one’s writings such as blogs
and essays is related with one’s personality.
Personality Feature Sets:
- LIWC (68 features). Example: Communication [admit, advice, affair*, apolog*, ...].
Computation: let g be a LIWC category, Ng the number of occurrences of words in that
category in one's tweets, and N the total number of words in his/her tweets; the score
for category g is then Ng/N.
- Big Five (5 features). Example: Extraversion. Computed using correlations with LIWC
features as reported by previous researchers (e.g., Yarkoni et al.).
- Big Five Facets (30 features). Examples: Friendliness, Anxiety. Computed using the
same correlations with LIWC features.
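The Ng/N scoring can be sketched as below, with a toy two-word lexicon standing in for the real LIWC dictionary (which has 68 categories and supports wildcard stems such as "apolog*"); the whitespace tokenization is our simplifying assumption.

```python
# Toy lexicon for illustration; not the real LIWC dictionary.
LIWC = {"communication": ["admit", "advice", "apolog*"]}

def liwc_score(category, tweets):
    """Score Ng/N for category g, as defined above: occurrences of
    category words over total words in the user's tweets."""
    words = " ".join(tweets).lower().split()
    exact = {w for w in LIWC[category] if not w.endswith("*")}
    prefixes = [w[:-1] for w in LIWC[category] if w.endswith("*")]
    n_g = sum(1 for w in words
              if w in exact or any(w.startswith(p) for p in prefixes))
    return n_g / len(words) if words else 0.0
```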
14. Key Features for Selection of Strangers
Readiness Features
Even if a person is willing to respond to questions, he/she may not be ready to
respond at the time of questioning.
Since one's readiness is highly context dependent (e.g., the mobile device used to
send answers is running out of battery) and often difficult to capture
computationally, we use several features to approximate one's readiness:
Readiness Feature                 Computation
Tweeting Likelihood of the Day    TD/N, where TD is the number of tweets sent by the user
                                  on day D and N is the total number of tweets
Tweeting Likelihood of the Hour   TH/N, where TH is the number of tweets sent by the user
                                  in hour H and N is the total number of tweets
Tweeting Steadiness               1/σ, where σ is the standard deviation of the elapsed
                                  time between consecutive tweets, computed from the
                                  user's most recent K tweets (where K is set, for
                                  example, to 20)
Tweeting Inactivity               TQ - TL, where TQ is the time the question was sent and
                                  TL is the time the user last tweeted
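The readiness formulas transcribe directly into code. A sketch under our own naming; we read "day" as day-of-week and measure inactivity in seconds, choices the slides leave open:

```python
from datetime import datetime
from statistics import pstdev

def readiness_features(tweet_times, question_time, k=20):
    """tweet_times: datetimes of the user's tweets; question_time: when
    we plan to ask. K defaults to 20 as in the slide."""
    n = len(tweet_times)
    recent = sorted(tweet_times)[-k:]
    gaps = [(b - a).total_seconds() for a, b in zip(recent, recent[1:])]
    sigma = pstdev(gaps) if len(gaps) > 1 else 0.0
    day, hour = question_time.strftime("%A"), question_time.hour
    return {
        "tweeting_likelihood_day": sum(t.strftime("%A") == day for t in tweet_times) / n,
        "tweeting_likelihood_hour": sum(t.hour == hour for t in tweet_times) / n,
        "tweeting_steadiness": 1.0 / sigma if sigma else float("inf"),
        "tweeting_inactivity": (question_time - max(tweet_times)).total_seconds(),
    }
```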
15. Feature Analysis
Significant Features
- Statistical significance test was done using Chi-square test with Bonferroni correction.
- For TSA-tracker-1 dataset, we found 42 significant features (FDR was 2.8%).
- For Product dataset, we found 31 features as significant (FDR 4.2%)
- For TSA-tracker-2 dataset, we found 11 significant features (FDR 11.2%)
Top-4 Features
Feature                          Feature type
Communication                    personality
Past response rate               responsiveness
Tweeting inactivity              readiness
Tweeting likelihood of the day   readiness
- Top-4 features were found using extensive experiments.

Dataset         Top ten statistically significant features
TSA-tracker-1   Past Response Rate, Tweet Inactivity, Negative Emotions, Cautiousness,
                Depression, Excitement-Seeking, DailyMsgCount, Intellect, Communication,
                Immoderation
TSA-tracker-2   Prepositions, Past, Exclusion, Sensation, Past Response Rate, Space,
                Tweeting Steadiness, Achievement-striving, Agreeableness, CountSocialWords
Product         Mode Response Time, Tweet Inactivity, Activity Level, Depression, Present,
                Cautiousness, Positive Emotion, Excitement-Seeking, DailyMsgCount, Past
                Response Rate
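The significance testing above can be sketched with SciPy's chi2_contingency. The contingency counts and the number of tested features here are made up for illustration; the slides report only the results, not the per-feature tables.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table for one feature: rows are high/low feature
# value, columns are responded / did not respond.
observed = [[10, 40],
            [30, 20]]

chi2, p, dof, expected = chi2_contingency(observed)

# Bonferroni correction: with m features tested, multiply each raw
# p-value by m (capped at 1) before comparing to the alpha level.
m = 110  # hypothetical number of features tested
p_adjusted = min(1.0, p * m)
significant = p_adjusted < 0.05
```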
16. Statistical Model & Recommendation
Statistical Model
Once features are computed, we trained statistical models such as Support
Vector Machines and Logistic Regression to predict the likelihood of response.
Binary Classification
Classify a person as a responder or non-responder and send questions to
people who are classified as responders.
Top-K Selection
Rank people according to the probabilities computed by the statistical model and
select the top K people to send questions to.
Our Recommendation Algorithm
Automatically selects a subset of people from a set of available people.
The subset selection is designed with the goal of maximizing the response rate.
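A minimal sketch of the model-training step, using scikit-learn (our library choice; the slides do not name one) and synthetic stand-in data rather than the actual feature matrices:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: rows are users, columns are four of the features
# described earlier; y marks who actually responded.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

svm = SVC(probability=True).fit(X, y)   # probability=True enables predict_proba
logit = LogisticRegression().fit(X, y)

# Response likelihoods, used by Top-K selection and by the
# recommendation algorithm for ranking:
probs = logit.predict_proba(X)[:, 1]
top_k = np.argsort(probs)[::-1][:20]    # indices of the 20 most likely responders
```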
17. Recommendation Algorithm
Ranks people in the training set in order of non-decreasing
probability.
Finds the best interval, i.e., the one with the maximum response rate among all
interval subsets in the linear order.
Computes the best subinterval in the test set from the best subinterval in the
training set using simple linear projection.
Can apply various constraints on selecting a minimum/maximum/exact
number of people.
- select at-least/at-most/exactly K% of people
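The steps above can be sketched as follows. This is our brute-force illustration of the interval search under an "at least K% of people" constraint, not the authors' implementation, which may use a more efficient search:

```python
def best_interval(labels, min_frac=0.05):
    """labels: responded (1) / not (0) flags for training users, sorted by
    predicted probability in non-decreasing order. Returns the interval
    (start, end) with the highest response rate among intervals covering
    at least min_frac of the users, plus that rate."""
    n = len(labels)
    min_len = max(1, int(min_frac * n))
    prefix = [0]
    for v in labels:                      # prefix sums of responders
        prefix.append(prefix[-1] + v)
    best, best_rate = (0, n - 1), -1.0
    for i in range(n):
        for j in range(i + min_len - 1, n):
            rate = (prefix[j + 1] - prefix[i]) / (j - i + 1)
            if rate > best_rate:
                best, best_rate = (i, j), rate
    return best, best_rate

def project(interval, n_train, n_test):
    """Map a training-set interval onto the test-set ranking by simple
    linear scaling, as in the last step above."""
    i, j = interval
    return int(i * n_test / n_train), int(j * n_test / n_train)
```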
18. Evaluation
Evaluating Prediction Model
            TSA-tracker-1      TSA-tracker-2      Product
            SVM     Logistic   SVM     Logistic   SVM     Logistic
Precision   0.62    0.60       0.52    0.51       0.67    0.654
Recall      0.63    0.61       0.53    0.55       0.71    0.62
F1          0.625   0.606      0.525   0.53       0.689   0.625
AUC         0.657   0.599      0.592   0.514      0.716   0.55
5-fold cross validation experiments
AUC – Area Under ROC Curve
F1 = Harmonic mean of precision and recall
Our models are 60-70% correct in making a prediction.
19. Evaluating Recommendation Algorithm
Comparison of Average Response Rates using Different Approaches
                        TSA-tracker-1   TSA-tracker-2   Product
Baseline                    42%             33%           31%
Binary-classification       62%             52%           67%
Top-K-Selection             61%             54%           67%
Our Algorithm               67%             56%           69%
Baseline is the response rate achieved by a human operator during data collection
Used "asking at least K% of people from the original set" as a constraint to
search for the interval that maximizes the response rate.
- Computed the response rates with varied K (e.g., K = 5%, ..., 90%) to find the
respective optimal intervals.
Computed the response rates achieved using simple binary classification (the response
rate is the precision of the predictive model) and by simply selecting the top K
(e.g., K = 5%, ..., 90%) people by their computed probabilities.
20. Recall of Recommendation
Selecting K% people   Response Rate   Recall
25%                   76%             37%
50%                   68%             64%
75%                   53%             82%
100%                  31%             100%
Response Rate and Recall for Our Algorithm with Fixed Size (Product Data, all features, SVM Model)
Trade-off between response rate and recommendation recall, which
captures the ratio of the actual responders our algorithm identifies for
sending questions.
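The two metrics in the table come from the same counts. A small sketch (the function and argument names are ours, not from the slides):

```python
def response_rate_and_recall(selected, responders):
    """selected: ids of users the algorithm chose to ask; responders: ids
    of users who actually respond. Response rate is hits over the number
    asked; recall is hits over all actual responders."""
    hits = len(selected & responders)
    return hits / len(selected), hits / len(responders)
```

Selecting more people raises recall but dilutes the response rate, which is the trade-off the table shows.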
21. Use of Different Feature Sets
Feature Set                   Response Rate
                              TSA-tracker-1   TSA-tracker-2   Product
All                           0.79            0.72            0.78
Significant                   0.83            0.75            0.82
Top-10 Significant            0.83            0.74            0.81
Top-4 Features                0.82            0.73            0.83
Common Significant Features   0.81            0.72            0.82
Selecting at least 5% of people, SVM model
Comparison of Average Response Rates
22. Live Experiments
Used Twitter’s Search API and a set of rules to find 500 users who
mentioned that they were at any US airport in their tweets.
- Randomly asked 100 users for the security wait time.
- Used our algorithm to identify 100 users for questioning from the remaining 400 users.
- Used the SVM-based model with the identified significant features.
- Waited 48 hours for the responses.
The same process was repeated for sending product questions.
Large improvement of response rate in a live setting.
Live Experiment   Random Selection   Our Algorithm
TSA-Tracker-1          29%               66%
Product                26%               60%
23. Summary & Future Work
We focused on modeling users' willingness and readiness to answer questions.
We can predict one's likelihood of responding to questions and identified a sub-set
of features that have significant prediction power.
Our experiments including the live one in a real-world setting demonstrated our
approach’s effectiveness in maximizing the response rate.
Future Work
- Applicability
- Apply to other social media platforms
- Apply to other information collection applications
- Handling Skew in the user base
- Identify inactive users similar to active users in terms of personality
- Modeling the fitness of a stranger to engage
- Develop model to receive high quality response
- Model dutifulness and trustworthiness of users
- Handling complex situations
- Incorporate various costs/benefits which might change over time
- Develop model to maximize expected net benefit
- Handling unexpected answers
- Incorporate voluntary responses from people on social media
- Grow potential targets
- Protecting Privacy
- Tune the selection algorithm to exclude people who are concerned about privacy