Invited keynote talk at the 1st Workshop on Quality, Motivation and Coordination of Open Collaboration @ the International Conference on Social Informatics 2013
From User Needs to Community Health: Mining User Behaviour to Analyse Online Communities
1. FROM USER NEEDS TO COMMUNITY HEALTH: MINING USER BEHAVIOUR TO ANALYSE ONLINE COMMUNITIES
DR. MATTHEW ROWE
SCHOOL OF COMPUTING AND COMMUNICATIONS
@MROWEBOT | M.ROWE@LANCASTER.AC.UK
Invited Talk @ 1st Workshop on Quality, Motivation and Coordination,
International Conference on Social Informatics 2013. Kyoto, Japan
2. About Me
2002-2006: M.Eng Software Engineering (Undergrad)
2006-2010: Ph.D. Computer Science (Postgrad)
2010-2012: Postdoc Research Associate (Postdoc)
2012-now: Lecturer in Social Computing (Lecturing)
3. Research Interests
Semantics
Social networks
Digital Identity
Data
Forecasting + Classification
Data Mining
Disambiguation
Automating Processes
Modelling Social Systems
Artificial Intelligence
Machines
Prediction
http://scholar.google.com/citations?user=rhyR4_kAAAAJ
4. Collaborators
Harith Alani. Senior Lecturer, Knowledge Media Institute, The Open University, UK.
http://people.kmi.open.ac.uk/harith/
Miriam Fernandez. Research Associate, Knowledge Media Institute, The Open University, UK.
http://kmi.open.ac.uk/people/member/miriamfernandez
Conor Hayes. Senior Research Fellow, Digital Enterprise Research Institute, Galway, Ireland.
http://www.deri.ie/users/conor-hayes
Marcel Karnstedt. Senior Postdoctoral Researcher, Digital Enterprise Research Institute, Galway, Ireland.
http://www.marcel.karnstedt.com/
5. Outline
- Part I: Online Communities and User Behaviour
  - Define: online communities, user behaviour
  - The potential for examining user behaviour
- Part II: Comparing User Behaviour and User Needs
  - Collecting users’ needs in online communities
  - Linking needs to behaviour
- Part III: Predicting Community Health from User Behaviour
  - Mining roles from user behaviour
  - Community health forecasting from collective behaviour
6. Part I: Online Communities and User Behaviour
7. Defining Online Communities
a) Distinct user containers in which users discuss a given topic
  - E.g. message board forums
  - E.g. question-answering systems
b) Latent grouping of users by some common attribute
  - E.g. semantic web community
  - E.g. social network clusters with high social homophily
- This talk focuses on: a) User containers
8. BT (British telecommunications firm) use online communities to enable consumers to provide support to other consumers
BBC News web site provides comments sections to encourage user engagement with the news
Question-answering systems allow communities of ‘knowledgeable’ users to ask questions and provide answers
9. Why Provide Online Communities?
- Increase Customer Loyalty
- Understanding Product Issues
- Facilitating Idea Generation
- Raising Brand Awareness
- Spreading through Word of Mouth
10. Managing Online Communities
- Online communities incur significant investments:
  - Hosting and bandwidth:
    - Cost (time + money) grows linearly with popularity
  - Community management:
    - Settling disputes
    - Encouraging engagement within the communities
- Common questions arise:
  - ‘How do I know if my community is healthy?’
  - ‘What changes in the community lead to it becoming unhealthy?’
11. How do I know if my community is ‘healthy’?
- Approach 1: Needs Satisfaction
  - Identify users’ needs for the community
  - Analyse users to see if their needs have been met
- Approach 2: Numerical Health Measures
  - Determine suitable measures for community health (e.g. churn rate)
  - Analyse these measures over time to see if the community is remaining healthy, or not
12. Analysing User Behaviour
- Online communities are behavioural ecosystems
  - Prevalent user behaviour can impact the behaviour of other users (Preece, 2000)
‘the way’
‘tangible measures derived from actions performed by and upon a user’
13. Behaviour Features
[Figure: diagram relating User, Post and Forum]
‘tangible measures of actions performed by and upon a user’
- Initiation
  - The extent to which users begin discussions in a community
- Contribution
  - The extent to which the user is providing content
- Popularity
  - Proportion of the community that responds to the user
- Engagement
  - Proportion of the community that a user responds to
- Focus Dispersion
  - Variance of the user’s interests across topics
- Quality
  - Reception of the user’s content by other users
14. Part II: Comparing User Behaviour and User Needs
15. Maslow’s Hierarchy of Needs
How does this hierarchy resonate with online community users?
16. User Needs in Online Communities
- Users have different needs for participating in an online community:
  - To create content and share information
  - To communicate with other users
  - To ask questions
  - To collaborate with other users
  - To help other users resolve problems and issues
  - To discuss ideas
- We wanted to find out how important the above needs were to community users…
17. Dataset 1: IBM Connections
- Enterprise social software suite
  - Communities within enterprises
- Anonymised dataset (Jan 2010 -> April 2011)
  - #Communities of Practice (CoP): 100
  - #Team Communities (Team): 72
  - #Technical Support (Tech): 14
- Labels provided by (Muller et al. 2012)
18. Understanding Users’ Needs on IBM Connections
- Surveyed 186 users about their needs
  - Spanning the aforementioned typed communities
  - 150 responses
- Likert scale (1-5) for agreement with statements
- Examples included:
  - How often do you do the following?
    - Browse for information, Search for information, etc.
  - Rate how important the community features are to you?
    - Receiving recommendations, ability to filter information, etc.
19. Users’ Needs on IBM Connections
Ranked Community Features:
D3.1: Report on Social, Technical and Corporate Needs in Online Communities. M Rowe, H Alani, S Angeletou and G Burel. ROBUST Deliverable 3.1. (2012)
20. Users’ Needs on IBM Connections
D3.1: Report on Social, Technical and Corporate Needs in Online Communities. M Rowe, H Alani, S Angeletou and G Burel. ROBUST Deliverable 3.1. (2012)
21. User Behaviour on IBM Connections
- Measured the behaviour of users across the three IBM Connections community types

Table 2. Mean and Standard Deviation (in parentheses) of the distribution of micro features within the different community types
Feature        CoP               Team              Tech
Focus Dis’     1.682 (1.680)     1.391 (1.581)     1.382 (1.534)
Initiation     7.788 (21.525)    13.235 (23.361)   3.088 (6.676)
Contribution   26.084 (77.607)   21.130 (72.298)   11.753 (17.182)
Popularity     1.660 (3.647)     2.302 (2.900)     2.286 (3.920)
Engagement     1.016 (1.556)     1.948 (2.324)     1.036 (1.575)

Popularity is higher in Team and Tech communities, but not significantly, than in CoP, suggesting that although users of the latter community provide more contributions, it is with content published by fewer users. For Engagement the mean is significantly highest (at < 0.001) for Team, indicating that users tend to participate with more users in these communities than the others.
We induce an empirical cumulative distribution function (ECDF) for each micro feature within each community and then qualitatively analyse how the curves of the functions differ across communities. For instance, in the case of Figure 3 we see that for Focus Dispersion Tech communities have the high[…]
Behaviour analysis across different types of Enterprise Online Communities. M Rowe, M Fernandez, H Alani, I Ronen, C Hayes and M Karnstedt. In the proceedings of the Web Science Conference. Evanston, US. (2012)
23. Linking Users’ Needs to User Behaviour
- Questionnaire questions related to different behaviour aspects (initiation, contribution, etc.)
- Mapped questions to these aspects:
  - E.g. Initiation questions included:
    - How often do you ask a question?
    - How often do you create content?
    - How often do you announce work events and news?
- Resulted in average Likert-scale value response per behaviour aspect across community types
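The aggregation described on this slide can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the question identifiers, the `QUESTION_ASPECT` mapping and the sample scores are all hypothetical, and only the idea of averaging Likert responses per behaviour aspect and community type comes from the slide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from questionnaire question IDs to behaviour aspects.
QUESTION_ASPECT = {
    "ask_question": "Initiation",
    "create_content": "Initiation",
    "announce_events": "Initiation",
    "reply_to_others": "Engagement",
}


def aspect_means(responses):
    """Average Likert scores per (community type, behaviour aspect).

    `responses` is an iterable of (community_type, question_id, score)
    tuples, where score is on the 1-5 Likert scale.
    """
    grouped = defaultdict(list)
    for community_type, question_id, score in responses:
        aspect = QUESTION_ASPECT[question_id]
        grouped[(community_type, aspect)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


# Made-up responses, for illustration only.
sample = [
    ("CoP", "ask_question", 2),
    ("CoP", "create_content", 3),
    ("Team", "ask_question", 4),
    ("Team", "reply_to_others", 3),
]
result = aspect_means(sample)
```

Here `result` maps each (community type, aspect) pair to one average value, which is the shape of the questionnaire-derived table shown on the following slide.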
24. Linking Users’ Needs to User Behaviour
User Needs from Questionnaire Responses:
[…] type responses, e.g. taking the mean of the responses for all Initiation questions for the 95 CoPs. The set of results can be seen in Table 4.

Table 4. Mean and standard deviation (in parentheses) values of micro features obtained using the questionnaires for the different community types
Feature        CoP              Team             Tech
Focus Dis’     4.019 (0.093)    3.055 (0.426)    4.070 (0.070)
Initiation     2.483 (0.838)    2.587 (0.838)    2.243 (0.873)
Contribution   3.239 (0.926)    3.202 (1.016)    3.158 (0.945)
Popularity     2.875 (0.070)    3.084 (0.168)    2.104 (0.173)
Engagement     2.844 (0.539)    3.027 (0.588)    2.406 (0.522)

Observed User Behaviour:
The mean of the third micro feature, Contribution, is highest for CoP (but not significantly higher than the others), indicating that more initiated content is interacted with than in the other communities.

Table 2. Mean and Standard Deviation (in parentheses) of the distribution of micro features within the different community types
Feature        CoP               Team              Tech
Focus Dis’     1.682 (1.680)     1.391 (1.581)     1.382 (1.534)
Initiation     7.788 (21.525)    13.235 (23.361)   3.088 (6.676)
Contribution   26.084 (77.607)   21.130 (72.298)   11.753 (17.182)
Popularity     1.660 (3.647)     2.302 (2.900)     2.286 (3.920)
Engagement     1.016 (1.556)     1.948 (2.324)     1.036 (1.575)

As Table 4 demonstrates, the findings from the analysis highly correlate with what users expressed to be relevant for each community type. We previously found that high levels of Initiation and Contribution are discriminative factors of Team and CoP communities with respect to Tech communities. Additionally, by looking at the behaviour distributions […]
25. Understanding Needs Satisfaction
- Agreement between users’ needs and how users behave
  - Reflected by the different needs values across the different community types
- Limitations of this approach:
  1. Expensive to collect survey responses
     - Took around 6 months between questionnaire publication and results compilation
     - Required contacting many users
  2. Implicit biases in reporting across community types
     - Team communities had the lowest % of responses
26. Part III: Predicting Community Health from User Behaviour
27. Community Health and User Behaviour
- Management of communities is helped by:
  - Understanding how behaviour and health are related
    - How user behaviour changes are associated with health
  - Predicting health changes
    - Enables early decision making on community policy
- Can we accurately detect changes in community health from the behaviour of its users?
28. Dataset 2: SAP Community Network
- Collection of SAP forums in which users discuss:
  - Software development, SAP Products, Usage of SAP tools
- Points system for awarding best answers
- Provided with a dataset covering 33 communities:
  - Spanning 2004 - 2011
  - 95,200 threads, 421,098 messages, 32,942 users
[Figure: Post Count (0 to 1400) over time, 2004 to 2011]
29. User Behaviour Features on SAP
- Focus Dispersion
  - Measure: Forum entropy of the user
- Engagement
  - Measure: Out-degree proportioned by potential maximal out-degree
- Initiation
  - Measure: Proportion of threads that were initiated by the user
- Contribution
  - Measure: Proportion of thread replies created by the user
- Popularity
  - Measure: In-degree proportioned by potential maximal in-degree
- Quality
  - Measure: Average points per post awarded to the user
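As a rough sketch of how these six measures could be computed from raw forum posts, consider the code below. The post record layout (`author`, `forum`, `reply_to`, `points`) is hypothetical, and taking the "potential maximal" degree to be the number of other users in the window is an assumption; the slide does not specify either.

```python
import math
from collections import Counter


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def behaviour_features(posts, user):
    """Compute the six features for `user` over one time window of posts."""
    users = {p["author"] for p in posts}
    own = [p for p in posts if p["author"] == user]
    starters = [p for p in posts if p["reply_to"] is None]
    replies = [p for p in posts if p["reply_to"] is not None]

    # Focus dispersion: entropy (base 2) of the user's posts over forums.
    forum_counts = Counter(p["forum"] for p in own)
    dispersion = -sum(
        (c / len(own)) * math.log2(c / len(own)) for c in forum_counts.values()
    )

    # Distinct users who replied to `user`, and whom `user` replied to.
    repliers = {p["author"] for p in replies if p["reply_to"] == user} - {user}
    replied_to = {p["reply_to"] for p in own if p["reply_to"] is not None} - {user}
    max_degree = len(users) - 1  # assumed "potential maximal" degree

    return {
        "focus_dispersion": dispersion,
        "engagement": ratio(len(replied_to), max_degree),
        "initiation": ratio(sum(p["author"] == user for p in starters), len(starters)),
        "contribution": ratio(sum(p["author"] == user for p in replies), len(replies)),
        "popularity": ratio(len(repliers), max_degree),
        "quality": ratio(sum(p["points"] for p in own), len(own)),
    }


# Made-up posts, for illustration only; reply_to=None marks a thread starter.
posts = [
    {"author": "alice", "forum": "A", "reply_to": None, "points": 0},
    {"author": "bob", "forum": "A", "reply_to": "alice", "points": 10},
    {"author": "carol", "forum": "A", "reply_to": "alice", "points": 0},
    {"author": "alice", "forum": "B", "reply_to": None, "points": 0},
    {"author": "bob", "forum": "B", "reply_to": "alice", "points": 6},
    {"author": "alice", "forum": "B", "reply_to": "bob", "points": 2},
]
features = behaviour_features(posts, "alice")
```

In this toy window alice starts every thread and is replied to by everyone else, so her Initiation and Popularity are both 1.0 while her Contribution is only 0.25.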
30. Inferring Roles from User Behaviour
1. Construct features for community users at a given time step
2. Derive bins using equal frequency binning
   - Popularity-low cutoff = 0.5, Initiation-high cutoff = 0.4
3. Use skeleton rule base to construct rules using bin levels
   - Popularity = low, Initiation = high -> roleA
   - Popularity < 0.5, Initiation > 0.4 -> roleA
4. Apply rules to infer user roles and community composition
5. Repeat 1-4 for following time steps
Community Analysis through Semantic Rules and Role Composition Derivation. M Rowe, M Fernandez, S Angeletou and H Alani. In the Journal of Web Semantics (2012)
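Steps 2 to 4 can be sketched as below. This is only an illustration under stated assumptions: three bins labelled low/mid/high, a skeleton rule base containing just the `roleA` rule from the slide, and made-up feature values. The published approach expresses its rules semantically, which this sketch does not attempt.

```python
from collections import Counter


def equal_frequency_cutoffs(values, n_bins=3):
    """Return the n_bins - 1 cutoffs that split the sorted values into equal-sized bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def level(value, cutoffs, labels=("low", "mid", "high")):
    """Map a value to the label of the bin it falls into."""
    for cutoff, label in zip(cutoffs, labels):
        if value < cutoff:
            return label
    return labels[len(cutoffs)]


# Skeleton rule base: feature levels -> role. Only roleA is taken from the slide.
SKELETON_RULES = [({"popularity": "low", "initiation": "high"}, "roleA")]


def infer_roles(users):
    """Bin each feature across all users, then apply the skeleton rules."""
    feature_names = next(iter(users.values())).keys()
    cutoffs = {
        f: equal_frequency_cutoffs([u[f] for u in users.values()])
        for f in feature_names
    }
    roles = {}
    for name, feats in users.items():
        levels = {f: level(v, cutoffs[f]) for f, v in feats.items()}
        roles[name] = next(
            (role for cond, role in SKELETON_RULES
             if all(levels[f] == lv for f, lv in cond.items())),
            None,  # no rule matched
        )
    return roles


# Made-up feature values for six users at one time step.
users = {
    "u1": {"popularity": 0.1, "initiation": 0.9},
    "u2": {"popularity": 0.2, "initiation": 0.1},
    "u3": {"popularity": 0.3, "initiation": 0.8},
    "u4": {"popularity": 0.4, "initiation": 0.2},
    "u5": {"popularity": 0.5, "initiation": 0.3},
    "u6": {"popularity": 0.6, "initiation": 0.7},
}
roles = infer_roles(users)
# Community composition: proportion of users holding each role.
composition = {r: c / len(roles) for r, c in Counter(roles.values()).items()}
```

Because the cutoffs are re-derived from the data at each time step, the same rule (`Popularity = low, Initiation = high`) translates into different numeric thresholds as the community changes.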
31. Mining Roles (Skeleton rule base compilation)
1. Select the tuning segment
2. Discover correlated behaviour dimensions
   - Removed Engagement and Contribution, kept Popularity (Pearson r > 0.75, p < 0.01)
3. Cluster users into behavioural groups
4. Derive role labels for clusters

[…] as a parameter k. To judge the best model, i.e. cluster method and number of clusters, we measure the cohesion and separation of a given clustering as follows: For each clustering algorithm (Ψ) we iteratively increase the number of clusters (k) to use where 2 ≤ k ≤ 30. At each increment of k we record the silhouette coefficient produced by Ψ; this is defined for a given element (i) in a given cluster as:

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}    (3)

Where a_i denotes the average distance to all other items in the same cluster and b_i is given by calculating the average distance with all other items in each other distinct cluster and then taking the minimum distance. The value of s_i ranges between −1 and 1 where the former indicates a poor clustering where distinct items are grouped together and the latter indicates perfect cluster cohesion and separation. To derive the silhouette coefficient s(Ψ(k)) for the entire clustering we take the average silhouette coefficient of all items. We find that the best clustering model and number of clusters to use is K-means with 11 clusters. We found that for smaller cluster numbers (k = [3, 8]) each clustering algorithm achieves comparable performance, however as we begin to increase the cluster numbers K-means improves while the two remaining algorithms produce worse cohesion and separation.

Deriving Role Labels: Provided with the most cohesive and separated clustering of users we then derive role labels for each cluster. Role label derivation first involves inspecting the dimension distribution in each cluster and aligning the distribution with a level mapping (i.e. low, mid, high). This enables the conversion of continuous dimension ranges into discrete values which our rule-based approach requires in the Skeleton Rule Base. To perform this alignment we assess the […]

[…] decision node, we measure the entropy of the dimensions and their levels across the clusters, we then choose the dimension with the largest entropy. This is defined formally as:

H(dim) = -\sum_{level}^{|levels|} p(level \mid dim) \log p(level \mid dim)    (4)

TABLE II. Mapping of cluster dimensions to levels. The clusters are ordered from low patterns to high patterns to aid legibility.
Cluster   Dispersion   Initiation   Quality   Popularity
1         L            L            L         L
0         L            M            H         L
6         L            H            M         M
10        L            H            M         H
4         L            H            H         M
2,5       M            H            L         H
8,9       M            H            H         H
7         H            H            L         H
3         H            H            H         H

Role labels:
- 1 - Focussed Novice
- 2,5 - Mixed Novice
- 7 - Distributed Novice
- 3 - Distributed Expert
- 8,9 - Mixed Expert
- 0 - Focussed Expert Participant
- 4 - Focussed Expert Initiator
- 6 - Knowledgeable Member
- 10 - Knowledgeable Sink

[Figure: Fig. 2. Boxplots of the feature distributions (Dispersion, Initiation, Quality, Popularity) in each of the 11 clusters. Feature distributions are matched against the feature levels derived from equal-frequency binning]
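Equations (3) and (4) can be illustrated with a short sketch. Several things here are assumptions rather than details from the source: Euclidean distance, a score of 0 for singleton clusters, base-2 logarithms, and estimating p(level|dim) over the nine rows of Table II as printed (so merged clusters such as 2,5 count once). The points and labels are made up.

```python
import math
from collections import Counter


def silhouette(points, labels):
    """Mean silhouette coefficient, s_i = (b_i - a_i) / max(a_i, b_i).

    Requires at least two clusters; Euclidean distance is assumed.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        same = clusters[label]
        if len(same) == 1:
            scores.append(0.0)  # convention for singleton clusters (assumption)
            continue
        # a_i: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in same) / (len(same) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def level_entropy(levels):
    """H(dim) over the levels that one dimension takes across clusters."""
    counts = Counter(levels)
    return -sum((c / len(levels)) * math.log2(c / len(levels)) for c in counts.values())


# Made-up 2-D points with a good and a poor cluster assignment.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])
poor = silhouette(points, [0, 1, 0, 1])

# Levels per dimension, read row by row from Table II.
TABLE_II = {
    "Dispersion": "LLLLLMMHH",
    "Initiation": "LMHHHHHHH",
    "Quality": "LHMMHLHLH",
    "Popularity": "LLMHMHHHH",
}
entropies = {dim: level_entropy(levels) for dim, levels in TABLE_II.items()}
best_dimension = max(entropies, key=entropies.get)
```

Under these assumptions the good assignment scores close to 1, the poor one scores below 0, and Quality is the dimension with the largest entropy in Table II.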
32. Community Health Indicators
- From the literature there is no single agreed measure of ‘community health’
  - Emergent dimensions: loyalty, participation, activity, social capital
- Indicator 1: Churn Rate (loyalty)
  - Proportion of users that remain
- Indicator 2: User Count (participation)
  - Number of active contributors
- Indicator 3: Seeds-to-Non-Seeds Posts Proportion (activity)
  - Ratio of thread starters that received replies (seeds) to those that did not (non-seeds)
- Indicator 4: Clustering Coefficient (social capital)
  - Average of users’ clustering coefficients
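Three of these indicators can be sketched directly from a window of posts, as below. The record layout (`author`, `thread`, `reply_to`) is hypothetical, the reply graph is treated as undirected when computing clustering coefficients, and users with fewer than two neighbours are given a coefficient of 0; all three choices are assumptions. Churn rate is left out because it needs activity from more than one window.

```python
from itertools import combinations


def user_count(posts):
    """Number of distinct users who posted in the window."""
    return len({p["author"] for p in posts})


def seeds_to_non_seeds(posts):
    """Ratio of threads whose starter received a reply to threads that did not."""
    started = {p["thread"] for p in posts if p["reply_to"] is None}
    replied = {p["thread"] for p in posts if p["reply_to"] is not None}
    seeds = len(started & replied)
    non_seeds = len(started - replied)
    return seeds / non_seeds if non_seeds else float("inf")


def average_clustering(posts):
    """Mean local clustering coefficient of the undirected reply graph."""
    neighbours = {}
    for p in posts:
        neighbours.setdefault(p["author"], set())
        target = p["reply_to"]
        if target is not None and target != p["author"]:
            neighbours.setdefault(target, set())
            neighbours[p["author"]].add(target)
            neighbours[target].add(p["author"])
    coefficients = []
    for nbrs in neighbours.values():
        k = len(nbrs)
        if k < 2:
            coefficients.append(0.0)  # assumption for isolated or degree-1 users
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbours[u])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)


# Made-up window: one thread with a reply triangle, one unanswered thread.
window = [
    {"author": "alice", "thread": "t1", "reply_to": None},
    {"author": "bob", "thread": "t1", "reply_to": "alice"},
    {"author": "carol", "thread": "t1", "reply_to": "bob"},
    {"author": "alice", "thread": "t1", "reply_to": "carol"},
    {"author": "dave", "thread": "t2", "reply_to": None},
]
indicators = {
    "user_count": user_count(window),
    "seeds_to_non_seeds": seeds_to_non_seeds(window),
    "clustering_coefficient": average_clustering(window),
}
```

Computing the same dictionary for each successive window gives the per-indicator time series that the experiments in this part of the talk work with.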
33. Experiment 1: Health Indicator Regression
- Community management is helped by understanding the relation between behaviour and health
- Experimental Setup:
  - Health Indicator Linear Regression Models (per community)
    - Independent vars: 9 roles with composition proportions as values @ t
      - E.g. @ t = k: Mixed Expert = 0.05, Distributed Novice = 0.51, etc.
    - Dependent var: health indicator (e.g. churn rate) @ t
      - E.g. @ t = k: Churn Rate = 0.21
  - PCA of each community model using the model’s coefficients
    - Look for a common health composition pattern
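A per-community linear model of this kind can be sketched with ordinary least squares. The code below is a simplified illustration, not the experiment itself: it uses only two role proportions instead of nine, synthetic targets generated from known coefficients, and a hand-written normal-equations solver. Note that if all role proportions were included and summed to one, they would be collinear with an intercept term, so a real model would need to drop a role or the intercept; how the original study handled this is not stated on the slide.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)  # [intercept, coefficient per role]


# Made-up role proportions per time step: (Mixed Expert, Distributed Novice).
compositions = [(0.05, 0.51), (0.10, 0.40), (0.20, 0.35), (0.15, 0.45), (0.30, 0.20)]
# Synthetic health indicator generated from known coefficients, for checking only.
churn = [0.1 + 0.5 * expert + 0.2 * novice for expert, novice in compositions]
intercept, expert_coef, novice_coef = fit_linear(compositions, churn)
```

The fitted coefficients of each community's model are the values that the slide then feeds into PCA to look for a common composition pattern.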
35. Experiment 2: Health Change Detection
- Can we accurately and effectively detect positive and negative changes in community health from its composition of behavioural roles?
- Experimental Setup
  - Binary classification of indicator change using logistic regression
  - At t=k+1: predict increase or decrease in health indicator from t=k
  - Time-ordered dataset:
    - Features @ t=k+1: 9 roles with composition proportions as values
    - Class @ t=k+1: positive (if increase from t=k), negative (if decrease)
    - Divide dataset into 80/20 split maintaining time-ordering
  - Evaluated using Area under the ROC Curve (AUC)
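The dataset construction and evaluation steps of this setup can be sketched as follows. The classifier itself is omitted: any model that outputs a score per example (such as logistic regression) could be plugged in, and the scores used below are made up. Skipping time steps where the indicator does not change is an assumption, since the slide only defines the increase and decrease classes.

```python
def build_examples(compositions, indicator):
    """Pair the role composition at step t with the sign of the indicator change from t-1."""
    examples = []
    for t in range(1, len(indicator)):
        if indicator[t] == indicator[t - 1]:
            continue  # no change: neither class applies (assumption)
        label = 1 if indicator[t] > indicator[t - 1] else 0
        examples.append((compositions[t], label))
    return examples


def time_ordered_split(examples, train_fraction=0.8):
    """Split without shuffling, so the test set is strictly later than the training set."""
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up series: one (two-role) composition and one indicator value per time step.
compositions = [(0.1, 0.5), (0.2, 0.4), (0.1, 0.6), (0.3, 0.3), (0.4, 0.2), (0.2, 0.5)]
indicator = [0.20, 0.25, 0.22, 0.24, 0.30, 0.28]
examples = build_examples(compositions, indicator)
train, test = time_ordered_split(examples)

# Made-up classifier scores, only to show the AUC computation.
perfect = auc([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
chance = auc([0.9, 0.8, 0.3, 0.4], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to the random baseline that the ROC curves on the results slide are compared against.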
36. Experiment 2: Health Change Detection Results
- ROC Curves surpass baseline for:
  - Churn rate: 20/25 forums
  - User Count: 20/25 forums
  - Seeds-to-Non-Seeds: 19/25 forums
  - Clustering Coefficient: 17/25 forums
[Figure: ROC curves (TPR against FPR) in four panels: Churn Rate, User Count, Seeds / Non-seeds Prop, Clustering Coefficient]
What makes Communities Tick? Community Health Analysis using Role Compositions. M Rowe and H Alani. In the proceedings of the Fourth IEEE International Conference on Social Computing. Amsterdam, The Netherlands. (2012)
37. To Summarise
38. Findings
- User Behaviour is closely aligned with users’ needs
  - Although this is expensive to collect and analyse
- Accurate predictions of community health from behaviour
  - Inferring roles from collective behaviour
  - Forecasting from role compositions
- Community Managers can understand how their community will develop from user behaviour
  - Requires model tuning per-community
Community Analysis through Semantic Rules and Role Composition Derivation. M Rowe, M Fernandez, S Angeletou and H Alani. In the Journal of Web Semantics (2012)
39. Current/Future Work: Lifecycles
- Limitation of role-composition approach is the use of platform-wide windowing:
  - Lack of high-fidelity behaviour inspection per-user
- Lifecycle periods: user-specific stages of development
  - Divide lifetime into equal activity periods
[Figure: a user's timeline from First Post to Last Post divided into periods 1, 2, 3, …, n, each containing an equal number of posts (#posts)]
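Dividing a lifetime into equal activity periods can be sketched as below. This is a minimal illustration: posts are represented only by sortable timestamps, and when the post count is not divisible by the number of periods the chunks differ in size by at most one, which is an assumption about how remainders are handled.

```python
def lifecycle_periods(timestamps, n_periods):
    """Split a user's posts into n_periods consecutive chunks of near-equal size."""
    ordered = sorted(timestamps)
    if len(ordered) < n_periods:
        raise ValueError("need at least one post per period")
    bounds = [(i * len(ordered)) // n_periods for i in range(n_periods + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(n_periods)]


# Made-up post times (e.g. days since the user's first post).
post_times = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
five_periods = lifecycle_periods(post_times, 5)
four_periods = lifecycle_periods(post_times, 4)
```

Unlike fixed calendar windows, each period here covers the same amount of activity, so a burst of posts and a long quiet spell can both map to a single lifecycle stage.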
40. User Development
- Capture period-specific user properties (in period s):
  - In-degree distribution
  - Out-degree distribution
  - Term distribution
- Enabling: Churn prediction, stage-based recommendation

[…] by people who have contacted them before and that fewer novel users appear. The same is also true for the out-degree distributions: users contact fewer new people than they did before. This is symptomatic of community platforms where despite new users arriving within the platform, users form sub-communities in which they interact and communicate with the same individuals. Figure 2(c) also demonstrates that users tend to reuse language over time and thus produce a gradually decaying cross-entropy curve.

[Figure 2: Cross Entropy against Lifecycle Stages for Facebook, SAP and Server Fault; panels (a) In-degree, (b) Out-degree, (c) Lexical. Caption (truncated): "Cross-entropies derived from comparing users’ in-degree, out-"]

Mining User Lifecycles from Online Community Platforms and their Application to Churn Prediction. M Rowe. To appear in the proceedings of the International Conference on Data Mining. Dallas, US. (2013)
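The cross-entropy comparison between lifecycle periods can be sketched as follows. The details are assumptions: additive smoothing so that unseen items do not produce log(0), base-2 logarithms, and comparing a period against the one immediately before it. The slide does not say which periods are compared or how zero counts are treated, and the term lists are made up.

```python
import math
from collections import Counter


def distribution(items, vocabulary, alpha=1.0):
    """Additively smoothed probability distribution over a fixed vocabulary."""
    counts = Counter(items)
    total = len(items) + alpha * len(vocabulary)
    return {x: (counts[x] + alpha) / total for x in vocabulary}


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x)."""
    return -sum(p[x] * math.log2(q[x]) for x in p)


# Made-up terms used by one user in two consecutive lifecycle periods.
period_1 = ["a", "a", "b", "b"]
period_2 = ["a", "a", "a", "b"]
vocabulary = set(period_1) | set(period_2)

p1 = distribution(period_1, vocabulary)
p2 = distribution(period_2, vocabulary)
divergence = cross_entropy(p2, p1)  # period 2 measured against period 1
self_entropy = cross_entropy(p2, p2)  # lower bound: the entropy of period 2
```

The same two functions apply to in-degree and out-degree distributions by replacing terms with the users who contacted, or were contacted by, the user in each period.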