1. Making More Sense Out of Social Data
Harith Alani
http://people.kmi.open.ac.uk/harith/
@halani
harith-alani
4th Workshop on Linked Science 2014 - Making Sense Out of Data (LISC2014)
ISWC 2014 - Riva del Garda, Italy
2. Topics
• Social media monitoring
• Behaviour role analysis
• Semantic sentiment
• Engagement in microblogs
• Cross-platform and topic studies
• Semantic clustering
• Application examples
3. Take home messages
• Social media has many more challenges and opportunities to offer
• Fusing semantics and statistical methods is gooood
• Studying isolated social media platforms is baaaad … or not good enough … anymore!
4. Sociograms
• Capturing and graphing social relationships
• Moreno, the founder of sociograms and sociometry
• Assessing psychological well-being from the social configurations of individuals and groups
Friendship Choices Among Fourth Graders (from Moreno, 1934, p. 38)
http://diana-jones.com/wp-content/uploads/Emotions-Mapped-by-New-Geography.pdf
5. Computational Social Science
Behaviour role analysis
"A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviours."
"What does existing sociological network theory, built mostly on a foundation of one-time 'snapshot' data, typically with only dozens of people, tell us about massively longitudinal data sets of millions of people ..?"
Original slide by Markus Strohmaier
http://gking.harvard.edu/files/LazPenAda09.pdf
6. Social semantic linking … in 2003
• Domain ontologies
• Semantics for integrating people, projects, and publications
• Identify communities of practice
• Browse evolution of social relationships and collaborations
Alani, H.; Dasmahapatra, S.; O'Hara, K. and Shadbolt, N. Identifying communities of practice through ontology network analysis. IEEE Intelligent Systems, 18(2), 2003.
7. Linking scientists … in 2005
• Who is collaborating with whom?
• How have funding programmes impacted collaborations over time?
Architecture layers: data sources; gatherers and mediators; ontology; knowledge repository (triplestore); applications
Alani, H.; Gibbins, N.; Glaser, H.; Harris, S. and Shadbolt, N. Monitoring research collaborations using semantic web technologies. ESWC, Crete, 2005.
14. Challenges and Opportunities
• Integration
– How do we represent and connect this data?
• Behaviour
– How can we measure and predict behaviour?
– Which behaviours are good/bad in which community type?
• Community Health
– What health signs should we look for?
– How can we predict this health?
• Engagement
– How can we measure and maximise engagement?
• Sentiment
– How do we measure it?
– How do we track it towards entities and contexts?
16. Semantic Web & Linked Data Technologies
Semantic Sentiment Analysis
Macro/Micro Behaviour Analysis (lurkers, initiators, followers, leaders)
Statistical Analysis
Community Engagement
[Background: screenshots of the behaviour-role analysis results detailed on slides 31-33]
19. Semantically-Interlinked Online Communities (SIOC)
• SIOC is an ontology for representing and integrating data from the social web
• Simple, concise, and popular
Still seeking the one size that'll fit all
sioc-project.org
20. SIOC for Discussion forums
• SIOC is well tailored to fit discussion forum communities
• Needs extension to fit other communities, such as microblogs and Q&A
26. Why do we monitor behaviour?
• Understand the role of people in a community
• Monitor the impact of behaviour on community evolution
• Forecast a community's future
• Learn which behaviours should be encouraged or discouraged
• Find the best mix of behaviours to increase engagement in an online community
• See which users need more support, which ones should be confined, and which ones should be promoted
28. Linking people via sensors, social media, papers, projects
• Integration of physical presence and online information
• Semantic user profile generation
• Logging of face-to-face contact
• Social network browsing
• Analysis of online vs offline social networks

<?xml version="1.0"?>
<rdf:RDF
    xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Post"/>
  <owl:Class rdf:ID="TagInfo"/>
  <owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
  <owl:Class rdf:ID="DomainCooccurrenceInfo"/>
  <owl:Class rdf:ID="UserTag"/>
  <owl:Class rdf:ID="UserCooccurrenceInfo"/>
  <owl:Class rdf:ID="Resource"/>
  <owl:Class rdf:ID="GlobalTag"/>
  <owl:Class rdf:ID="Tagger"/>
  <owl:Class rdf:ID="DomainTag"/>
  <owl:ObjectProperty rdf:ID="hasPostTag">
    <rdfs:domain rdf:resource="#TagInfo"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasDomainTag">
    <rdfs:domain rdf:resource="#UserTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="isFilteredTo">
    <rdfs:range rdf:resource="#GlobalTag"/>
    <rdfs:domain rdf:resource="#GlobalTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasResource">
    <rdfs:domain rdf:resource="#Post"/>
    <rdfs:range =…

Alani, H.; Szomszor, M.; Cattuto, C.; Van den Broeck, W.; Correndo, G. and Barrat, A. Live social semantics. ISWC, Washington, DC, 2009.
29. Online+offline social networks
[Chart: H-Index, F2F Degree, and F2F Strength per participant]
• What's your social configuration?
• What does it say about you?
• And what you'll become?
Barrat, A.; Cattuto, C.; Szomszor, M.; Van den Broeck, W. and Alani, H. Social dynamics in conferences: analyses of data from the Live Social Semantics application. ISWC, Shanghai, China, 2010.
31. Clustering for identifying emerging roles
– Map the distribution of each feature in each cluster to a level (i.e. low, mid, high)
– Align the mapping patterns with role labels
[Figure 8: Boxplots of the feature distributions in each of the 11 clusters]

Table 2: Mapping of cluster dimensions to levels
Cluster   Dispersion   Initiation   Quality   Popularity
0         L            M            H         L
1         L            L            L         L
2         M            H            L         H
3         H            H            H         H
4         L            H            H         M
5,7       H            H            L         H
6         L            H            M         M
8,9       M            H            H         H
10        L            H            M         H

Clusters with the same feature-level mapping patterns are combined (i.e. 5,7 and 8,9). The role labels were then interpreted from these clusters and their patterns:
• 0 - Focussed Expert Participant: provides high-quality answers, but only within forums that they do not deviate from; mixes asking questions with answering them.
• 1 - Focussed Novice: focussed within a few select forums but does not provide good quality content.
• 2 - Mixed Novice: a novice across a medium range of topics.
• 3 - Distributed Expert: an expert on a variety of topics who participates across many different forums.
• 4 - Focussed Expert Initiator: similar to cluster 0 in that this type of user is focussed on certain topics and is an expert on those, but to a large extent starts discussions and threads, indicating that his/her shared content is useful to the community.
• 5,7 - Distributed Novice: participates across a range of forums but is not knowledgeable on any topics.
33. Behaviour role extraction from social media data
Features: structural, social network, reciprocity, persistence, participation
• Bottom-up analysis
– Every community member is classified into a "role"
– Unknown roles might be identified
– Copes with role changes over time
Roles: lurkers, initiators, followers, leaders
Feature levels change with the dynamics of the community. Roles are associated with a collection of feature-to-level mappings, e.g. in-degree -> high, out-degree -> high. Rules are run over each user's features to derive the community's role composition (see the sketch below).
Angeletou, S.; Rowe, M. and Alani, H. Modelling and analysis of user behaviour in online communities. ISWC 2011, Bonn, Germany.
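A minimal sketch of that rule-based role assignment (Python with numpy); the feature names and role rules below are illustrative stand-ins, not the paper's exact mappings:

import numpy as np

def to_level(values, x):
    # Equal-frequency binning into low/mid/high over the community's
    # current distribution; levels shift as community dynamics change.
    low, high = np.quantile(values, [1 / 3, 2 / 3])
    return "low" if x <= low else ("high" if x > high else "mid")

# Hypothetical role rules: each role is a collection of feature-to-level
# mappings, e.g. in-degree -> high, out-degree -> high.
ROLE_RULES = {
    "leader":    {"in_degree": "high", "out_degree": "high"},
    "initiator": {"in_degree": "mid",  "out_degree": "high"},
    "follower":  {"in_degree": "mid",  "out_degree": "mid"},
    "lurker":    {"in_degree": "low",  "out_degree": "low"},
}

def assign_role(user_features, community_features):
    # Map each raw feature value to a level, then match against role rules.
    levels = {f: to_level(community_features[f], v)
              for f, v in user_features.items()}
    for role, rule in ROLE_RULES.items():
        if all(levels.get(f) == lvl for f, lvl in rule.items()):
            return role
    return "unclassified"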
34. Correlation of behaviour roles with community activity
• How do certain behaviour roles impact activity in different community types?
Examples: a forum on Commuting and Transport, a forum on Rugby, and a forum on Mobile Phones and PDAs
35. Community types
• So, do communities of different types behave differently?
• Analysed IBM Connections communities to study participation, activity, and behaviour of users
• Compare exhibited community behaviour with what users say they use the community for
– Does macro behaviour match micro needs?
36. Community types
§ Data consists of non-private info on an IBM Connections Intranet deployment
§ Content types: Community, Wiki Page, Blog Post, Forum Thread, Wiki Edit, Blog Comment, Forum Reply, Tag, Bookmark, File
§ Communities:
§ ID
§ Creation date
§ Members
§ Used applications (blogs, Wikis, forums)
§ Forums:
§ Discussion threads
§ Comments
§ Dates
§ Authors and responders
37. Community types
• Muller, M. (CHI 2012) identified five distinct community types in IBM Connections:
– Communities of Practice (CoP): for sharing information and networking
– Teams: shared goal for a particular project or client
– Technical Support: support for a specific technology
– Idea Labs: for focused brainstorming
– Recreation Communities: recreational activities unrelated to work
• Our data consisted of the 186 most active communities:
– 100 CoPs, 72 Teams, and 14 Technical Support communities
– No Idea Labs or Recreation communities
38. Behaviour roles in different community types
• Members of Team communities are more engaged, popular, and initiate more discussions
• Technical Support community members are mostly active in a few communities, and don't initiate or contribute much!
• CoP members are active across many communities, and contribute more
Rowe, M.; Fernandez, M.; Alani, H.; Ronen, I.; Hayes, C. and Karnstedt, M. Behaviour analysis across different types of Enterprise Online Communities. WebSci 2012.
39. Behaviour roles and community health
• Machine learning models to predict community health based on the compositions and evolution of user behaviour (see the sketch below)
• Health categories:
– Churn rate: proportion of community leavers in a given time segment
– User count: number of users who posted at least once
– Seeds to non-seeds ratio: proportion of posts that get responses to those that don't
– Clustering coefficient: extent to which the community forms a clique
[ROC curves (true positive rate vs false positive rate) for predicting each health category: churn rate, user count, seeds/non-seeds proportion, clustering coefficient]
The fewer Focused Experts in the community, the more posts will receive a reply!
There is no "one size fits all" model!
Rowe, M. and Alani, H. What makes communities tick? Community health analysis using role compositions. SocialCom 2012, Amsterdam, The Netherlands.
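A sketch of the general set-up (Python with scikit-learn): predict a health category from a community's role composition and inspect the ROC behaviour. The feature layout, random data, and toy label below are illustrative, not the paper's data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Each row: a community's proportions of
# [lurkers, followers, initiators, leaders, focused experts].
X = rng.dirichlet(np.ones(5), size=200)
# Toy binary label loosely echoing the slide's finding that fewer
# focused experts goes with more posts receiving a reply.
y = (X[:, 4] < 0.15).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))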
41. Semantic sentiment analysis on social media
• A range of features and statistical classifiers have been used in social media sentiment analysis in recent years
• Semantics have often been overlooked
– Semantic features
– Semantic patterns
• Semantic concepts can help determine sentiment even when no good lexical clues are present
42. Sentiment Analysis
Lexicon-based approach: look up each term in a sentiment lexicon, e.g.
hate -> negative
honest -> positive
inefficient -> negative
love -> positive
…
"I really love the iPhone" / "I hate the iPhone"
Machine learning approach: learn a model (Naïve Bayes, SVM, MaxEnt, etc.) from a training set, then apply the model to a test set (see the sketch below).
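A minimal sketch of the lexicon-based approach (Python); the lexicon entries are the slide's own examples:

# Sum the polarity of any lexicon terms found in the text.
LEXICON = {"hate": -1, "honest": +1, "inefficient": -1, "love": +1}

def lexicon_sentiment(text):
    score = sum(LEXICON.get(tok, 0) for tok in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("I really love the iPhone"))  # positive
print(lexicon_sentiment("I hate the iPhone"))         # negative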
43. Semantic Concept Extraction
• Extract semantic concepts from tweet data and incorporate them into the supervised classifier training (sketched below).
• Prior work [10] compared AlchemyAPI, OpenCalais, and Zemanta, and found AlchemyAPI best for entity extraction and semantic concept mapping. Our datasets consist of informal tweets, and hence are intrinsically different from those used in [10]. We therefore conducted our own evaluation: 500 tweets were randomly selected from the STS corpus, and 3 evaluators assessed the semantic concept extraction outputs generated by AlchemyAPI, OpenCalais, and Zemanta.

Table 2. Evaluation results of AlchemyAPI, Zemanta and OpenCalais.
                  No. of Concepts   Entity-Concept Mapping Accuracy (%)
Extraction Tool   Extracted         Evaluator 1   Evaluator 2   Evaluator 3
AlchemyAPI        108               73.97         73.8          72.8
Zemanta           70                71            71.8          70.4
OpenCalais        65                68            69.1          68.7

• The assessment was based on (1) the correctness of the extracted entities, and (2) the correctness of the entity-concept mappings. As Table 2 shows, AlchemyAPI extracted the largest number of concepts and had the highest entity-concept mapping accuracy, so we chose it to extract the semantic concepts from our three datasets. Table 3 lists the total number of entities extracted and the number of semantic concepts mapped against them for each dataset.

Table 3. Entity/concept extraction statistics of STS, OMD and HCR using AlchemyAPI.
                  STS     HCR   OMD
No. of Entities   15139   723   1194
No. of Concepts   29      17    14
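One simple way to incorporate such concepts, sketched in Python with scikit-learn: append the semantic concept of each recognised entity to the tweet before vectorisation, so the classifier can generalise across entities of the same concept. The entity-to-concept map here is a hypothetical stand-in for extractor output, and the tiny corpus is invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical entity -> concept mapping (stand-in for AlchemyAPI output).
ENTITY_CONCEPT = {"iphone": "CONCEPT_Product", "obama": "CONCEPT_Person"}

def add_concepts(tweet):
    # Append a concept token for every recognised entity in the tweet.
    toks = tweet.lower().split()
    return " ".join(toks + [ENTITY_CONCEPT[t] for t in toks if t in ENTITY_CONCEPT])

tweets = ["I love my iPhone", "Obama gave a great speech",
          "I hate my iPhone", "Obama was awful today"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (toy labels)

vec = CountVectorizer()
X = vec.fit_transform([add_concepts(t) for t in tweets])
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform([add_concepts("my iPhone is great")])))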
44. Impact of adding semantic features
• Incorporating semantics increases accuracy against the baseline by:
– 6.5% for negative sentiment
– 4.8% for positive sentiment
– F1 = 75.95%, with 77.18% Precision and 75.33% Recall
• OK, but what about cases such as "Destroy Invading Germs" (a negative word followed by a negative concept)?
• Can semantics help?
Saif, H.; He, Y. and Alani, H. Semantic sentiment analysis of Twitter. ISWC 2012, Boston, US.
45. Semantic Pattern Approaches
• Apply syntactic and semantic processing techniques
• Use external semantic resources (e.g. DBpedia, Freebase) to identify semantic concepts in tweets
• Extract clusters of similar contextual semantics and sentiment, and use them as patterns in sentiment analysis (sketched below)
Example cluster: Threat, Trojan Horse, Hack, Code, Program, Malware, Dangerous, Harm, Spyware
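A rough sketch of the clustering idea (Python with scikit-learn): group words by their contextual co-occurrence vectors so that terms appearing in similar contexts (e.g. "trojan", "malware", "spyware") fall into one cluster, which can then serve as a sentiment-bearing pattern. The tiny corpus is invented, and this is only an approximation of the paper's method:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

docs = ["trojan horse malware threat", "spyware malware harm",
        "program code hack", "sunny beach holiday", "beach holiday fun"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Word-by-word co-occurrence counts across the corpus; each row is a
# word's contextual profile.
cooc = (X.T @ X).toarray()
np.fill_diagonal(cooc, 0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cooc)
for word, cluster in zip(vec.get_feature_names_out(), km.labels_):
    print(cluster, word)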
46. Tweet-Level Sentiment Analysis Features
Win/loss in accuracy and F-measure of using different features for sentiment classification on all nine Twitter datasets (MaxEnt classifier):

                               Accuracy                        F-Measure
Feature             Minimum   Maximum   Average     Minimum   Maximum   Average
Syntactic
  Twitter Features  -0.23     3.91      1.24        -0.25     4.53      1.62
  POS               -0.89     2.92      0.79        -0.91     5.67      1.25
  Lexicon           -0.44     4.23      1.30        -0.38     5.81      1.83
  Average           -0.52     3.69      1.11        -0.52     5.33      1.57
Semantic
  Concepts          -0.22     2.76      1.20        -0.40     4.80      1.51
  LDA-Topics        -0.47     3.37      1.20        -0.68     6.05      1.68
  SS-Patterns        0.70     9.87      3.05         1.23     9.78      3.76
  Average            0.00     5.33      1.82         0.05     6.88      2.32

Saif, H.; He, Y.; Fernandez, M. and Alani, H. Semantic Patterns for Sentiment Analysis of Twitter. ISWC 2014, Trento, Italy.
47. Entity-Level Sentiment Analysis
• Gold standard of 58 entities (from STS-Gold, the only one of the nine datasets that provides named entities manually annotated with sentiment labels: positive, negative, neutral)
[Bar chart: accuracy and F1 (range roughly 55-67%) for Unigrams, LDA-Topics, Semantic Concepts, and SS-Patterns]
Saif, H.; He, Y.; Fernandez, M. and Alani, H. Semantic Patterns for Sentiment Analysis of Twitter. ISWC 2014, Trento, Italy.
56. Ask the (Social) Data
• What's the model of good/bad tweets?
• What features are associated with each group?
57. Feature Engineering
Popularity is in the short term influenced by external factors. Properties influencing popularity include user attributes (describing the reputation of the user) and attributes of a post's content (generally referred to as content features). In Table 1 we define user and content features and study their influence on discussion "continuation".

Table 1. User and Content Features

User Features
- In Degree: number of followers of U
- Out Degree: number of users U follows
- List Degree: number of lists U appears on (lists group users by topic)
- Post Count: total number of posts the user has ever posted
- User Age: number of minutes from the user's join date
- Post Rate: posting frequency of the user: $\mathit{PostCount}/\mathit{UserAge}$

Content Features
- Post Length: length of the post in characters
- Complexity: cumulative entropy of the $n$ unique words in post $p$ of total word length $\lambda$, with $p_i$ the frequency of each word: $\frac{1}{\lambda}\sum_{i\in[1,n]} p_i(\log\lambda - \log p_i)$
- Uppercase Count: number of uppercase words
- Readability: Gunning fog index [7] using average sentence length (ASL) and the percentage of complex words (PCW): $0.4(\mathit{ASL} + \mathit{PCW})$
- Verb Count: number of verbs
- Noun Count: number of nouns
- Adjective Count: number of adjectives
- Referral Count: number of @user mentions
- Time in the day: normalised time in the day, measured in minutes
- Informativeness: terminological novelty of the post w.r.t. other posts, as the cumulative tf-idf value of each term $t$ in post $p$: $\sum_{t\in p}\mathit{tfidf}(t, p)$
- Polarity: cumulation of polar term weights in $p$ (using the SentiWordNet3 lexicon), normalised by the polar term count: $(Po + Ne)/|\mathit{terms}|$

(Two of these content features are sketched in code after this slide.)

• Focus Features
– Topic Entropy: the distribution of the author across community forums
– Topic Likelihood: the likelihood that a user posts in a specific forum given their post history
• Measures the affinity that a user has with a given forum
• Lower likelihood indicates a user posting on an unfamiliar topic

4.2 Experiments
Experiments are intended to test the performance of different classification models in identifying seed posts. We therefore used four classifiers: the discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes, and the decision-tree classifier J48. For each classifier we used three feature settings: user features, content features, and user+content features.
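A minimal sketch (Python) of two of the content features as reconstructed from Table 1: the word-entropy Complexity measure and the Gunning fog Readability index. Tokenisation and the complex-word test are simplifications:

import math
from collections import Counter

def complexity(post):
    # (1/lambda) * sum_i p_i * (log(lambda) - log(p_i)),
    # with lambda the total word count and p_i each word's frequency.
    words = post.lower().split()
    lam = len(words)
    counts = Counter(words)
    return sum(p * (math.log(lam) - math.log(p)) for p in counts.values()) / lam

def gunning_fog(post):
    # 0.4 * (average sentence length + percentage of complex words);
    # "complex word" is approximated here by length >= 9 characters.
    sentences = [s for s in post.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = post.split()
    asl = len(words) / max(len(sentences), 1)
    pcw = 100 * sum(len(w) >= 9 for w in words) / max(len(words), 1)
    return 0.4 * (asl + pcw)

print(complexity("the cat sat on the mat"))
print(gunning_fog("The cat sat. It contemplated considerable complications."))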
58. Classification of Posts
Seed posts vs non-seed posts:
§ Binary classification model (sketched below)
§ Trained with social, content, and combined features
§ 80/20 training/testing split
§ Identify the best feature types, and the top individual features, for predicting post classification
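A sketch of the set-up (Python with scikit-learn): an 80/20 split and one of the four classifiers listed above (SVM). The feature matrix here is random stand-in data; in practice X would hold the user+content feature vectors:

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 12))        # stand-in feature matrix
y = rng.integers(0, 2, size=1000)      # stand-in seed / non-seed labels

# 80/20 training/testing split, as on the slide.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
clf = LinearSVC().fit(X_tr, y_tr)
print(f1_score(y_te, clf.predict(X_te)))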
59. Engagement on Boards.ie
• Which posts are more likely to stimulate responses and discussions?
• What impacts engagement more: user features, post content, or forum affinity?
• Which individual features are most influential?
60. Top Features for Engagement on Boards.ie
• Content features were key!
• Best predictions were achieved when combining user, content, and focus features
• URLs (Referral Count) in a post negatively impact discussion activity
• Seed posts (posts that receive replies) are associated with greater activity levels
• Lower informativeness is associated with seed posts, i.e. seeds use language that is familiar to the community
Boards.ie does not provide explicit social relations between community members, unlike for example Facebook and Twitter. We followed the strategy proposed in [3] for extracting social networks from Digg, and built the Boards.ie social network for users, weighting edges cumulatively by the number of replies between any two users.

Table I. Description of the Boards.ie dataset
Posts       Seeds    Non-Seeds   Replies     Users
1,942,030   90,765   21,800      1,829,465   29,908

Rowe, M.; Angeletou, S. and Alani, H. Anticipating discussion activity on community forums. SocialCom 2011, Boston, MA, USA.
61. Top Features for Engagement on Twitter
Datasets: the Haiti Earthquake dataset contains tweets relating to the Haiti earthquake disaster, covering a varying timespan; the State of the Union dataset contains all tweets published during president Barack Obama's State of the Union Address. Our goal is to predict discussion activity based on the features of a given post, by first identifying seed posts and then predicting the discussion level.

Within these datasets many posts are not seeds but replies to previous posts, featuring in the discussion chain as nodes. In [13] retweets are considered part of discussion activity; in our work we identify discussions using the explicit "in reply to" information obtained from the Twitter API, which does not include retweets. This decision is based on the work of boyd et al. [4], who analyse retweeting as a discussion practice and argue that message forwards adhere to different motives which do not necessarily designate a response to the initial message. We therefore only investigate explicit replies. To gather our discussions and seed posts, we iteratively move up the reply chain, i.e. from reply to parent post, until we reach the seed post. We call this process dataset enrichment; it is performed by querying Twitter's REST API using the "in reply to" id of the parent post, moving one step at a time up the reply chain. This approach was employed successfully in [12] to gather a large-scale conversation dataset from Twitter.

Table 2. Statistics of the datasets used for experiments
Dataset         Users    Tweets   Seeds   Non-Seeds   Replies
Haiti           44,497   65,022   1,405   60,686      2,931
Union Address   66,300   80,272   7,228   55,169      17,875

• The top-most ranks from each dataset are dominated by user features
• Haiti Earthquake: top features are list-degree, in-degree, informativeness, and #posts
• State of the Union Address: top features are list-degree, time of posting, in-degree, and #posts
[Figure 3: contributions of the top-5 features to identifying non-seeds; upper plots for the Haiti dataset, lower plots for the Union Address dataset]

Rowe, M.; Angeletou, S. and Alani, H. Predicting Discussions on the Social Semantic Web. ESWC, Crete, 2011.
62. Top Features for Engagement on Twitter - Earth Hour 2014
[Boxplots comparing posts with (pos) and without (neg) replies on four features: Length, Complexity, Readability, Polarity]
• The top influential features do not match those found for Boards.ie or for the two non-random Twitter datasets
63. Top Features for Engagement on Twitter - Dorset Police
[Boxplots comparing posts with (pos) and without (neg) replies on four features: Length, Complexity, Polarity, Mentions]
• The top 4 features share 3 with the Twitter Earth Hour dataset
Fernandez, M.; Cano, E. and Alani, H. Policing Engagement via Social Media. CityLabs workshop, SocInfo, Barcelona, 2014.
64.
65. Publications about social media
by Katrin Weller - http://kwelle.files.wordpress.com/2014/04/figure1.jpg
66. Moving on …
§ How can we move on from these (micro) studies?
§ Are results consistent across datasets and platforms?
§ One way forward:
§ Multiple platforms
§ Multiple topics
71. Apples and Oranges
• We mix and compare different datasets, topics, and platforms
• The aim is to test consistency and transferability of results
72. 7 datasets from 5 platforms

Platform                                     Posts       Users     Seeds     Non-seeds   Replies
Boards.ie                                    6,120,008   65,528    398,508   81,273      5,640,227
Twitter Random                               1,468,766   753,722   144,709   930,262     390,795
Twitter (Haiti Earthquake)                   65,022      45,238    1,835     60,686      2,501
Twitter (Obama State of the Union Address)   81,458      67,417    11,298    56,135      14,025
SAP                                          427,221     32,926    87,542    7,276       332,403
Server Fault                                 234,790     33,285    65,515    6,447       162,828
Facebook                                     118,432     4,745     15,296    8,123       95,013

Seed posts are those that receive a reply; non-seed posts are those with no replies.
73. Data Balancing

Platform                                     Seeds     Non-seeds   Instance Count
Boards.ie                                    398,508   81,273      162,546
Twitter Random                               144,709   930,262     289,418
Twitter (Haiti Earthquake)                   1,835     60,686      3,670
Twitter (Obama State of the Union Address)   11,298    56,135      22,596
SAP                                          87,542    7,276       14,552
Server Fault                                 65,515    6,447       12,894
Facebook                                     15,296    8,123       16,246
Total                                                              521,922

For each dataset, an equal number of seed and non-seed posts is used in the analysis (see the sketch below).
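A minimal sketch of this balancing step (Python with pandas): undersample the majority class so seeds and non-seeds contribute equally, giving an instance count of twice the minority-class size, as in the table above. The column name is a placeholder:

import pandas as pd

def balance(df, label_col="is_seed", seed=0):
    # Sample min-class-size rows from each class (undersampling the
    # majority class); instance count = 2 x minority class size.
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n, random_state=seed)))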
74. Classification Results
§ Performance of the logistic regression classifier trained over different feature sets (Social, Content, Social+Content) and applied to the test set, reported as P / R / F1. The slide shows one table per dataset; the three Twitter datasets are labelled (Random), (Haiti Earthquake), and (Obama's State of the Union Address).

Feature          P       R       F1
Social           0.592   0.591   0.591
Content          0.664   0.660   0.658
Social+Content   0.670   0.666   0.665

(Random) (Haiti Earthquake) (Obama's State of the Union Address)

Feature          P       R       F1
Social           0.561   0.561   0.560
Content          0.612   0.612   0.611
Social+Content   0.628   0.628   0.628

Feature          P       R       F1
Social           0.968   0.966   0.966
Content          0.752   0.747   0.747
Social+Content   0.974   0.973   0.973

Feature          P       R       F1
Social           0.542   0.540   0.539
Content          0.650   0.642   0.639
Social+Content   0.656   0.649   0.646

Feature          P       R       F1
Social           0.650   0.631   0.628
Content          0.575   0.541   0.521
Social+Content   0.652   0.632   0.629

Feature          P       R       F1
Social           0.528   0.380   0.319
Content          0.626   0.380   0.275
Social+Content   0.568   0.407   0.359

Feature          P       R       F1
Social           0.635   0.632   0.632
Content          0.641   0.641   0.641
Social+Content   0.660   0.660   0.660
75. Effect of features on engagement
[Bar charts: logistic regression coefficients (β) of each feature per platform (Boards.ie, Twitter Random, Twitter Haiti, Twitter Union, Server Fault, SAP, Facebook) for the features In-degree, Out-degree, Post Count, Age, Post Rate, Post Length, Referrals Count, Polarity, Complexity, Readability, Readability Fog, Informativeness]
76. Comparison to literature
§ How does the performance of our shared features compare to other studies on different datasets and platforms?
80. Semantic Clustering
• Statistical models play important roles in social data analyses
• Keeping such models up to date often means regular, expensive, and time-consuming retraining
• Semantic features are likely to decay more slowly than lexical features
• Could adding semantics to the models extend their value and life expectancy? (See the sketch below.)
Cano, E.; He, Y. and Alani, H. Stretching the Life of Twitter Classifiers with Time-Stamped Semantic Graphs. ISWC 2014, Trento, Italy.
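A minimal sketch of the intuition (Python): abstract lexical features into semantic classes so a model trained on one epoch can still fire on later epochs where the surface vocabulary has drifted. The entity-to-class map below is a hypothetical stand-in for a lookup against a DBpedia snapshot:

# Hypothetical entity -> DBpedia class map (stand-in for a snapshot lookup).
ENTITY_CLASS = {
    "obama":   "dbo:OfficeHolder",
    "mubarak": "dbo:OfficeHolder",
    "cnn":     "dbo:Broadcaster",
}

def semantic_features(tweet):
    # Replace recognised entities with their semantic class.
    return [ENTITY_CLASS.get(t, t) for t in tweet.lower().split()]

# Tweets from different epochs mention different entities but map to the
# same class feature, so an older model can still recognise the pattern.
print(semantic_features("Obama addresses the nation"))
print(semantic_features("Mubarak addresses the nation"))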
81. Semantic Representation of a Tweet
[Graph of DBpedia triples enriching a tweet, e.g.:
<dbp:Barack_Obama> rdf:type <dbo:PresidentOfUnitedStateofAmerica>
<dbp:Barack_Obama> dcterms:subject <skos:Nobel_Peace_Price_laureates>
<dbp:Barack_Obama> dbprop:nationality "American"
<dbp:Hosni_Mubarak> dcterms:subject <skos:PresidentsOfEgypt>
<dbp:CNN> dcterms:subject <skos:English-language_television_stations>
<dbp:Egypt> dbprop:languages <dbp:Egyptian_Arabic>
<dbp:Egypt> dcterms:subject <skos:Arab_republics>
<dbp:Egypt> rdf:type <dbp:Country>]
82. Evolution of Semantics
• Renewed DBpedia graph snapshots are taken over time (v3.6, v3.7, v3.8)
• Semantic features are updated based on new knowledge in DBpedia
[Graph: <Barack_Obama> gains edges across snapshots, e.g. wikiPageWikiLink to <Budget_Control_Act_of_2011> and <UnitedStatesPresidentialCandidates>, birthPlace <Hawaii>, spouse <Michelle_Obama>]
83. Experiments
Extending the fitness of a model to succeeding epochs:
• 12,000 annotated tweets
• Adding Classes as clustering features provides the best performance

Cross-epoch (F1)
           2010-2011   2010-2013   2011-2013   Average
BoW        0.634       0.481       0.261       0.458
Category   0.683       0.539       0.524       0.582
Property   0.665       0.557       0.502       0.603
Resource   0.774       0.544       0.445       0.587
Class      0.691       0.665       0.669       0.675

Same-epoch (F1)
           2010-2010   2011-2011   Average
BoW        0.831       0.875       0.845
85. What do policymakers really want from social media?
1. "Fish where the fish is"
– one interface to access multiple SNS
– layman monitoring of users and topics
2. "My constituency first"
– communicating with users in their own constituency
– find local groups, events, and topics
3. "What are their needs, complaints, and preferences?"
– what citizens talk about, complain about
– what are the top 5-10 topics of the day
4. "Who should I talk to?"
– who are the influential citizens
– whom to engage with
5. "What about tomorrow?"
– which topics will get hotter?
– which discussions are likely to grow further?
6. "Presence and popularity"
– what writing recipe to follow to reach more people
7. "Privacy"
– concerns about citizens' privacy when extracting info
– concerns about their own privacy with 3rd-party SNS access tools
Based on interviews with 31 policymakers.
86. Wandhöfer, T.; Taylor, S.; Alani, H.; Joshi, S.; Sizov, S. et al. Engaging politicians with citizens on social networking sites: the WeGov Toolbox. IJEGR, 8(3), 2012.
87. Monitoring SCN
Monitoring the evolution of community activities and the level of contributions in SAP Community Networks (SCN). Demo.
88. SCN Behaviour
Community managers can monitor the behaviour composition of forums, and its association with activity evolution.
91. Course tutors
Real-time monitoring: behaviour analysis, sentiment analysis, topic analysis
• How active and engaged is the course group?
• How is sentiment towards a course evolving?
• Are the leaders of the group providing positive/negative comments?
• What topics are emerging?
• Is the group flourishing or diminishing?
• Do students get the answers and support they need?
Thomas, K.; Fernández, M.; Brown, S. and Alani, H. OUSocial2: a platform for gathering students' feedback from social media (Demo). ISWC 2014, Trento, Italy.
94. Thanks to ..
Hassan Saif, Lara Piccolo, Thomas Dickensen, Gregoire Burel, Miriam Fernandez, Smitashree Choudhury, Elizabeth Cano, Matthew Rowe, Keerthi Thomas, Sofia Angeletou
95. Heads-up
Semantic Patterns for Sentiment Analysis of Twitter
Thursday 15:40 - Session: Social Media
Semantic Patterns for Sentiment Analysis of Twitter
Thursday 16:00 - Session: Social Media
User Profile Modeling in Online Communities
Sunday 2:05 pm - SWCS Workshop
OUSocial2: a platform for gathering students' feedback from social media (DEMO)
The Topics they are a-Changing - Characterising Topics with Time-Stamped Semantic Graphs (POSTER)
Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter (POSTER)