Conor Hayes - Topics, tags and trends in the blogosphere
1. WebCamp 07 Social Networks
Topics, Tags and Trends in the Blogosphere
Conor Hayes
DERI, Galway
Outline
Blogs and the Blogosphere
Linking Blogs - Content vs. Tags
Topics and Bloggers
User entropy
Topic drift
Blog reactivity
Identifying consistent, topic-relevant blogs
Blogs
Web site with journal style entries
Dated in reverse chronological order
Generally written by a single user
Regularly updated
Distributed publishing
Easy to maintain and update
Exponential growth:
Technorati: 50 million blogs (July 2006), doubling in size
every 6 months
Increasingly important indicator of public opinion on
politics, technology, current affairs
Blogs vs. Usenet
Blogosphere is user-centred
Distributed architecture
Topic organisation is locally defined by tags
Not easy to find relevant posts related to the same topic
Usenet is topic-centred
Logically centralised architecture
Topic organisation is a priori defined by newsgroup
heading, and subject headers
Users know where to go to find information on a particular
topic
A topic-centred blogosphere?
Semantic Web
Link blogs using machine-readable metadata: SIOC
Tagging
Tags: simple propositional entities, locally defined
Link analysis:
Majority of blogs have little or no inward connections
Blog Roll, Comment List
Relatively static group
‘Conventional’ Knowledge discovery techniques
Clustering + online recommender systems
Nearest Neighbour Recommender
Method:
1. Periodically, identify a set of nearest neighbour
blogs – one set for each topic the user is
interested in.
2. Select matching posts from these neighbours
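The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: blogs and topics are represented as bag-of-words term counts, similarity is cosine, and the names `nearest_neighbours`, `recommend`, the `threshold` parameter and all toy data are hypothetical rather than taken from the talk.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbours(topic_profile, blog_profiles, k=2):
    """Step 1: rank candidate blogs by similarity to one topic profile."""
    ranked = sorted(blog_profiles,
                    key=lambda blog: cosine(topic_profile, blog_profiles[blog]),
                    reverse=True)
    return ranked[:k]


def recommend(topic_profile, neighbours, posts, threshold=0.3):
    """Step 2: keep the neighbours' posts that match the topic profile."""
    return [(blog, text) for blog, text in posts
            if blog in neighbours
            and cosine(topic_profile, Counter(text.lower().split())) >= threshold]


# Hypothetical toy data: one topic of interest and three candidate blogs.
topic = Counter("mobile web patent".split())
blogs = {
    "blog_a": Counter("mobile web mobile patent internet".split()),
    "blog_b": Counter("church community faith".split()),
    "blog_c": Counter("patent law mobile".split()),
}
posts = [
    ("blog_a", "new mobile web patent filed"),
    ("blog_a", "holiday photos"),
    ("blog_b", "mobile church web"),
]
neighbours = nearest_neighbours(topic, blogs, k=2)
matches = recommend(topic, neighbours, posts)
```

Because the neighbour set is recomputed only periodically, any drift in what the neighbours write about between recomputations directly affects the quality of the selected posts.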
What are the implications of user drift?
How quickly does the neighbourhood set change?
What is the relationship between users and
topics?
How consistently are bloggers attached to topics?
Which neighbours consistently provide the most
topic-relevant information?
Experiments
We cluster blog data over different time periods
User entropy: measures whether bloggers remain
together over time
Topic drift: measures blogger behaviour in relation to
topic growth and drift
We identify the most relevant blogs in each cluster
using tag analysis
Data
We collected blog data from Jan 16 to Feb 27, 2006
7200 blogs in total
We created 6 data sets, one for each week
mean of 4250 blogs per week
70% overlap between consecutive weeks
Each instance in each data set contains the posts
from a single tag, from a single blogger
An instance is included in the data set for a
week only if the user has posted in that week
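The instance definition above can be illustrated with a small grouping routine. This is a sketch, not the authors' pipeline: the tuple layout of `posts`, the function name `weekly_instances` and the sample data are assumptions made for the example; only the start date comes from the slide.

```python
from collections import defaultdict
from datetime import date

START = date(2006, 1, 16)  # first day of the collection period


def weekly_instances(posts):
    """Group posts into one instance per (blogger, tag) within each week.

    A (blogger, tag) pair appears in a week's data set only if that
    blogger actually posted under that tag during the week.
    """
    datasets = defaultdict(lambda: defaultdict(list))
    for blogger, tag, day, text in posts:
        week = (day - START).days // 7  # 0-based week index
        datasets[week][(blogger, tag)].append(text)
    return datasets


# Hypothetical posts: (blogger, tag, date, text)
posts = [
    ("alice", "mobile", date(2006, 1, 17), "post one"),
    ("alice", "mobile", date(2006, 1, 19), "post two"),
    ("alice", "law", date(2006, 1, 25), "post three"),
    ("bob", "music", date(2006, 1, 18), "post four"),
]
datasets = weekly_instances(posts)
```

Here `datasets[0]` holds two instances and `datasets[1]` holds one; bob has no instance in the second week because he did not post in it.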
Clustering
Goals: Uncover latent structures reflecting topics in
the collection and provide a means of summarisation
Spherical k-means: partitions the document corpus into
k disjoint groups of documents
Produces interpretable concept summary for each
group
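For readers unfamiliar with spherical k-means, the following minimal sketch shows the general technique: documents are scaled to unit length, assigned to the centroid with the highest cosine similarity, and each centroid is the normalised sum of its members. The fixed iteration count, the explicit `seeds` argument, the `concept_summary` helper and the toy vocabulary are assumptions for illustration, not details from the talk.

```python
import math


def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, seeds, iterations=10):
    """Partition docs into len(seeds) disjoint groups by cosine similarity."""
    docs = [normalise(d) for d in docs]
    centroids = [normalise(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment: each unit-length document joins its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda c: dot(d, centroids[c]))
                  for d in docs]
        # Update: each centroid becomes the normalised sum of its members.
        for c in range(len(centroids)):
            members = [d for d, label in zip(docs, labels) if label == c]
            if members:
                centroids[c] = normalise([sum(col) for col in zip(*members)])
    return labels, centroids


def concept_summary(centroid, vocabulary, n=2):
    """Describe a cluster by the n highest-weighted terms of its centroid."""
    ranked = sorted(zip(centroid, vocabulary), reverse=True)
    return [term for _, term in ranked[:n]]


# Hypothetical term-count vectors over a three-word vocabulary.
vocabulary = ["mobile", "web", "music"]
docs = [[3, 1, 0], [2, 1, 0], [0, 1, 3], [0, 2, 2]]
labels, centroids = spherical_kmeans(docs, seeds=[[1, 0, 0], [0, 0, 1]])
summaries = [concept_summary(c, vocabulary) for c in centroids]
```

Passing the seeds in explicitly, rather than choosing them at random inside the function, makes it straightforward to start one clustering from the centroids of another.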
Clustering quality:
Blogs in the same cluster should
be similar;
Blogs in different clusters should
be dissimilar.
Hr: ratio of intra- to inter-cluster similarity
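The slide does not spell out how the two similarities in Hr are computed, so the sketch below makes its own assumptions: intra-cluster similarity is taken as the mean cosine between a cluster's members and its centroid, and inter-cluster similarity as the cosine between the cluster centroid and the centroid of the whole collection. Other definitions are possible, and the function names and data are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-length centroid: the normalised sum of the vectors."""
    return unit([sum(col) for col in zip(*vectors)])


def cluster_quality(clusters):
    """Return one intra/inter similarity ratio per cluster (assumed form of Hr)."""
    clusters = [[unit(v) for v in members] for members in clusters]
    global_centroid = centroid([v for members in clusters for v in members])
    ratios = []
    for members in clusters:
        c = centroid(members)
        intra = sum(cosine(v, c) for v in members) / len(members)
        inter = cosine(c, global_centroid)
        ratios.append(intra / inter if inter else float("inf"))
    return ratios


# Hypothetical data: a tight cluster followed by a loose one.
tight = [[1, 0, 0], [1, 0.1, 0]]
loose = [[0, 1, 0], [0, 0, 1]]
ratios = cluster_quality([tight, loose])
```

Under these assumptions a higher ratio indicates a cluster whose members agree closely with each other while standing apart from the rest of the collection.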
Methodology
We cluster each data set in date order at different k
We reuse the cluster centroids in window t to seed the
clusters in window t+1
User Drift
We define User Entropy: a measure of the degree
of user dispersion between windows
For the users of a cluster r at $win_t$, observed again at $win_{t+n}$:
$q$: number of clusters at $win_{t+n}$ containing users from cluster r
$n_r^i$: number of users from cluster r contained in cluster i at $win_{t+n}$
$n_r$: number of users from cluster r available at $win_{t+n}$
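The formula itself did not survive in this transcript. The quantities defined above are the ingredients of a Shannon entropy over the fractions of cluster r's users that land in each later cluster, so the sketch below assumes that form, without any normalisation; the original may differ. The function name and the sample assignments are hypothetical.

```python
import math
from collections import Counter


def user_entropy(users_in_r, later_assignment):
    """Entropy of the spread of cluster r's users over clusters in a later window.

    users_in_r: users belonging to cluster r in the earlier window.
    later_assignment: mapping from user to cluster id in the later window.
    """
    # Only users still available in the later window count towards n_r.
    available = [u for u in users_in_r if u in later_assignment]
    n_r = len(available)
    if n_r == 0:
        return 0.0
    # counts[i] plays the role of n_r^i; len(counts) plays the role of q.
    counts = Counter(later_assignment[u] for u in available)
    return sum(-(n / n_r) * math.log(n / n_r) for n in counts.values())


# Hypothetical example: u5 did not post in the later window.
cluster_r = ["u1", "u2", "u3", "u4", "u5"]
split = user_entropy(cluster_r, {"u1": "A", "u2": "A", "u3": "B", "u4": "B"})
together = user_entropy(cluster_r, {"u1": "A", "u2": "A", "u3": "A", "u4": "A"})
```

With this form, users who all remain in one cluster give an entropy of zero, and the value grows as they scatter more evenly over more clusters.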
Proportion of Users
[Figure legend: one series shows the mean fraction of the dataset contained in the top 20% of clusters, the other the mean fraction contained in the bottom 20% of clusters]
User Drift vs. Cluster Strength
[Figure: mean Pearson correlation between Hr and Ur (y-axis, 0 to -0.9), plotted against k (x-axis, 0 to 100)]
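The correlation reported in this figure is Pearson's r between cluster strength (Hr) and user entropy (Ur). As a reminder of how that statistic is computed for one clustering, here is a minimal sketch; the per-cluster values below are invented for illustration and are not results from the talk.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-cluster values for a single clustering at some k.
h = [1.1, 1.4, 1.9, 2.5]  # cluster strength, one value per cluster
u = [0.9, 0.8, 0.5, 0.2]  # user entropy, one value per cluster
r = pearson_r(h, u)
```

A negative r means that stronger clusters tend to show lower user entropy; averaging r over the windows at each k would give one point per k on a plot like the one described.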
User Drift Conclusions
Even where the interval n between windows is low, user dispersion occurs
As n increases, user entropy also increases,
suggesting that the ‘relationship’ between users
based on shared topics is short-lived
User dispersion is related to cluster strength
Strong clusters experience less user drift than weak
clusters
However, the fraction of data from strong clusters is
smaller than the fraction from weak clusters, by at
least a factor of 2
We will return to user entropy later in the talk
Topic drift
The inter-window similarity $W_r^{t+n}$ for a cluster r at $win_t$ is the similarity between the
centroid of cluster r and the centroid of the
corresponding cluster r at $win_{t+n}$:
$W_r^{t+n} = \cos(C_{r,t}, C_{r,t+n})$
Intuitively, $W_r^{t+n}$ is a measure of the drift of the
concept centroid, $C_r$, from $win_t$ to $win_{t+n}$
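The cosine between a cluster's centroids in two windows can be computed directly. The sketch below assumes centroids are stored as sparse term-weight dictionaries; the function name and the example weights are hypothetical.

```python
import math


def inter_window_similarity(centroid_t, centroid_t_n):
    """Cosine similarity between one cluster's centroids in two windows."""
    dot = sum(w * centroid_t_n.get(term, 0.0) for term, w in centroid_t.items())
    norm_a = math.sqrt(sum(w * w for w in centroid_t.values()))
    norm_b = math.sqrt(sum(w * w for w in centroid_t_n.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical centroids for the same cluster in an earlier and a later window.
earlier = {"mobile": 0.8, "web": 0.6}
later = {"mobile": 0.6, "patent": 0.8}
w_drifted = inter_window_similarity(earlier, later)
w_stable = inter_window_similarity(earlier, earlier)
```

Values near 1 indicate a stable concept; lower values indicate that the cluster's vocabulary has shifted between the two windows.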
Topic Drift vs. User Drift
[Figure: mean Pearson correlation between Wr and Ur (y-axis, 0 to -0.9), plotted against k (x-axis, 0 to 100)]
A Model of Topic Drift
Topic drift is related to user
drift, but topics may be
more stable than users
By observation: rate of
topic change is less than
rate of user drift
Our analysis so far would
suggest that users are not
firmly fixed to topics, rather
they drift between topics
over time.
[Figure labels: Type a = {X, Y, Z}; Type b = {P, Q, R}]
Observations
The blogosphere responds quickly to breaking news
stories
The relationship between topic and user drift is
pronounced where topic drift is extreme
Otherwise, there is steady turnover of users around
relatively stable concepts
Users ‘float’ between topics
Tag clouds
In previous work we showed that tags perform badly
at grouping similar blog posts together
Clustering followed by tag analysis allowed us to
determine which clusters contain strong concepts
It also allowed us to fragment the global tag space to
produce local tag clouds
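One simple way to picture a local tag cloud is as a per-cluster count of the tags attached to the instances in that cluster. The sketch below assumes each instance is a (blogger, tag) pair with a cluster label from a prior clustering step; the names and data are hypothetical and the talk's actual tag analysis may weight tags differently.

```python
from collections import Counter, defaultdict


def local_tag_clouds(instances, labels):
    """Count tag usage separately within each cluster."""
    clouds = defaultdict(Counter)
    for (blogger, tag), cluster in zip(instances, labels):
        clouds[cluster][tag] += 1  # one count per (blogger, tag) instance
    return clouds


# Hypothetical instances and their cluster labels.
instances = [("alice", "mobile"), ("bob", "mobile"),
             ("carol", "patent"), ("dave", "music")]
labels = [0, 0, 0, 1]
clouds = local_tag_clouds(instances, labels)
top_tag = clouds[0].most_common(1)
```

Tags used by several bloggers in the same cluster rise to the top of that cluster's cloud, while tags used by a single blogger stay in the tail.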
A-bloggers are
1. more similar to each other than c-bloggers
2. more similar to the cluster centroid (topic definition)
than c-bloggers
3. more similar to pages retrieved from Google using
the topic description
Entropy: a-blogs vs c-blogs
A-blog entropy is lower
As the interval increases, a-blogs experience smaller
increases in entropy
Suggests that a-bloggers tend to write consistently
about the same things over time
A-blogs Example
Cluster 28 in Win5; k =50
Cluster description: mobile, internet, weblog, web, patent
A-blogs
1) “Comunications: technology, economic and social issues at the intersection of
telecom, mobility and the Internet”
2) “IP Blawg”: technology and Intellectual property blog
3) “Small business IP management blog: Patent, Trademark, Copyright, Internet,
and Technology Law”
4) “Open Gardens: Wireless mobility, Digital convergence - Mobile web 2.0”
5) “Mobile Enterprise Weblog: the voice of enterprise mobility management”
C-blogs
1) “Digital Music Den: Digital Music, online music marketing”
2) “icarusindie.com – blog about nothing”: general computing and technology
3) “Dunkie's Saga” - personal blog: personal, cars, games, quizzes, some
technology
4) “Complex Christ – a vision for church that is organic, networked, decentralized,
bottom-up, emergent, communal, flexible, always evolving”
5) “Philips Brooks patent infringement updates”: legal blog on general patent
issues (pharmaceutical as well as technological)
Conclusions
We have accumulated empirical evidence to suggest
that a-bloggers are topic authorities
Tend to form tight subgroups close to cluster topic
definition
Consistently more similar to pages ranked by Google using
the cluster topic definition
Tend to stay together across different clusterings over time. In
other words, they tend to write regularly about the same
topic
What characteristics does the a-blogger have?
A blogger that is aware of a wider potential readership and
chooses his/her tags so that they can be understood easily
by others
Writes regularly in depth about fairly narrowly defined
subjects
New professional bloggers
Future Work
Produce tag hierarchies using hierarchical clustering
Combine this with the work of Hak Lae Kim and Dr.
Suk-Hyung Hwang:
Formal concept model of a blog social network using
tags and content analysis
Tag recommender: Enrich SIOC topic descriptions
with tag cloud meta data
Develop a set of style features to classify blogs
References
Hayes, C., Avesani, P. (2007) Using Tags and Clustering to
Identify Topic Relevant Blogs. In Proceedings of the
International Conference on Weblogs and Social Media
(ICWSM 07)
Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An
Analysis of the Use of Tags in a Blog Recommender
System. In Proceedings of IJCAI-07, the International Joint
Conference on Artificial Intelligence
Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An
Analysis of Bloggers and Topics for a Blog Recommender
System. In Proceedings of the Workshop on Web Mining
(Webmine 06), 7th European Conference on Machine
Learning and the 10th European Conference on Principles