Conor Hayes - Topics, tags and trends in the blogosphere

WebCamp 07 Social Networks
Topics, Tags and Trends in the Blogosphere
Conor Hayes
DERI, Galway

Outline
 Blogs and the Blogosphere
 Linking Blogs - Content vs. Tags
 Topics and Bloggers
 User entropy
 Topic drift
 Blog reactivity
 Identifying consistent, topic-relevant blogs

Blogs
 Web site with journal-style entries
 Dated in reverse chronological order
 Generally written by a single user
 Regularly updated
 Distributed publishing
 Easy to maintain and update
 Exponential growth:
 Technorati: 50 million blogs (July 2006), doubling in size every 6 months
 Increasingly important indicator of public opinion on politics, technology, current affairs

Blogs vs. Usenet
 Blogosphere is user-centred
 Distributed architecture
 Topic organisation is locally defined by tags
 Not easy to find relevant posts related to the same topic
 Usenet is topic-centred
 Logically centralised architecture
 Topic organisation is defined a priori by newsgroup heading and subject headers
 Users know where to go to find information on a particular topic

A topic-centred blogosphere?
 Semantic Web
 Link blogs using machine-readable metadata: SIOC
 Tagging
 Tags: simple propositional entities, locally defined
 Link analysis:
 Majority of blogs have few or no inward connections
 Blog Roll, Comment List
 Relatively static group
 ‘Conventional’ knowledge discovery techniques
 Clustering + online recommender systems

Nearest Neighbour Recommender
 Method:
1. Periodically, identify a set of nearest neighbour blogs, one set for each topic the user is interested in.
2. Select matching posts from these neighbours.
 What are the implications of user drift?
 How quickly does the neighbourhood set change?
 What is the relationship between users and topics?
 How consistently are bloggers attached to topics?
 Which neighbours consistently provide the most topic-relevant information?
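
The two-step method above can be sketched roughly as follows. This is only an illustration: the sparse term-weight dictionaries, the cosine measure, the function names and the `threshold` parameter are all assumptions, not details taken from the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbours(topic_profile, blog_profiles, n=3):
    """Step 1: rank blogs by similarity to one topic of interest, keep the top n."""
    ranked = sorted(
        blog_profiles,
        key=lambda blog: cosine(topic_profile, blog_profiles[blog]),
        reverse=True,
    )
    return ranked[:n]

def recommend(topic_profile, posts_by_blog, neighbours, threshold=0.2):
    """Step 2: select posts from the neighbour blogs that match the topic.

    posts_by_blog maps a blog name to a list of (title, term-weight dict)
    pairs; this layout is hypothetical.
    """
    hits = []
    for blog in neighbours:
        for title, vector in posts_by_blog.get(blog, []):
            score = cosine(topic_profile, vector)
            if score >= threshold:
                hits.append((score, blog, title))
    return sorted(hits, reverse=True)
```

Because step 1 is rerun periodically, the questions on this slide amount to asking how much the output of `nearest_neighbours` changes from one run to the next.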

Experiments
 We cluster blog data over different time periods
 User entropy: measures whether bloggers remain together over time
 Topic drift: measures blogger behaviour in relation to topic growth and drift
 We identify the most relevant blogs in each cluster using tag analysis

Data
 We collected blog data from Jan 16 to Feb 27, 2006
 7200 blogs in total
 We created 6 data sets, one for each week
 Mean of 4250 blogs per week
 70% overlap between consecutive weeks
 Each instance in each data set contains the posts from a single tag, from a single blogger
 An instance is included in the data set for a week only if the user has posted in that week
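
One way the weekly data sets might be assembled is sketched below. The `(blogger, tag, post_date, text)` tuple layout is hypothetical, and the sketch assumes that an instance holds only the posts made under that tag within the week in question, which is one reading of the inclusion rule above.

```python
from collections import defaultdict
from datetime import date

START = date(2006, 1, 16)  # first day of the first weekly window
NUM_WEEKS = 6              # Jan 16 to Feb 27, 2006

def build_weekly_datasets(posts):
    """Group posts into one data set per week.

    Each data set maps a (blogger, tag) instance to the list of post texts
    written under that tag during that week. A blogger with no posts in a
    given week therefore contributes no instance to that week's data set.
    """
    datasets = [defaultdict(list) for _ in range(NUM_WEEKS)]
    for blogger, tag, post_date, text in posts:
        week = (post_date - START).days // 7
        if 0 <= week < NUM_WEEKS:  # ignore posts outside the collection period
            datasets[week][(blogger, tag)].append(text)
    return [dict(d) for d in datasets]
```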

Clustering
 Goals: uncover latent structures reflecting topics in the collection and provide a means of summarisation
 Spherical k-means: partitions the document corpus into k disjoint groups of documents
 Produces an interpretable concept summary for each group
 Clustering quality:
 Blogs in the same cluster should be similar
 Blogs in different clusters should be dissimilar
 $H_r$: ratio of intra- to inter-cluster similarity
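
A minimal NumPy sketch of spherical k-means and a per-cluster quality ratio follows. The slide does not give the exact definition of $H_r$; here it is assumed to be the mean cosine similarity of a cluster's members to its centroid divided by the mean similarity of all other documents to that centroid. The function names and defaults are illustrative.

```python
import numpy as np

def normalise(X):
    """Scale each row to unit length so that dot products are cosines."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Partition the rows of X into k disjoint groups by cosine similarity."""
    X = normalise(np.asarray(X, dtype=float))
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for r in range(k):
            members = X[labels == r]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[r] = normalise(members.sum(axis=0, keepdims=True))[0]
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

def cluster_quality(X, labels, centroids):
    """Assumed H_r: intra-cluster over inter-cluster similarity, per cluster."""
    X = normalise(np.asarray(X, dtype=float))
    sims = X @ centroids.T
    ratios = []
    for r in range(len(centroids)):
        inside = sims[labels == r, r]
        outside = sims[labels != r, r]
        if len(inside) == 0 or len(outside) == 0:
            ratios.append(float("nan"))
        else:
            ratios.append(inside.mean() / max(outside.mean(), 1e-12))
    return ratios
```

Each unit-length centroid doubles as the concept summary for its group: its highest-weighted terms describe the cluster.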

Clustering….

Partitions the document space

At window $win_{t+n}$?

Methodology
 We cluster each data set in date order at different values of k
 We reuse the cluster centroids from window t to seed the clusters in window t+1
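
The seeding step can be sketched as below, assuming every window's documents are represented in the same feature space. The function names are illustrative, and the first window's seeds are passed in explicitly for simplicity.

```python
import numpy as np

def unit_rows(X):
    """Return a copy of X with every row scaled to unit length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def cluster_from_seeds(X, seeds, n_iter=10):
    """Spherical k-means started from the given centroids, not random ones."""
    X = unit_rows(np.asarray(X, dtype=float))
    centroids = unit_rows(np.asarray(seeds, dtype=float))
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for r in range(len(centroids)):
            members = X[labels == r]
            if len(members):  # an empty cluster keeps its seed centroid
                centroids[r] = unit_rows(members.sum(axis=0, keepdims=True))[0]
    return np.argmax(X @ centroids.T, axis=1), centroids

def cluster_windows(windows, first_seeds):
    """Cluster the windows in date order, seeding window t+1 from window t."""
    results = []
    seeds = first_seeds
    for X in windows:
        labels, centroids = cluster_from_seeds(X, seeds)
        results.append((labels, centroids))
        seeds = centroids  # carried forward to the next window
    return results
```

A side effect of seeding this way is that cluster index r in one window corresponds to cluster index r in the next, which is what lets users and centroids be compared across windows.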

User Drift
 We define user entropy $U_r$: a measure of the degree of user dispersion between windows $win_t$ and $win_{t+n}$
$q$: number of clusters at $win_{t+n}$ containing users from cluster $r$
$n_r^i$: number of users from cluster $r$ contained in cluster $i$ at $win_{t+n}$
$n_r$: number of users from cluster $r$ available at $win_{t+n}$
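
A sketch of how a dispersion measure could be computed from these quantities follows. It assumes the Shannon entropy of the proportions $n_r^i / n_r$, divided by $\log q$ so that values fall between 0 and 1. Both the entropy form and the normalisation are assumptions rather than the definition used in the talk, and the function name and argument layout are illustrative.

```python
import math
from collections import Counter

def user_entropy(users_in_r, later_assignment):
    """Dispersion of cluster r's users across the clusters of a later window.

    users_in_r: users belonging to cluster r at win_t.
    later_assignment: dict mapping each user present at win_{t+n}
        to that user's cluster id in that window.
    Returns 0.0 if the users stayed together, 1.0 if they are spread evenly
    over q clusters, and None if none of them is available at win_{t+n}.
    """
    available = [u for u in users_in_r if u in later_assignment]
    n_r = len(available)
    if n_r == 0:
        return None
    counts = Counter(later_assignment[u] for u in available)  # the n_r^i values
    q = len(counts)
    if q == 1:
        return 0.0
    entropy = -sum((c / n_r) * math.log(c / n_r) for c in counts.values())
    return entropy / math.log(q)  # assumed normalisation
```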

$U_r$ as the Interval Increases

Proportion of Users
[Figure legend: mean fraction of dataset contained in top 20% of clusters; mean fraction of dataset contained in bottom 20% of clusters]

User Drift vs. Cluster Strength
[Figure: mean Pearson correlation between $H_r$ and $U_r$ (y-axis, 0 to -0.9) plotted against $k$ (x-axis, 0 to 100)]
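
For reference, a Pearson correlation between per-cluster values at a single k can be computed as sketched below. The function name is illustrative, and the inputs would be the per-cluster $H_r$ and $U_r$ values, which are not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

A negative value means that clusters with a higher quality ratio tend to show lower user entropy.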

User Drift Conclusions
 Even where n is low, user dispersion occurs
 As n increases, user entropy also increases, suggesting that the ‘relationship’ between users based on shared topics is short-lived
 User dispersion is related to cluster strength
 Strong clusters experience less user drift than weak clusters
 However, the fraction of data from strong clusters is smaller than the fraction from weak clusters, by at least a factor of 2
 We will return to user entropy later in the talk

Topic drift
 Inter-window similarity $W_r^{t+n}$
 $W_r^{t+n}$ for a cluster $r$ at $win_t$ is the similarity between the centroid of cluster $r$ and the centroid of the corresponding cluster $r$ at $win_{t+n}$:
$W_r^{t+n} = \cos(C_{r,t}, C_{r,t+n})$
 Intuitively, $W_r^{t+n}$ is a measure of the drift of the concept centroid, $C_r$, from $win_t$ to $win_{t+n}$
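
The inter-window similarity can be computed directly from two sets of centroids, as in the sketch below. It assumes the centroids are dense vectors over the same vocabulary and that position r in each list refers to the same cluster in both windows; the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def inter_window_similarity(centroids_t, centroids_t_n):
    """W_r for every cluster r: cos(C_{r,t}, C_{r,t+n})."""
    return [cosine(c_t, c_tn) for c_t, c_tn in zip(centroids_t, centroids_t_n)]
```

A value near 1 means the concept centroid has barely moved between the two windows; lower values indicate stronger topic drift.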

Topic Drift vs. User Drift
[Figure: mean Pearson correlation between $W_r$ and $U_r$ (y-axis, 0 to -0.9) plotted against $k$ (x-axis, 0 to 100)]

A Model of Topic Drift
 Topic drift is related to user drift, but topics may be more stable than users
 By observation: rate of topic change is less than rate of user drift
 Our analysis so far would suggest that users are not firmly fixed to topics; rather, they drift between topics over time
[Figure: Type a = {X, Y, Z}; Type b = {P, Q, R}]

Muhammad Cartoon Controversy
[Figure: timeline over the weekly windows starting Jan 16, Jan 23, Jan 30, Feb 06, Feb 13, Feb 20]

Observations
 The blogosphere responds quickly to breaking news stories
 The relationship between topic and user drift is pronounced where topic drift is extreme
 Otherwise, there is steady turnover of users around relatively stable concepts
 Users ‘float’ between topics

Identifying the most relevant blogs
 Technorati uses a tag cloud to link posts by different
blogs together

Tag clouds
 In previous work we showed that tags perform badly
at grouping similar blog posts together

Tag clouds
 Clustering followed by tag analysis allowed us to determine which clusters contain strong concepts
 It also allowed us to fragment the global tag space to produce local tag clouds

A-bloggers
 A-bloggers: only a portion (< 0.4) of the blogs in each cluster contribute tags to the local tag cloud description; these are the a-bloggers, and the remaining blogs in the cluster are the c-bloggers

A-bloggers are
1. more similar to each other than c-bloggers
2. more similar to the cluster centroid (topic definition) than c-bloggers
3. more similar to pages retrieved from Google using the topic description

Entropy: a-blogs vs. c-blogs
 A-blog entropy is lower
 As the interval increases, a-blogs experience smaller increases in entropy
 Suggests that a-bloggers tend to write consistently about the same things over time

A-blogs Example
 Cluster 28 in $win_5$; k = 50
 Cluster description: mobile, internet, weblog, web, patent
A-blogs
1) “Comunications: technology, economic and social issues at the intersection of telecom, mobility and the Internet”
2) “IP Blawg”: technology and intellectual property blog
3) “Small business IP management blog: Patent, Trademark, Copyright, Internet, and Technology Law”
4) “Open Gardens: Wireless mobility, Digital convergence - Mobile web 2.0”
5) “Mobile Enterprise Weblog: the voice of enterprise mobility management”
C-blogs
1) “Digital Music Den: Digital Music, online music marketing”
2) “icarusindie.com – blog about nothing”: general computing and technology
3) “Dunkie's Saga”: personal blog covering personal matters, cars, games, quizzes, some technology
4) “Complex Christ – a vision for church that is organic, networked, decentralized, bottom-up, emergent, communal, flexible, always evolving”
5) “Philips Brooks patent infringement updates”: legal blog on general patent issues (pharmaceutical as well as technological)

Conclusions
 We have accumulated empirical evidence to suggest that a-bloggers are topic authorities
 Tend to form tight subgroups close to the cluster topic definition
 Consistently more similar to pages ranked by Google using the cluster topic definition
 Tend to stay together across different clusterings over time; in other words, they tend to write regularly about the same topic
 What characteristics does the a-blogger have?
 A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others
 Writes regularly and in depth about fairly narrowly defined subjects
 New professional bloggers

Future Work
 Produce tag hierarchies using hierarchical clustering
 Combine this with the work of Hak Lae Kim and Dr. Suk-Hyung Hwang:
 Formal concept model of a blog social network using tags and content analysis
 Tag recommender: enrich SIOC topic descriptions with tag cloud metadata
 Develop a set of style features to classify blogs

References
 Hayes, C., Avesani, P. (2007) Using Tags and Clustering to Identify Topic Relevant Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 07)
 Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence
 Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 7th European Conference on Machine Learning and the 10th European Conference on Principles
