Blog clustering
1. School of something
Computing
FACULTY OF ENGINEERING
OTHER
Blog Clustering and Community Discovery in the Blogosphere
An Overview
Ahmad Ammari
Research Fellow (User / Community Modelling)
2. OUTLINE
• Significance
• Research Challenges
• Network – Based Blog Clustering Approach
• Content – Based Blog Clustering Approach
• Hybrid – Based Blog Clustering Approach
• Evaluation
• Conclusion
3. The Blogosphere is Huge
100% growth rate every 5 months, consistently for the last 4 years
Over 120,000 new blogs created every day
1.4 new blogs every second
(Technorati, 2009)
4. Why Clustering Blogs?
• For Bloggers / Readers:
o Can focus on the clusters
they “belong to”
• Improve Recommender
Engines:
o Suggest related content to
other cluster members
o Suggest similar bloggers
to network / follow
5. Why Clustering Blogs?
• For Search Engines:
o Improve indexing
mechanisms
o Improve the delivery
of the search results
by organizing similar
results together
o Enhance the navigability of search results
• Meta Search Engine: Yippy / Clusty
• Retrieve results from many engines
• Cluster them into 'clouds' based on their contextual contents
6. Why Clustering Blogs?
• For Sociocultural / Political
Studies:
o Uncovering trending
social, cultural, & political
correlations within
blogging communities
• e.g. Harvard Arab
Blogosphere Study, 2009
o Baseline assessment of
networked public sphere in
Middle East Blogs
o Relationships to politics,
media, religion, culture,
international affairs
7. Research Challenges
• Existing approaches in webpage clustering & web community
discovery are explored in the blogosphere
• Applicability Challenges due to Key Differences between the
Blogosphere & the Web
Blog Posts                              | Web Pages
Short-lived References                  | Long-lived References
Monitoring Community Temporal Dynamics  | Relative Temporal Stability
Multi-Theme Contents                    | Focused Contents
Emergent Text Analysis                  | Traditional Text Analysis
Missing Citations                       | Available Citations
8. Blog Clusters Vs. Community Discovery
• Research Trend: Researchers more commonly leverage content
information to identify clusters of blog topics, and network
information to discover blog communities
• Proposal: Content and network information can both be used,
separately or combined, to identify blog topic clusters and/or
blog communities
12. k-Means Clustering
• Assign k centroids
Randomly
• Assign points to
closest centroids
• Recalculate and
move centroids
• Repeat until
centroids are stable
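
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names, the `seed` and `max_iterations` parameters, and the choice of picking the initial centroids from the data points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Plain k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: assign k centroids randomly (here: k distinct data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments (and hence centroids) are stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
```

On this toy input the two well-separated groups end up in different clusters; which group receives label 0 depends on the random initialisation.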
13. Content – Based Estimation of W
• Blog graph could be extremely sparse due to the casual nature
of bloggers
• Sparsity Solution:
o Edges between blogs are derived using content similarity
1) ε-neighbourhood
2) k Nearest Neighbor (kNN)
3) Fully Connected Graph
• Given:
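
As a rough illustration of the three graph constructions listed on this slide, the sketch below derives a weight matrix W from content vectors. Several things here are assumptions rather than details from the talk: cosine similarity as the content similarity, the reading of the ε-neighbourhood rule as "keep an edge when similarity is at least ε", the symmetrised kNN variant, and every function name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise content similarity, with no self-loops on the diagonal."""
    n = len(vectors)
    return [
        [cosine_similarity(vectors[i], vectors[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def epsilon_graph(sim, epsilon):
    """1) Keep only edges whose similarity is at least epsilon."""
    return [[s if s >= epsilon else 0.0 for s in row] for row in sim]


def knn_graph(sim, k):
    """2) Connect each blog to its k most similar blogs (symmetrised)."""
    n = len(sim)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            w[i][j] = sim[i][j]
            w[j][i] = sim[i][j]
    return w


def fully_connected_graph(sim):
    """3) Keep every pairwise similarity as an edge weight."""
    return [row[:] for row in sim]


# Four toy blogs described by four term counts each (illustrative data).
vectors = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1), (0, 0, 0, 1)]
sim = similarity_matrix(vectors)
w_eps = epsilon_graph(sim, 0.5)
w_knn = knn_graph(sim, 1)
w_full = fully_connected_graph(sim)
```

The first two constructions drop weak edges and so keep W sparse but informative, while the fully connected graph keeps every similarity and leaves the pruning to the clustering step.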
14. Content – Based Clustering Approaches
• Blog Contents are used to compute Similarity
• Text - Similarity Measure
o Cosine Measure
• Spherical k-Means
o Version of k-means clustering that uses cosine similarity
instead of Euclidean distance
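
A minimal sketch of spherical k-means follows, assuming one common formulation: document vectors are scaled to unit length, each document goes to the centroid with the highest cosine similarity (a dot product on unit vectors), and each centroid is the re-normalised sum of its members. The function names, the `seed` parameter and the toy vectors are illustrative assumptions.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v) if norm > 0 else tuple(v)


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))


def spherical_k_means(vectors, k, max_iterations=100, seed=0):
    """k-means on the unit sphere; returns (centroids, assignments)."""
    rng = random.Random(seed)
    unit = [normalize(v) for v in vectors]
    centroids = rng.sample(unit, k)  # random initial centroids from the data
    assignments = None
    for _ in range(max_iterations):
        # Assign each document to the centroid with the highest cosine similarity.
        new_assignments = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in unit
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # New centroid: sum of member vectors, projected back to unit length.
        for c in range(k):
            members = [p for p, a in zip(unit, assignments) if a == c]
            if members:
                summed = tuple(sum(coords) for coords in zip(*members))
                centroids[c] = normalize(summed)
    return centroids, assignments


# Two groups of toy documents that use disjoint vocabulary.
docs = [(2, 1, 0, 0), (4, 1, 0, 0), (1, 1, 0, 0),
        (0, 0, 1, 2), (0, 0, 3, 1), (0, 0, 1, 1)]
centroids, labels = spherical_k_means(docs, k=2)
```

Because only the direction of a vector matters, a long post and a short post with the same term proportions are treated as identical, which suits blog text of widely varying length.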
15. Content Pre-Processing
• Acronyms
o Urban Dictionary: http://www.urbandictionary.com/
o Edited by People
o 5,677,798 definitions since 1999
• Stop Words Removal
o Articles (a, an, the ..)
o Demonstratives (this, that, these ..)
o Conjunctions (for, and, both …)
o Quantifiers (all, few, many …)
o Prepositions (on, beneath, over …)
• Stemming
o Affix Stemmers, e.g. indefinitely → definite
o Porter’s stemmer (Suffix Stripping)
• Weighting
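
The stop-word removal, stemming and weighting stages can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the stop-word set contains only the examples named on the slide, the suffix stripper is a deliberately naive stand-in and not Porter's algorithm, and TF-IDF is chosen as the weighting scheme because the slide does not say which one was used.

```python
import math
import re
from collections import Counter

# Only the example words from the slide; a real list would be much longer.
STOP_WORDS = {
    "a", "an", "the", "this", "that", "these", "for", "and", "both",
    "all", "few", "many", "on", "beneath", "over",
}

# Naive suffix list (NOT Porter's stemmer), checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenize(text):
    """Lower-case the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Remove the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document (assumed TF-IDF scheme)."""
    token_lists = [preprocess(d) for d in documents]
    n = len(token_lists)
    document_frequency = Counter(
        term for tokens in token_lists for term in set(tokens)
    )
    weights = []
    for tokens in token_lists:
        term_frequency = Counter(tokens)
        weights.append({
            term: count * math.log(n / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return weights


tokens = preprocess("The bloggers are clustering these blogs")
weights = tf_idf(["blogs and blogging", "blogs on politics"])
```

A term that appears in every document, such as the stem "blog" in the two toy posts, receives weight zero, so the vectors passed to clustering emphasise the terms that distinguish one blog from another.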
18. Hybrid - based Clustering approach
• Blog Community can be defined as a set of nodes
in a graph that link more frequently within this set
than outside it and the set shares similar tags
(Java et al., 2008)
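
To make the definition concrete, the sketch below tests a candidate set of blogs against a crude reading of it: more links inside the set than links crossing its boundary, and at least one tag common to all members. This is not the hybrid clustering algorithm of Java et al.; the threshold rule, the "at least one common tag" interpretation, the function names and the toy data are all assumptions.

```python
def link_counts(edges, community):
    """Count links inside the set and links crossing its boundary."""
    members = set(community)
    internal = sum(1 for u, v in edges if u in members and v in members)
    boundary = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, boundary


def shared_tags(tags_by_blog, community):
    """Tags that every blog in the set carries."""
    tag_sets = [set(tags_by_blog.get(blog, ())) for blog in community]
    return set.intersection(*tag_sets) if tag_sets else set()


def looks_like_community(edges, tags_by_blog, community):
    """Assumed rule: denser inside than across the boundary, plus a common tag."""
    internal, boundary = link_counts(edges, community)
    return internal > boundary and bool(shared_tags(tags_by_blog, community))


# Hypothetical blog graph: a triangle a-b-c attached to a chain c-d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
tags_by_blog = {
    "a": ["politics", "news"],
    "b": ["politics"],
    "c": ["politics", "media"],
    "d": ["cooking"],
    "e": ["cooking"],
}
abc_ok = looks_like_community(edges, tags_by_blog, ["a", "b", "c"])
cd_ok = looks_like_community(edges, tags_by_blog, ["c", "d"])
```

The set {a, b, c} passes both checks, while {c, d} fails because most of its links leave the set and its members share no tag.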
19. Evaluation
• Data Set Description
• First Data Set: citation network of academic publications
o Six categories: Agents, Artificial Intelligence (AI), Databases
(DB), Human Computer Interaction (HCI), Information
Retrieval (IR) and Machine Learning (ML)
o Binary document-term matrix (Presence / Absence of Terms)
• Second Data Set: Subgraph of the Weblogging Ecosystems (WWE)
workshop data set
o Tags fetched from del.icio.us, a well-known social
bookmarking site
o Corresponding Homepages downloaded
• Clustering performance compared between the Hybrid and
NCut (Network – based) Approaches
20. Tag Distribution in Discovered Communities
Top five tags associated with 10 communities found using the NCut Approach
Top five tags associated with 10 communities found using Hybrid Clustering
23. Conclusion
• Both content and network information can be used to
identify blog clusters or blog communities
• Accompanying network information with content information
(user – defined tags, unstructured contents, agglomerative
terms / features) leads to more coherent blog clusters and
more distinct blog communities than network – based
information alone
• Matrix Factorization Techniques (LSA, SVD) reduce
Sparsity and High Dimensionality of Content – based
Clustering Information whereas Threshold – based filtration
techniques are used
• More work is needed to account for temporal dynamics in
blog clustering, so that blogging interaction patterns and
community evolution can be monitored
24. Thank You