Collaborative Similarity Measure for Intra-Graph Clustering
1. The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea.
The 3rd International Workshop on Social Networks and Social Web Mining*
Collaborative Similarity Measure
for Intra-Graph Clustering*
Waqas Nawaz, Young-Koo Lee, Sungyoung Lee
Department of Computer Engineering, Kyung Hee University, Korea
Thursday, April 19, 2012
Presenter
Waqas Nawaz
Data and Knowledge Engineering (DKE) Lab, Kyung Hee University Korea
2. Agenda
Motivation
Related Work
Proposed Method (CSM-IGC)
Experiments
Conclusion & Future Directions
Data & Knowledge Engineering Lab 2
3. Graphs with Multiple Attributes
Attribute of Authors
Coauthor Network of Top 200 Authors on TEL from DBLP
from manyeyes.alphaworks.ibm.com
4. Related Work
Structure based clustering
Normalized cuts [Shi and Malik, TPAMI 2000]
Modularity [Newman and Girvan, Phys. Rev. 2004]
Scan [Xu et al., KDD'07]
The clusters generated have a rather random distribution of
vertex properties within clusters
OLAP-style graph aggregation
K-SNAP [Tian et al., SIGMOD’08]
Attributes compatible grouping
The clusters generated have a rather loose intra-cluster
structure
5. Example: A Coauthor Network
[Figure: coauthor network of authors r1-r11 labeled by research topic. r1, r2, r4, r5, r6, r7, r8: XML; r3: XML, Skyline; r9, r10, r11: Skyline. Panels: Traditional Coauthor Graph, Structure-based Cluster, Attribute-based Cluster, Structural/Attribute Cluster.]
*https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt
6. Related Work (cont…)
Structure/Attribute based clustering
SA-Cluster [Yang Zhou et al., VLDB’ 2009]
• Modify the structure of the original graph
– add dummy vertex w.r.t each attribute instance
– Sparse matrix and space inefficient
• Neighborhood random walk: Matrix multiplication is performed
iteratively
• Fixed edge weights, and automatically update attribute weights
Scalability issue for medium & large graphs (time complexity)
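The neighborhood random walk step can be illustrated with a minimal sketch. It assumes the truncated random walk with restart, R = sum over l = 1..L of c(1-c)^l P^l, which is our reading of the SA-Cluster distance; the function names, the restart probability c = 0.2, the walk length L = 3, and the toy path graph are all assumptions, and the dummy attribute vertices are left out. Each extra walk step costs one more matrix product, which is where the running time grows with graph size.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(adj):
    """Row-normalise an adjacency matrix into one-step walk probabilities."""
    return [[w / sum(row) for w in row] for row in adj]


def random_walk_scores(adj, c=0.2, length=3):
    """Sum c * (1 - c)**l * P**l for l = 1..length (assumed formulation)."""
    p = transition_matrix(adj)
    n = len(p)
    power = [row[:] for row in p]          # holds P**l, starting at l = 1
    scores = [[0.0] * n for _ in range(n)]
    for l in range(1, length + 1):
        coeff = c * (1 - c) ** l
        for i in range(n):
            for j in range(n):
                scores[i][j] += coeff * power[i][j]
        power = matmul(power, p)           # one more matrix product per step
    return scores


# Hypothetical 4-vertex path graph: 0 - 1 - 2 - 3
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
R = random_walk_scores(ADJ)
```

In this toy graph, vertex 0 scores higher against its direct neighbor 1 than against the far end of the path, vertex 3.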
7. Two-Fold Objective
A desired clustering of attributed graph should achieve
a good balance between the following:
Structural cohesiveness: Vertices within one cluster are
close to each other in terms of structure, while vertices
between clusters are distant from each other
Attribute homogeneity: Vertices within one cluster have
similar attribute values, while vertices between clusters have
quite different attribute values
And it should be Scalable to medium scale graphs
8. Different Graph Clustering Approaches
Structure-based Clustering
Vertices with heterogeneous values in a cluster
Attribute-based Clustering
Lose much structure information
Structural/Attribute Cluster
Homogeneous vertices along with structure information, at the
expense of time complexity
Intra-Graph Clustering
Scalable while considering both aspects
9. Proposed Solution
System Architecture Diagram
INPUT Processing Phase OUTPUT
10. Phase 1
Similarity Estimation (Inspired by Jaccard Index [1])
Interaction of vertices (topology or structure)
• Weighted fraction of shared neighbors
• It will be zero for disconnected vertices
• Example: Structural similarity among vertices
– SIM(V1, V2) = (1/3)*5 = 1.667
– SIM(V1, V3) = (1/4)*4 = 1.0
– SIM(V2, V3) = (1/4)*3 = 0.75
– SIM(V1, V4) = (1/4)*0 = 0.0 (not directly connected)
• Transitive property for indirectly connected vertices
– SIM(V1, V4) = SIM(V1, V3) * SIM(V3, V4)
[1] P. Jaccard, Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)
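The arithmetic in the structural example can be reproduced with a minimal sketch. It assumes one reading of the slide: the Jaccard fraction of shared (open) neighborhoods multiplied by the weight of the connecting edge, with weight 0 when there is no edge. The figure is not part of this transcript, so the graph below is hypothetical: only the weights 5, 4 and 3 come from the slide, and the other edges were chosen so that the neighbor counts match. The helper names are illustrative.

```python
# Hypothetical weighted, undirected graph. Edges V1-V2, V1-V3 and V2-V3 carry
# the weights 5, 4 and 3 used on the slide; the remaining edges and their
# weights are assumptions added so that the neighbor counts match.
EDGES = {
    ("V1", "V2"): 5.0,
    ("V1", "V3"): 4.0,
    ("V2", "V3"): 3.0,
    ("V3", "V4"): 1.0,
    ("V4", "V5"): 1.0,
    ("V4", "V6"): 1.0,
}


def neighbors(v):
    """Open neighborhood of v: every vertex that shares an edge with v."""
    return {a if b == v else b for (a, b) in EDGES if v in (a, b)}


def edge_weight(a, b):
    """Weight of edge a-b, or 0.0 when the vertices are not directly connected."""
    return EDGES.get((a, b), EDGES.get((b, a), 0.0))


def structural_sim(a, b):
    """Jaccard fraction of shared neighbors, scaled by the edge weight."""
    shared = neighbors(a) & neighbors(b)
    union = neighbors(a) | neighbors(b)
    if not union:
        return 0.0
    return len(shared) / len(union) * edge_weight(a, b)


def path_sim(path):
    """Product of direct similarities along a given path (transitive rule)."""
    result = 1.0
    for a, b in zip(path, path[1:]):
        result *= structural_sim(a, b)
    return result
```

Under these assumptions, `structural_sim("V1", "V2")` evaluates (1/3)*5 and `structural_sim("V1", "V4")` is zero because the two vertices share no edge.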
12. Phase 1 (cont…)
Similarity Estimation (Inspired by Jaccard Index [1])
Context of vertices (attributes regularity)
• Weighted fraction of shared attribute instances
• It will be zero for contextually disjoint vertices
• Example: Contextual similarity among vertices
– Let Wa1 = 1 and Wa2 = 2; then
– SIM(V1, V3) = (2/2) = 1.0
– SIM(V3, V4) = (1/2) = 0.5
– SIM(V1, V4) = 0.0
[1] P. Jaccard, Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)
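One reading that is consistent with the numbers in the contextual example is sketched below: add up the weights of the attributes on which two vertices hold the same value, then divide by the number of attributes. This is an assumption, since the slide's formula and figure are not in the transcript. The attribute values `x`, `y`, `p`, `q` are hypothetical and were picked so that the three results match the slide.

```python
# Attribute weights from the slide's example.
WEIGHTS = {"a1": 1.0, "a2": 2.0}

# Hypothetical attribute values, picked to match the slide's numbers.
ATTRS = {
    "V1": {"a1": "x", "a2": "p"},
    "V3": {"a1": "y", "a2": "p"},
    "V4": {"a1": "y", "a2": "q"},
}


def contextual_sim(a, b):
    """Sum of weights of attributes with equal values, over the attribute count."""
    shared = sum(w for name, w in WEIGHTS.items()
                 if ATTRS[a][name] == ATTRS[b][name])
    return shared / len(WEIGHTS)
```

Here V1 and V3 agree only on a2 (weight 2), V3 and V4 agree only on a1 (weight 1), and V1 and V4 agree on nothing.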
16. Example
Fig. 3. Scenarios for similarity between source (green) and destination (red) nodes following some intermediate nodes (yellow): (a) no direct path exists, (b) directly connected, (c) indirectly connected, shortest path.

Table 2. (a) Collaborative similarity among the vertices given in Fig. 3-c using the Collaborative Similarity Measure CSim(v_a, v_b); (b) clustering results by varying the number of clusters (K); quality of each measure is calculated using Density and Entropy.

(a)
      V1    V2    V3    V4    V5    V6
V1    1     2.67  1.17  0.20  0.18  0.18
V2    2.67  1     0.92  0.15  0.14  0.14
V3    1.17  0.92  1     0.17  0.15  0.15
V4    0.2   0.15  0.17  1     0.92  0.92
V5    0.18  0.14  0.15  0.92  1     2.5
V6    0.18  0.14  0.15  0.92  2.5   1

(b)
K   Clustered Vertices           Density  Entropy
2   {V1,V2,V3},{V4,V5,V6}        0.42     0.133
3   {V1,V3},{V2},{V4,V5,V6}      0.28     0.084
4   {V5},{V6},{V4},{V1,V2,V3}    0.21     0.084
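The deck's conclusion states that a k-Medoid framework groups the vertices. A minimal sketch of that idea, run on the similarity values of Table 2(a), is given below. The assignment and update rules, the function names, and the initial medoids V1 and V4 are assumptions; the authors' exact procedure is not shown in the slides. Because some off-diagonal values exceed the diagonal value of 1, medoids are pinned to their own cluster.

```python
VERTICES = ["V1", "V2", "V3", "V4", "V5", "V6"]

# Collaborative similarity values copied from Table 2(a).
CSIM = [
    [1.00, 2.67, 1.17, 0.20, 0.18, 0.18],
    [2.67, 1.00, 0.92, 0.15, 0.14, 0.14],
    [1.17, 0.92, 1.00, 0.17, 0.15, 0.15],
    [0.20, 0.15, 0.17, 1.00, 0.92, 0.92],
    [0.18, 0.14, 0.15, 0.92, 1.00, 2.50],
    [0.18, 0.14, 0.15, 0.92, 2.50, 1.00],
]


def assign(medoids):
    """Attach every vertex to its most similar medoid; medoids stay put."""
    clusters = {m: [m] for m in medoids}
    for v in range(len(CSIM)):
        if v not in clusters:
            best = max(medoids, key=lambda m: CSIM[v][m])
            clusters[best].append(v)
    return clusters


def best_medoid(members):
    """Member with the largest total similarity to the rest of its cluster."""
    return max(members,
               key=lambda c: sum(CSIM[c][o] for o in members if o != c))


def k_medoids(initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(medoids)
        updated = [best_medoid(clusters[m]) for m in medoids]
        if updated == medoids:
            break
        medoids = updated
    clusters = assign(medoids)
    return [sorted(VERTICES[v] for v in clusters[m]) for m in medoids]


# Hypothetical starting medoids: V1 (index 0) and V4 (index 3).
RESULT = k_medoids([0, 3])
```

With these starting medoids the sketch ends on the grouping listed for K = 2 in Table 2(b).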
17. Experiments
Real Dataset
Political Blogs Dataset: 1490 vertices, 19090 edges, one
attribute (political leaning) with two values:
• Liberal
• Conservative
Methods
K-SNAP: Attributes only
S-Cluster: Structure-based clustering
W-Cluster: Weighted random walk strategy
SA-Cluster: Consider both factors (matrix manipulation)
IGC-CSM: Our proposed method
18. Evaluation Metrics
Density*: intra-cluster structural cohesiveness
Entropy*: intra-cluster attribute homogeneity
*Yang Zhou et al., Graph Clustering Based on Structural/Attribute Similarities, Proceedings of the VLDB Endowment, France (2009)
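The formulas for the two metrics do not survive in this transcript. The sketch below follows the definitions as we understand them from the cited work: density is the fraction of all edges whose endpoints lie in the same cluster, and entropy is the cluster-size-weighted Shannon entropy of an attribute. It handles a single attribute, so per-attribute weighting is omitted; the function names and the toy graph are hypothetical.

```python
from collections import Counter
from math import log2


def density(edges, clusters):
    """Share of all edges whose two endpoints fall in the same cluster."""
    label = {v: i for i, members in enumerate(clusters) for v in members}
    inside = sum(1 for a, b in edges if label[a] == label[b])
    return inside / len(edges)


def entropy(values, clusters):
    """Size-weighted average Shannon entropy of one attribute per cluster."""
    total = sum(len(members) for members in clusters)
    result = 0.0
    for members in clusters:
        counts = Counter(values[v] for v in members)
        h = -sum((n / len(members)) * log2(n / len(members))
                 for n in counts.values())
        result += len(members) / total * h
    return result


# Hypothetical toy graph: two triangles joined by the edge 2-3.
TOY_EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
TOY_LEANING = {0: "liberal", 1: "liberal", 2: "liberal",
               3: "conservative", 4: "conservative", 5: "conservative"}

GOOD = [[0, 1, 2], [3, 4, 5]]      # cohesive and homogeneous
BAD = [[0, 3], [1, 4], [2, 5]]     # no internal edges, mixed attribute values
```

Higher density and lower entropy indicate a better clustering: on the toy graph, `GOOD` keeps six of the seven edges inside clusters with pure attribute values, while `BAD` keeps none and mixes both values in every cluster.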
19. Evaluation Metrics (cont…)
F-Measure*: has the ability to evaluate the collective
qualitative nature of the formed cluster
*Tijn Witsenburg et al., Improving the Accuracy of Similarity Measures by Using Link Information, International Symposium on
Methodologies for Intelligent Systems Edition 9, Poland (2011)
20. Results (Time Complexity)
Synthetic Dataset Graph size vs. time
Varying No. of Nodes
*http://www-personal.umich.edu/mejn/netdata
Real Dataset
Political Blog*
No. of Clusters vs. Time
21. Results (Quality)
Density Evaluation
Clusters vs. Density Value
Entropy Evaluation
Clusters vs. Entropy Value
23. Conclusion
We study the problem of graph node clustering based
on homogeneous characteristics in terms of context
and topology
A collaborative similarity measure is proposed to reflect the
relationship between each pair of vertices
A k-Medoid clustering framework is adopted for grouping
similar nodes
The resulting solution is evaluated using state-of-the-art
evaluation measures:
Density, Entropy, and F-measure
The method is comparatively scalable to medium-scale graphs
without compromising on the quality of results
24. Thanks
Any Questions…?
wicky786@khu.ac.kr
yklee@khu.ac.kr
sylee@oslab.khu.ac.kr
Editor's notes
Many graphs have vertex attributes, including social networks, the World Wide Web, and sensor networks. Let's look at an example: a coauthor network of the top 200 authors on technology-enhanced learning from DBLP, where a vertex represents an author and an edge represents the coauthor relationship between two authors. Each author has multiple attributes: ID, name, affiliation, research interests, number of coauthors, number of publications, and so on.
There are mainly three approaches: structure-based clustering, OLAP-style graph aggregation, and structural/attribute clustering. Structure-based clustering includes, for example, normalized cuts by Shi and Malik, modularity by Newman and Girvan, and Scan by Xu et al. It considers only structural similarity and ignores the vertex attributes. Therefore, the clusters generated have a rather random distribution of vertex properties within clusters. For the second approach, there is a recent study, K-SNAP, by Tian et al. It follows attribute-compatible grouping. As a result, the clusters generated have a rather loose intra-cluster structure.
In this paper, we study the problem of “An Intra-Graph Clustering Based on Collaborative Similarity Measure”. The objective is two-fold. A desired clustering should achieve a good balance between the following two properties. The first is structural cohesiveness, which means vertices within one cluster are close to each other in terms of structure, while vertices between clusters are distant from each other. The second is attribute homogeneity, which means vertices within one cluster have similar attribute values, while vertices between clusters have quite different attribute values. The method should also be scalable to medium (and large) scale graphs [in terms of time complexity without compromising on the quality of the results].
For structure-based clustering, although vertices within clusters are closely connected, they could have quite different attribute values. For attribute-based clustering, although vertices within clusters have the same attribute values, much structure information may be lost. For structural/attribute clustering, vertices within clusters are both homogeneous and closely connected, and the graph keeps most of its structure information. Intra-Graph Clustering, on the other hand, considers both factors (structure and homogeneity) even for medium-scale graphs, and it performs comparatively better in time than the state-of-the-art method.