Collaborative Similarity Measure for Intra-Graph Clustering

The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea.
The 3rd International Workshop on Social Networks and Social Web Mining

Waqas Nawaz, Young-Koo Lee, Sungyoung Lee
Department of Computer Engineering, Kyung Hee University, Korea

Thursday, April 19, 2012

Presenter
Waqas Nawaz

Data and Knowledge Engineering (DKE) Lab, Kyung Hee University Korea

Agenda

Motivation

Related Work

Proposed Method (CSM-IGC)

Experiments

Conclusion & Future Directions


Graphs with Multiple Attributes

Attribute of Authors

Coauthor Network of Top 200 Authors on TEL from DBLP
from manyeyes.alphaworks.ibm.com


Related Work

  Structure-based clustering
  Normalized cuts [Shi and Malik, TPAMI 2000]
  Modularity [Newman and Girvan, Phys. Rev. E 2004]
  SCAN [Xu et al., KDD'07]
The resulting clusters have a rather random distribution of
vertex attributes within each cluster

  OLAP-style graph aggregation
  K-SNAP [Tian et al., SIGMOD'08]
  Attribute-compatible grouping
The resulting clusters have a rather loose intra-cluster
structure


Example: A Coauthor Network
[Figure: coauthor graph of authors r1 to r11, each labeled with research topics. r1, r2, r4, r5, r6, r7, r8: XML; r3: XML, Skyline; r9, r10, r11: Skyline. Panels: Traditional Coauthor Graph, Structure-based Cluster, Attribute-based Cluster, Structural/Attribute Cluster.]
*https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt


Related Work (cont…)

  Structure/Attribute-based clustering
  SA-Cluster [Zhou et al., VLDB'09]
• Modifies the structure of the original graph
– Adds a dummy vertex for each attribute instance
– The resulting matrix is sparse and space-inefficient
• Neighborhood random walk: matrix multiplication is performed
iteratively

• Edge weights are fixed, while attribute weights are updated automatically

Scalability issue for medium & large graphs (time complexity)
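
The graph modification attributed to SA-Cluster above (one dummy vertex per attribute instance) can be illustrated with a minimal sketch. The function name `build_augmented_graph`, the edge labels, and the tiny coauthor sample are assumptions made for illustration only; they are not taken from the SA-Cluster paper or from the figure.

```python
def build_augmented_graph(edges, attributes):
    """Return (vertices, edges) of an attribute-augmented graph.

    edges:      iterable of (u, v) structural edges
    attributes: dict mapping vertex -> {attribute name: value}
    """
    aug_vertices = set(attributes)
    aug_edges = []
    for u, v in edges:
        aug_vertices.update((u, v))
        aug_edges.append((u, v, "structure"))
    for vertex, values in attributes.items():
        for name, value in values.items():
            # One dummy vertex per attribute instance, shared by every
            # vertex that carries this (name, value) pair.
            dummy = ("attr", name, value)
            aug_vertices.add(dummy)
            aug_edges.append((vertex, dummy, "attribute"))
    return aug_vertices, aug_edges


# Hypothetical sample: three authors, one attribute ("topic").
coauthors = [("r1", "r2"), ("r2", "r9")]
topics = {
    "r1": {"topic": "XML"},
    "r2": {"topic": "XML"},
    "r9": {"topic": "Skyline"},
}
vertices, aug = build_augmented_graph(coauthors, topics)
print(len(vertices), len(aug))
```

Every distinct attribute value adds a vertex and every vertex adds one edge per attribute, which is where the extra space cost noted on the slide comes from.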


Two-Fold Objective

  A desired clustering of an attributed graph should achieve
a good balance between the following:

  Structural cohesiveness: Vertices within one cluster are
close to each other in terms of structure, while vertices
in different clusters are distant from each other

  Attribute homogeneity: Vertices within one cluster have
similar attribute values, while vertices in different clusters have
quite different attribute values

  It should also scale to medium-scale graphs


Different Graph Clustering Approaches

  Structure-based Clustering
  Vertices with heterogeneous attribute values end up in the same cluster

  Attribute-based Clustering
  Loses much structural information

  Structural/Attribute Clustering
  Homogeneous vertices with structure preserved, at the
expense of time complexity

  Intra-Graph Clustering
  Scalable while considering both aspects

Proposed Solution

 System Architecture Diagram

[Diagram stages: INPUT, Processing Phase, OUTPUT]


Phase 1

  Similarity Estimation (inspired by the Jaccard Index [1])
  Interaction of vertices (topology or structure)
• Weighted fraction of shared neighbors

• It is zero for disconnected vertices
• Example: structural similarity among vertices
– SIM(V1, V2) = (1/3)*5 = 1.667
– SIM(V1, V3) = (1/4)*4 = 1.0
– SIM(V2, V3) = (1/4)*3 = 0.75
– SIM(V1, V4) = (1/4)*0 = 0.0 when only direct links are considered
• Transitive property for indirectly connected vertices
– SIM(V1, V4) = SIM(V1, V3) * SIM(V3, V4)
[1] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)
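
The formula itself is not preserved in the slide text, so the sketch below is only one reading that is consistent with the arithmetic of the example: the Jaccard fraction of shared neighbors, multiplied by the weight of the direct edge (0 when there is none). The graph `EDGES` is a hypothetical reconstruction chosen to reproduce the four example values; the edges and weights around V4, V5 and V6 in particular are assumptions.

```python
def neighbor_sets(weighted_edges):
    """Build the open neighborhood of every vertex of an undirected graph."""
    neighbors = {}
    for u, v in weighted_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def structural_similarity(weighted_edges, a, b):
    """Edge weight times the fraction of shared neighbors (assumed form)."""
    neighbors = neighbor_sets(weighted_edges)
    na, nb = neighbors.get(a, set()), neighbors.get(b, set())
    union = na | nb
    if not union:
        return 0.0
    # Weight of the direct edge a-b; 0 if the vertices are not adjacent.
    weight = weighted_edges.get((a, b), weighted_edges.get((b, a), 0.0))
    return weight * len(na & nb) / len(union)


# Hypothetical weighted graph (not the figure from the slide).
EDGES = {
    ("V1", "V2"): 5, ("V1", "V3"): 4, ("V2", "V3"): 3,
    ("V3", "V4"): 2, ("V4", "V5"): 1, ("V4", "V6"): 1,
}

for pair in [("V1", "V2"), ("V1", "V3"), ("V2", "V3"), ("V1", "V4")]:
    print(pair, round(structural_similarity(EDGES, *pair), 3))
```

With these assumed edges, V1 and V2 share one neighbor out of three, V1 and V3 one out of four, and V1 and V4 have no direct edge, which matches the fractions and weights in the example.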


Transitive Property
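
A minimal sketch of the transitive rule, assuming the similarity of two indirectly connected vertices is the product of the direct similarities along a shortest path. Measuring "shortest" by hop count, the function name `path_similarity`, and the sample values in `direct` are all assumptions for illustration; the slides do not specify them.

```python
from collections import deque


def path_similarity(direct, src, dst):
    """Multiply direct similarities along a fewest-hops path from src to dst."""
    adjacency = {}
    for (u, v), sim in direct.items():
        adjacency.setdefault(u, {})[v] = sim
        adjacency.setdefault(v, {})[u] = sim
    if src not in adjacency or dst not in adjacency:
        return 0.0
    # Breadth-first search records each vertex's predecessor.
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adjacency[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        return 0.0  # no path: the vertices are disconnected
    product = 1.0
    node = dst
    while parent[node] is not None:
        product *= adjacency[parent[node]][node]
        node = parent[node]
    return product


# Hypothetical direct similarities between adjacent vertices.
direct = {
    ("V1", "V2"): 1.667, ("V1", "V3"): 1.0,
    ("V2", "V3"): 0.75, ("V3", "V4"): 0.5,
}
print(path_similarity(direct, "V1", "V4"))  # SIM(V1,V3) * SIM(V3,V4)
```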




Phase 1 (cont…)

  Similarity Estimation (inspired by the Jaccard Index [1])
  Context of vertices (attribute regularity)
• Weighted fraction of shared attribute instances

• It is zero for contextually disjoint vertices
• Example: contextual similarity among vertices
– Let Wa1 = 1 and Wa2 = 2; then
– SIM(V1, V3) = (2/2) = 1.0
– SIM(V3, V4) = (1/2) = 0.5
– SIM(V1, V4) = 0.0

[1] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)
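
The contextual formula is also missing from the slide text. One reading that reproduces the three example values is: sum the weights of the attributes on which two vertices agree, then divide by the number of attributes. The sketch below follows that reading; the attribute values assigned to V1, V3 and V4 are hypothetical and were chosen only so that the example arithmetic comes out as stated.

```python
def contextual_similarity(attributes, attr_weights, a, b):
    """Weighted fraction of attributes on which a and b share a value (assumed form)."""
    shared_weight = sum(
        weight
        for name, weight in attr_weights.items()
        if name in attributes[a]
        and attributes[a][name] == attributes[b].get(name)
    )
    return shared_weight / len(attr_weights)


# Weights from the example: Wa1 = 1, Wa2 = 2.
ATTR_WEIGHTS = {"a1": 1, "a2": 2}

# Hypothetical attribute values (not given on the slide).
ATTRIBUTES = {
    "V1": {"a1": "p", "a2": "x"},
    "V3": {"a1": "q", "a2": "x"},
    "V4": {"a1": "q", "a2": "y"},
}

print(contextual_similarity(ATTRIBUTES, ATTR_WEIGHTS, "V1", "V3"))
print(contextual_similarity(ATTRIBUTES, ATTR_WEIGHTS, "V3", "V4"))
print(contextual_similarity(ATTRIBUTES, ATTR_WEIGHTS, "V1", "V4"))
```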


Collaborative Similarity Measure

 Structural

 Contextual

 Collaborative Measure


Phase 2

 Clustering (K-Medoid Approach)


Algorithm Details

Single Pass Similarity Calculation

Iterative
Node
Clustering


Example

Fig. 3. Scenarios for similarity between source (green) and destination (red) nodes following some intermediate nodes (yellow): (a) no direct path exists, (b) directly connected, (c) indirectly connected, shortest path.

Table 2. (a) Collaborative similarity CSim(v_a, v_b) among the vertices given in Fig. 3-c, using the Collaborative Similarity Measure. (b) Clustering results when varying the number of clusters (K); the quality of each result is calculated using Density and Entropy.

(a)
      V1    V2    V3    V4    V5    V6
V1    1     2.67  1.17  0.20  0.18  0.18
V2    2.67  1     0.92  0.15  0.14  0.14
V3    1.17  0.92  1     0.17  0.15  0.15
V4    0.20  0.15  0.17  1     0.92  0.92
V5    0.18  0.14  0.15  0.92  1     2.5
V6    0.18  0.14  0.15  0.92  2.5   1

(b)
K   Clustered Vertices           Density  Entropy
2   {V1,V2,V3},{V4,V5,V6}        0.42     0.133
3   {V1,V3},{V2},{V4,V5,V6}      0.28     0.084
4   {V5},{V6},{V4},{V1,V2,V3}    0.21     0.084
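
The K-Medoid phase can be sketched on the similarity values of Table 2 (a). This is a generic similarity-based k-medoids loop, not necessarily the authors' exact procedure: the choice of initial medoids (V1 and V4), the tie-breaking, and the rule that a medoid always stays in its own cluster are assumptions. The last rule is needed here because some off-diagonal entries exceed the diagonal value of 1.

```python
VERTICES = ["V1", "V2", "V3", "V4", "V5", "V6"]
CSIM = [  # values from Table 2 (a)
    [1.00, 2.67, 1.17, 0.20, 0.18, 0.18],
    [2.67, 1.00, 0.92, 0.15, 0.14, 0.14],
    [1.17, 0.92, 1.00, 0.17, 0.15, 0.15],
    [0.20, 0.15, 0.17, 1.00, 0.92, 0.92],
    [0.18, 0.14, 0.15, 0.92, 1.00, 2.50],
    [0.18, 0.14, 0.15, 0.92, 2.50, 1.00],
]


def assign(sim, medoids):
    """Label each vertex with the index of its most similar medoid."""
    labels = []
    for v in range(len(sim)):
        if v in medoids:
            labels.append(medoids.index(v))  # a medoid stays in its own cluster
        else:
            labels.append(max(range(len(medoids)),
                              key=lambda c: sim[v][medoids[c]]))
    return labels


def k_medoids(sim, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(sim, medoids)
        new_medoids = []
        for c in range(len(medoids)):
            members = [v for v, label in enumerate(labels) if label == c]
            # New medoid: the member most similar to the rest of its cluster.
            new_medoids.append(max(
                members,
                key=lambda m: sum(sim[m][u] for u in members if u != m)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(sim, medoids), medoids


labels, medoids = k_medoids(CSIM, initial_medoids=[0, 3])
clusters = [[VERTICES[v] for v, label in enumerate(labels) if label == c]
            for c in range(len(medoids))]
print(clusters)
```

With these starting medoids the loop groups V1, V2, V3 together and V4, V5, V6 together, which agrees with the K = 2 row of Table 2 (b).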


Experiments

  Real Dataset
  Political Blogs Dataset: 1490 vertices, 19090 edges, one
attribute (political leaning)
• Liberal
• Conservative

  Methods
  K-SNAP: Attributes only
  S-Cluster: Structure-based clustering
  W-Cluster: Weighted random walk strategy
  SA-Cluster: Considers both structure and attributes (matrix manipulation)
  IGC-CSM: Our proposed method


Evaluation Metrics

 Density*: intra-cluster structural cohesiveness

 Entropy*: intra-cluster attribute homogeneity

*Yang Zhou et al., Graph Clustering Based on Structural/Attribute Similarities, Proceedings of the VLDB Endowment, France (2009)
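
The metric formulas are not preserved in the slide text. The sketch below follows the usual definitions from the cited work as best recalled here: density is the fraction of edges whose endpoints fall in the same cluster, and entropy is the cluster-size-weighted Shannon entropy of an attribute within each cluster. It handles a single attribute only, and the toy graph, labels and attribute values are hypothetical.

```python
from collections import Counter
from math import log2


def density(edges, labels):
    """Fraction of edges with both endpoints in the same cluster."""
    inside = sum(1 for u, v in edges if labels[u] == labels[v])
    return inside / len(edges)


def entropy(attribute, labels):
    """Size-weighted entropy of one attribute over all clusters (lower is better)."""
    total = 0.0
    for cluster in set(labels.values()):
        members = [v for v, label in labels.items() if label == cluster]
        counts = Counter(attribute[v] for v in members)
        cluster_entropy = 0.0
        for count in counts.values():
            p = count / len(members)
            cluster_entropy -= p * log2(p)
        total += len(members) / len(labels) * cluster_entropy
    return total


# Hypothetical toy graph with two clusters and one attribute.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
leaning = {0: "L", 1: "L", 2: "L", 3: "C", 4: "C", 5: "L"}

print(round(density(edges, labels), 3))
print(round(entropy(leaning, labels), 3))
```

Five of the six toy edges are intra-cluster, and only the second cluster mixes attribute values, so density is high while entropy stays well below 1.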


Evaluation Metrics (cont…)

  F-Measure*: evaluates the overall (collective) quality
of the formed clusters

*Tijn Witsenburg et al., Improving the Accuracy of Similarity Measures by Using Link Information, International Symposium on
Methodologies for Intelligent Systems Edition 9, Poland (2011)


Results (Time Complexity)

  Synthetic Dataset
  Varying No. of Nodes (graph size vs. time)

  Real Dataset
  Political Blog*
  No. of Clusters vs. Time
*http://www-personal.umich.edu/~mejn/netdata


Results (Quality)

 Density Evaluation
 Clusters vs. Density Value

 Entropy Evaluation
 Clusters vs. Entropy Value


Results (Quality)

 F-Measure Estimation
 Clusters vs. F-measure Value


Conclusion

  We study the problem of clustering graph nodes based
on homogeneous characteristics in terms of context
and topology
  A collaborative similarity measure reflects the relationship
between each pair of vertices
  A k-Medoid clustering framework is adopted for grouping
similar nodes
  The resulting solution is evaluated using state-of-the-art
evaluation measures:
  Density, Entropy, and F-measure
  Comparatively scalable to medium-scale graphs
without compromising the quality of results

Thanks
Any Questions?
wicky786@khu.ac.kr
yklee@khu.ac.kr
sylee@oslab.khu.ac.kr

