This document presents a methodology for analyzing community structure across different community detection algorithms. It introduces the idea of identifying "structural classes" of communities based on the outputs of various algorithms. Feature vectors are used to characterize example communities from each class. A classifier is then trained to assess separability between classes based on how well it can predict the class of new examples. This framework reveals that annotated real communities have a distinct structure compared to algorithm outputs and that a small number of features explain similarities between algorithms. The methodology provides a way to systematically compare algorithms, understand their biases, and select suitable ones for specific applications.
5. Which community is real?
[Figure: six communities from the same network, labeled Metis, “Real” Community, Random Walk, Infomap, Newman-Modularity, and Louvain]
6. How do their structures differ?
[Figure: six communities from the same network, labeled Metis, “Real” Community, Random Walk, Infomap, Newman-Modularity, and Louvain]
7. The definition of community structure
• Community structure is not well defined
– different people have different notions of community structure
• Traditional strategy
1. start with an expectation of what a community should look like
• e.g., a set of nodes that interact more within the set than with the outside
2. define an optimization problem
3. design a heuristic
4. the solution gives the desired communities
8. Two research questions that we address here
• A multitude of different algorithms
– different objective functions
– different heuristics
How dissimilar are their outputs?
• Communities may differ from the proposed mathematical constructs
– e.g., preponderance of links to the outside, as in this figure, contrary to widely accepted notions
Which algorithms extract communities that most closely resemble the structure of real communities?
10. Obstacles to answering the preceding questions
• We don't know what properties communities possess
• We can't characterize communities in the absence of negative examples
– Look at real communities and determine their structure
– do other sets that are not communities have these properties?
– every other connected set could be a negative example - intractable
– sets that are not annotated could also be communities
• We don't know what metrics we should use
– modularity, conductance, clustering coefficient...
11. Our plan to address the preceding questions
• Propose a methodology to analyze community structure by using different notions of communities as references
– key idea: analyze community structure without requiring negative examples of communities
• Scalable and comprehensive, simultaneously considering
– multiple notions of communities
– diverse domains of application
– a broad spectrum of community metrics
• Assess the structural dissimilarities between
– the output of different community detection algorithms
– the output of algorithms and real communities
15. Building structural classes by extracting examples
Algorithm 1 → Class 1
Algorithm 2 → Class 2
Algorithm 3 → Class 3
Algorithm 4 → Class 4
Algorithm k → Class k
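As an illustration of this step, the sketch below turns the communities returned by each algorithm into labelled feature vectors, one structural class per algorithm. It is a minimal sketch under assumptions: the graph is an undirected adjacency dictionary, the three features (size, internal density, conductance) are illustrative stand-ins for the full set of structural properties, and the names `community_features` and `build_structural_classes` are hypothetical.

```python
def community_features(adj, nodes):
    """Feature vector [size, internal density, conductance] for one community.

    adj: undirected graph as {node: set of neighbours}; nodes: the community.
    The three features are illustrative, not the full feature set of the study.
    """
    members = set(nodes)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal = sum(1 for u in members for v in adj[u] if v in members) // 2
    boundary = sum(1 for u in members for v in adj[u] if v not in members)
    size = len(members)
    possible = size * (size - 1) / 2
    density = internal / possible if possible else 0.0
    volume = 2 * internal + boundary  # sum of the members' degrees
    total_volume = sum(len(neigh) for neigh in adj.values())
    smaller_side = min(volume, total_volume - volume)
    conductance = boundary / smaller_side if smaller_side else 0.0
    return [size, density, conductance]


def build_structural_classes(adj, outputs):
    """outputs: {class label: list of communities}. Returns (X, y)."""
    X, y = [], []
    for label, communities in outputs.items():
        for nodes in communities:
            X.append(community_features(adj, nodes))
            y.append(label)
    return X, y


# Toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
# Hypothetical outputs of two algorithms on that graph.
outputs = {
    "Algorithm 1": [{0, 1, 2}, {3, 4, 5}],
    "Algorithm 2": [{2, 3}],
}
X, y = build_structural_classes(adj, outputs)
```

Each row of `X` is one example community and the matching entry of `y` names the structural class it belongs to, which is the input format a separability measure or classifier would need.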
23. Measure inter-class separability from feature space
Are the classes separable?
[Figure: classes plotted in feature space and passed to a class separability measure]
Separability = Distinct structures
24. We test our methods on large-scale network datasets
• Social + Rice University
• Commercial
• Biological
Facebook+Rice with permission of Mislove et al. Other datasets publicly available.
25. We furnish our framework with 10 community detection algorithms
• BFS (Random connected subgraphs)
• Random-Walk-based (with and without restart)
• (α,β)-communities
• InfoMap
• Markov Clustering
• Metis
• Louvain
• Newman-Clauset-Moore
• Link Communities
33. First separability measure: Scatter Matrices
• Traditional methods for measuring class separability give a single score, e.g., scatter matrices
[Figure: scatter-matrix separability score for each network. Reference: the same data with shuffled labels]
• This is a global measure. We need more fine-grained separability information for each class!
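One common scatter-based score is the ratio of between-class to within-class scatter, J = tr(S_b) / tr(S_w). The sketch below computes it together with a shuffled-label reference of the kind mentioned on the slide. It is a sketch under assumptions: the slide does not say which scatter-matrix criterion was used, and the data and function names are hypothetical.

```python
import random


def scatter_ratio(X, y):
    """Return tr(S_b) / tr(S_w) for feature rows X with class labels y."""
    n, d = len(X), len(X[0])
    overall = [sum(row[j] for row in X) / n for j in range(d)]
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    between = within = 0.0
    for rows in by_class.values():
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        # tr(S_b): size-weighted squared distance from class mean to overall mean.
        between += len(rows) * sum((mean[j] - overall[j]) ** 2 for j in range(d))
        # tr(S_w): squared distances from each example to its own class mean.
        within += sum((r[j] - mean[j]) ** 2 for r in rows for j in range(d))
    return between / within


def shuffled_reference(X, y, seed=0):
    """Score of the same data after randomly permuting the labels."""
    labels = list(y)
    random.Random(seed).shuffle(labels)
    return scatter_ratio(X, labels)


# Hypothetical 2-D features: two well-separated classes of 10 examples each.
X = [[0.01 * i, 0.0] for i in range(10)] + [[1.0 + 0.01 * i, 1.0] for i in range(10)]
y = ["Class 1"] * 10 + ["Class 2"] * 10
score = scatter_ratio(X, y)
reference = shuffled_reference(X, y)
```

A score far above the shuffled-label reference suggests the classes are separable, but it remains a single global number and says nothing about which classes are confused with which.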
34. Idea: use the performance of probabilistic multi-class classifiers
Train a probabilistic k-way classifier (SVM, k-NN) on examples from Algorithm 1, Algorithm 2, and the annotated communities
38. Idea: use the performance of probabilistic multi-class classifiers
Classify held-out examples (cross-validation) with the probabilistic k-way classifier (SVM, k-NN), which outputs a probability for each class:
Pr(Algorithm 1) = 0.05
Pr(Algorithm 2) = 0.08
...
Pr(Annotated) = 0.48
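The idea can be sketched with a small k-NN classifier, one of the two classifier types named on the slide. The code below is a minimal sketch, not the authors' implementation: it uses leave-one-out cross-validation, estimates class probabilities as the share of each class among the k nearest neighbours, and runs on made-up one-dimensional features. All function names are hypothetical.

```python
from collections import Counter
import math


def knn_probabilities(train_X, train_y, x, k=3):
    """Probability of each class for x: its share among the k nearest training examples."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


def cross_validation_profile(X, y, k=3):
    """Leave-one-out: average the probability vectors of the held-out examples of each class."""
    classes = sorted(set(y))
    counts = Counter(y)
    profile = {true: dict.fromkeys(classes, 0.0) for true in classes}
    for i, (x, true) in enumerate(zip(X, y)):
        # Train on everything except example i, then score example i.
        probs = knn_probabilities(X[:i] + X[i + 1:], y[:i] + y[i + 1:], x, k)
        for label, p in probs.items():
            profile[true][label] += p / counts[true]
    return profile


# Hypothetical 1-D features: two interleaved algorithm classes and a distant annotated class.
X = [[0.0], [0.2], [0.4], [0.6], [0.1], [0.3], [0.5], [0.7], [5.0], [5.1], [5.2], [5.3]]
y = ["Algorithm 1"] * 4 + ["Algorithm 2"] * 4 + ["Annotated"] * 4
profile = cross_validation_profile(X, y)
```

Each row of `profile` is the average probability vector for one true class. A row concentrated on its own class indicates a separable class, while mass spread over another class indicates that the two are structurally hard to tell apart.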
39. Cross-validation performance indicates class separability
[Figure: cross-validation outcome, on a 0.0 to 1.0 axis, for each structural class: BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.]
Probabilistic-SVM cross-validation outcome with 11 structural classes. Data: DBLP network.
40. Matching annotated communities
Which algorithms extract communities that most closely resemble the structure of annotated communities?
41. Repeat the preceding experiment, leaving out the class of annotated communities
Learn a probabilistic k-way classifier from examples of Algorithm 1, Algorithm 2, ..., Algorithm N
45. Introduce the class of annotated communities in the test set
Classify the annotated communities with the probabilistic k-way classifier:
Pr(Algorithm 1) = 0.02
Pr(Algorithm 2) = 0.19
...
Pr(Algorithm k) = 0.12
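This experiment can be sketched as follows: train only on the algorithm classes, then average the class probabilities that the classifier assigns to the annotated communities. The code is a sketch under assumptions, using a k-NN probability estimate and made-up two-dimensional features. The function names are hypothetical and the numbers carry no empirical meaning.

```python
from collections import Counter
import math


def knn_probabilities(train_X, train_y, x, k=3):
    """Probability of each class for x: its share among the k nearest training examples."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


def resemblance_profile(train_X, train_y, annotated_X, k=3):
    """Average, over the annotated communities, the probability of each algorithm class."""
    profile = dict.fromkeys(sorted(set(train_y)), 0.0)
    for x in annotated_X:
        for label, p in knn_probabilities(train_X, train_y, x, k).items():
            profile[label] += p / len(annotated_X)
    return profile


# Hypothetical feature vectors for two algorithm classes (no annotated class in training).
train_X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
train_y = ["Algorithm 1"] * 3 + ["Algorithm 2"] * 3
# Hypothetical annotated communities, introduced only at test time.
annotated_X = [[0.05, 0.05], [0.2, 0.1], [0.9, 1.0], [0.1, 0.2]]
profile = resemblance_profile(train_X, train_y, annotated_X)
closest = max(profile, key=profile.get)
```

Because the classifier has never seen the annotated class, it must assign each annotated community to the algorithm class it most resembles, so `closest` names the algorithm whose output is structurally nearest in this toy data.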
46. Classification reveals that annotated communities resemble unstructured methods
[Figure: for each network (grad, Ugrad, SC, HS, Fly, Amazon, DBLP, LJ1, LJ2), the classification outcome per structural class (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis), on a 0.0 to 1.0 axis]
Probabilistic-SVM classification of annotated communities into 11 structural classes for 9 different networks.
48. The classifier confuses the two types of RW communities
[Figure: cross-validation outcome, on a 0.0 to 1.0 axis, for each structural class: BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.]
Probabilistic-SVM cross-validation outcome with 11 structural classes. Data: DBLP network.
49. Fisher’s discriminant ratio
A Separability Framework for Analyzing Community Structure, Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg,
To appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013
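For reference, the textbook two-class form of Fisher's discriminant ratio for a single feature is (μ1 − μ2)² / (σ1² + σ2²), where μ and σ² are the class means and variances of that feature. The sketch below computes this form on made-up values. It is an assumption that this is the variant meant here, since the slide gives only the name and the cited paper may use a multi-class formulation.

```python
from statistics import fmean, pvariance


def fisher_ratio(a, b):
    """Two-class Fisher discriminant ratio of one feature.

    (mean_a - mean_b)^2 / (var_a + var_b), using population variances.
    """
    return (fmean(a) - fmean(b)) ** 2 / (pvariance(a) + pvariance(b))


# Hypothetical values of one feature in two structural classes.
class_a = [0.1, 0.2, 0.3]
class_b = [0.5, 0.6, 0.7]
ratio = fisher_ratio(class_a, class_b)
```

A larger ratio means the two classes are further apart on that feature relative to their spread, so the feature is more useful for telling them apart.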
55. Can we reveal latent similarities among community detection algorithms?
Our framework enables one to cluster algorithms that behave similarly
56. Step 1: identifying the most important features
7 features out of 36 retain the discriminative power of the full set
57. Grouping algorithms by their tendencies with respect to most discriminative features
[Figure: algorithms grouped by High, Medium, or Low tendency on each of the most discriminative features]
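One way to picture this grouping is to give every algorithm a High, Medium, or Low level per feature and then group the algorithms that share the same combination of levels. The sketch below does this by rank. It is an assumption-laden illustration: the slides do not state how the levels were assigned, and the feature names, mean values, and function names are hypothetical.

```python
def tendency_levels(class_means):
    """Map each algorithm to Low/Medium/High by the rank of its mean value for one feature."""
    ranked = sorted(class_means, key=class_means.get)
    n = len(ranked)
    names = ("Low", "Medium", "High")
    # Split the ranking into three bins of (nearly) equal size.
    return {algo: names[min(3 * rank // n, 2)] for rank, algo in enumerate(ranked)}


def group_by_signature(means_by_feature):
    """means_by_feature: {feature: {algorithm: mean value}}.

    Returns {tuple of levels (features in sorted order): [algorithms]}.
    """
    features = sorted(means_by_feature)
    levels = {f: tendency_levels(means_by_feature[f]) for f in features}
    groups = {}
    for algo in means_by_feature[features[0]]:
        signature = tuple(levels[f][algo] for f in features)
        groups.setdefault(signature, []).append(algo)
    return groups


# Hypothetical class means of two features for six algorithms.
means_by_feature = {
    "conductance": {"Algorithm 1": 0.10, "Algorithm 2": 0.12, "Algorithm 3": 0.40,
                    "Algorithm 4": 0.45, "Algorithm 5": 0.80, "Algorithm 6": 0.85},
    "density": {"Algorithm 1": 0.90, "Algorithm 2": 0.85, "Algorithm 3": 0.50,
                "Algorithm 4": 0.20, "Algorithm 5": 0.15, "Algorithm 6": 0.55},
}
groups = group_by_signature(means_by_feature)
```

In this toy data, Algorithm 1 and Algorithm 2 share the signature (Low conductance, High density) and therefore land in the same group.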
59. Conclusion of methodology
• We present a methodology to address the complexity of analyzing community structure, which simultaneously considers
– a large number of algorithms
– multiple domains of application
– a broad spectrum of metrics to characterize community structure
• A scalable framework that enables
– researchers to compare and understand biases of new and existing community detection algorithms
– practitioners to decide on the most suitable algorithm for a particular purpose and network
60. Conclusion of experimental analysis
• Our experimental analysis, which includes 10 community detection algorithms and 9 different networks analyzed with 36 properties, reveals
– High variability among the output of community detection methods
– Annotated communities have a distinct structure from what we expect
• their structure is closer to the output of baseline procedures than to that of popular algorithms
– A small set of features explains the biases produced by different algorithms
– We can organize the tapestry of available community detection algorithms by grouping them with respect to similarities in behavior
61. Final remarks on future directions
• Traditional methods are unsupervised
– they find a particular type of community
– little sensitivity to different purposes, structures of interest and domains of application
• Our approach suggests a supervised approach to community detection
– user specifies what they intend to find through examples (real or synthetic)
– algorithm learns from those examples and retrieves similar structures in the network
62. Thank you!
On the Separability of
Structural Classes of Communities
Bruno Abrahao
Sucheta Soundarajan
John Hopcroft
Robert Kleinberg
Cornell University
Editor's notes
Community structure captures the tendency of entities in a network to group together in meaningful subsets whose members have a distinctive relationship to one another. The identification of these subsets allows for the analysis of networks at different levels of detail, which is instrumental in illuminating the structure underlying large-scale systems.
In our KDD-madness video we showed you six communities from a given network, one real and the other five extracted using community detection algorithms. We asked you if you could tell which one is real. Now I’m going to give you the answer. This is the real one, and here are the names of the procedures that generated the other five.
The second question we asked is “what is the difference among the structures of these communities?”. This is the question that motivates our investigation.
Given the diverse nature of networks, the notion of meaningful communities is necessarily context dependent, involving interpretations and expectations of domain experts. Therefore, many attempts to define communities are grounded in the notion of mathematical optimization. Starting with an a priori expectation about what a community should look like, researchers specify an objective function for a search method, whose solution for a given instance provides the desired communities. This process has given rise to a large collection of community detection algorithms, which aim at optimizing various objective functions.
Communities in real networks often emerge as a result of multiple driving forces that make up the underlying complex system. Therefore, the attempt to capture community structure by maximizing a given objective function may represent an unrealistic expectation. As a consequence, communities identified by methods that reflect mathematical constructs may differ structurally from real communities that arise in practice.
There is no established consensus on the question of what properties distinguish subgraphs that are communities from those that are not communities. While we can examine examples of community structure, e.g., by asking experts to identify communities in a given domain, finding negative examples of community structure is a challenging task. Any other subset of nodes in the network could potentially be a negative example. In large networks, exhaustively enumerating all forms of negative examples is computationally intractable. Moreover, even if we could enumerate all possible negative examples, we would still face the possibility that these seemingly negative examples are valid communities that the expert simply did not identify.
In this paper, we present a framework to tackle these challenges through a comprehensive analysis of community properties. By using different notions of communities as references, our methodology enables the characterization of community structure without requiring the identification of negative examples.
Our method presents a scalable framework that enables researchers to understand biases and to assess the structural dissimilarity among the outputs of new and existing community detection algorithms, and between the output of algorithms and communities that arise in practice. The framework also serves as a tool for practitioners to decide on the most suitable algorithm for a particular purpose and network. In addition, our method provides us with a way to organize the tapestry of community structure. Given the availability of a collection containing numerous algorithms in the literature, we can group those that produce similar structures and separate those that produce fundamentally different structures. Finally, we are able to identify which graph-theoretical properties of a subgraph are the most discriminative of community signature and which properties the different community detection algorithms load their biases on.
We frame our approach as a class separability problem, which simultaneously handles many classes of communities and a diverse set of structural properties. To this end, we specify a learning problem in which we map the distinct communities into a feature space, where the dimensions represent measures that characterize a community's link structure. The separability of classes provides information on the extent to which different communities come from the same (or fundamentally different) distributions of feature values.
The separability of these classes demonstrates the extent to which different algorithms output structurally distinguishable subgraphs. A feature selection analysis can then be employed to highlight the properties that exhibit the highest degree of inter-class variability, thereby making explicit the structural bias produced by different algorithms.
The separability of the class comprising annotated communities from the classes of intrinsically-defined communities determines the extent to which community detection algorithms succeed in extracting subgraphs that are structurally comparable to the communities formed by nodes sharing extrinsic properties in common.
We also analyze community structure from a diverse collection of large-scale real networks whose domains span biology, on-line shopping, and social systems.
We furnish our framework with a large set of structural properties and ten different community detection procedures to produce examples of different structural classes. Our selection is representative of categories of popular algorithms available in the literature. We define the first set of communities by properties intrinsic to their link structure. For our purposes, these are the sets that community detection algorithms may output. Each class of intrinsically defined communities comprises a set of examples that a specific algorithm extracts.
We also define communities by the context, the dynamics, or the function associated with the networks, but extrinsic to the link structure. We identify these communities through meaningful annotations provided with the datasets, such as explicit declaration of membership, product categories, grouping by protein function, and so on. In this fashion, for each network, we form a class of extrinsically-defined communities, henceforth called annotated communities. These communities enable a large-scale rigorous analysis of community detection methods.
Use the performance of existing probabilistic classifiers as a measure of separability.
Finally, our method provides us with a way to organize the tapestry of community structure. Given the availability in the literature of a collection containing numerous algorithms, we can group those that produce similar structures and separate those that produce fundamentally different structures.
The first step is to identify which properties of a subgraph are the most discriminative of community signature and which properties the different community detection algorithms load their biases on most heavily.