On the Separability of
Structural Classes of Communities
Bruno Abrahao
Sucheta Soundarajan
John Hopcroft
Robert Kleinberg
Cornell University
                          1
The idea of community structure as distinctive relationships




                                              [Newman-Girvan, 2004]

                                 2
Which community is real?




  [Figure: six candidate communities, labeled Metis, “Real” Community, Random Walk, Infomap, Newman-Modularity, Louvain]
                            3
How do their structures differ?




  [Figure: six candidate communities, labeled Metis, “Real” Community, Random Walk, Infomap, Newman-Modularity, Louvain]
                            3
The definition of community structure

• Community structure is not well defined
  – different people have different notions of community structure


• Traditional strategy
  1. start with an expectation of what a community should look like
     • e.g., a set of nodes that interact more within the set than with the outside
  2. define an optimization problem
  3. design a heuristic
  4. the solution gives the desired communities




                                                     4
Two research questions that we address here

• A multitude of different algorithms
  – different objective functions
  – different heuristics
  How dissimilar are their outputs?


• Communities may differ from the
  proposed mathematical constructs
  – e.g., preponderance of links to the outside, as in this
    figure, contrary to widely accepted notions
  Which algorithms extract communities that
  most closely resemble the structure of real
  communities?




                                              5
Obstacles to answering the preceding questions

• We don't know what properties communities possess

• We can't characterize communities in the absence of negative
  examples
  – look at real communities and determine their structure
  – do other sets that are not communities have these properties?
  – every other connected set could be a negative example, which is intractable
  – sets that are not annotated could also be communities

• We don't know what metrics we should use
  – modularity, conductance, clustering coefficient...




                                             6
Our plan to address the preceding questions

• Propose a methodology to analyze community structure by
  using different notions of communities as references

  – key idea: analyze community structure without requiring negative examples of
    communities


• Scalable and comprehensive, simultaneously considering
  – multiple notions of communities
  – diverse domains of application
  – a broad spectrum of community metrics


• Assess the structural dissimilarities between
  – the output of different community detection algorithms
  – the output of algorithms and real communities


                                           7
Building structural classes by extracting examples




  [Diagram: apply the algorithm to the network, then extract community examples]




                             8
Building structural classes by extracting examples

      Algorithm 1  →  Class 1
      Algorithm 2  →  Class 2
      Algorithm 3  →  Class 3
      Algorithm 4  →  Class 4
      Algorithm k  →  Class k


                             9
Building a feature space by characterizing examples




 [Diagram: each labeled example is mapped to a feature vector]
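
A minimal sketch of how one labeled example might be turned into a feature vector. The three features chosen here (size, internal density, conductance) and the name `community_features` are assumptions made for illustration: the deck mentions conductance among the candidate metrics, but it does not list the 36 properties it actually uses.

```python
def community_features(adj, community):
    """Return [size, internal density, conductance] for one community.

    adj maps each node to the set of its neighbours (undirected graph,
    no self-loops); community is an iterable of nodes.
    """
    members = set(community)
    size = len(members)
    internal_ends = 0  # edge endpoints whose other end is inside the set
    boundary = 0       # edges that leave the set
    for u in members:
        for v in adj[u]:
            if v in members:
                internal_ends += 1
            else:
                boundary += 1
    internal_edges = internal_ends // 2  # each internal edge was seen twice
    possible = size * (size - 1) / 2
    density = internal_edges / possible if possible else 0.0
    volume_in = internal_ends + boundary  # sum of member degrees
    volume_all = sum(len(nbrs) for nbrs in adj.values())
    smaller = min(volume_in, volume_all - volume_in)
    conductance = boundary / smaller if smaller else 0.0
    return [size, density, conductance]


# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
features = community_features(adj, [0, 1, 2])
print(features)
```

Each community extracted by an algorithm, and each annotated community, would yield one such vector, labeled with the class it came from.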



                              10
Building a feature space by characterizing examples




                                    Feature Space
                             11
Measure inter-class separability from feature space




  [Diagram: Feature Space → Class Separability Measure → Are the classes separable?]

  Separability = Distinct structures




                              12
We test our methods on large-scale network datasets



   • Social                                                   + Rice University




   • Commercial



   • Biological



Facebook+Rice with permission of Mislove et al. Other datasets are publicly available.



                                                              13
We furnish our framework with 10 community detection algorithms


  •   BFS (Random connected subgraphs)
  •   Random-Walk-based (with and without restart)
  •   (α,β)-communities
  •   InfoMap
  •   Markov Clustering
  •   Metis
  •   Louvain
  •   Newman-Clauset-Moore
  •   Link Communities




                                   14
Annotated communities identify exemplar communities
 Metadata included in the datasets



                                     + Rice University




                                     15
To what extent are the classes separable?




                     16
First separability measure: Scatter Matrices


• Traditional methods for measuring class separability give a
  single score, e.g., scatter matrices

       Network




        Reference: the same data with shuffled labels


• This is a global measure. We need more fine-grained
  separability information for each class!
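
A rough sketch of a single-score separability measure built from scatter matrices, compared against the shuffled-label reference mentioned above. The criterion used here, trace(S_b) / trace(S_w), is one common choice and is an assumption; the slide does not say which scatter-matrix criterion the authors use. The data points and the names `scatter_score` and `mean_vector` are made up for illustration.

```python
import random


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter_score(X, y):
    """trace(S_b) / trace(S_w): between-class over within-class scatter."""
    overall = mean_vector(X)
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    within = between = 0.0
    for rows in by_class.values():
        mu = mean_vector(rows)
        within += sum(sq_dist(x, mu) for x in rows)       # trace of S_w
        between += len(rows) * sq_dist(mu, overall)       # trace of S_b
    return between / within if within else float("inf")


# Synthetic feature vectors for two well-separated classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = ["A", "A", "A", "B", "B", "B"]

# Reference: the same data with shuffled labels.
shuffled = y[:]
random.Random(0).shuffle(shuffled)

print("true labels:    ", scatter_score(X, y))
print("shuffled labels:", scatter_score(X, shuffled))
```

A score well above the shuffled-label reference suggests the classes occupy distinct regions of the feature space, but it does not say which classes are responsible.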



                                           17
Idea: use the performance of probabilistic multi-class classifiers




  [Diagram: train a probabilistic k-way classifier (SVM, k-NN) on examples labeled Algorithm 1, Algorithm 2, ..., Annotated communities]



                                    18
Idea: use the performance of probabilistic multi-class classifiers



  [Diagram: classify held-out examples (cross-validation) with the probabilistic k-way classifier (SVM, k-NN); the output is one probability per class]

      Pr(Algorithm 1) = 0.05
      Pr(Algorithm 2) = 0.08
      ...
      Pr(Annotated) = 0.48
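
The idea can be sketched with a small probabilistic k-NN classifier, one of the two classifier families named on the slide. Everything below is an assumption for illustration: the feature vectors are synthetic, leave-one-out stands in for whatever cross-validation scheme the authors used, and the names `knn_proba` and `leave_one_out` are hypothetical.

```python
from collections import Counter, defaultdict


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_proba(train_X, train_y, x, k=3):
    """Pr(class | x) as the share of each class among the k nearest neighbours."""
    nearest = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return {label: votes[label] / len(nearest) for label in sorted(set(train_y))}


def leave_one_out(X, y, k=3):
    """For each true class, average the probabilities given to its held-out examples."""
    totals = defaultdict(Counter)
    counts = Counter(y)
    for i, (x, label) in enumerate(zip(X, y)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        for predicted, p in knn_proba(rest_X, rest_y, x, k).items():
            totals[label][predicted] += p
    return {
        label: {c: totals[label][c] / counts[label] for c in sorted(counts)}
        for label in sorted(counts)
    }


# Synthetic feature vectors: three tight, well-separated classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 0], [10, 1], [11, 0], [11, 1],
     [0, 10], [0, 11], [1, 10], [1, 11]]
y = ["Algorithm 1"] * 4 + ["Algorithm 2"] * 4 + ["Annotated"] * 4

result = leave_one_out(X, y)
for label, row in result.items():
    print(label, row)
```

Each row is the average probability vector for one true class. A row that puts most of its mass on its own class indicates a separable class; mass spread over other classes indicates structural overlap with them.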




                                    19
Cross-validation performance indicates class separability

  [Figure: rows are structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.); legend lists the same classes; horizontal axis from 0.0 to 1.0]

         Probabilistic-SVM cross-validation outcome with 11 structural classes.
         Data: DBLP network.
                                                    20
Matching annotated communities




 Which algorithms extract communities that most
 closely resemble the structure of annotated
 communities?




                          21
Repeat the preceding experiment, leaving out the class of annotated communities




  [Diagram: learn a probabilistic k-way classifier from examples labeled Algorithm 1, Algorithm 2, ..., Algorithm N]




                                             22
Introduce the class of annotated communities in the test set



  [Diagram: classify annotated communities with the probabilistic k-way classifier; the output is one probability per algorithm class]

      Pr(Algorithm 1) = 0.02
      Pr(Algorithm 2) = 0.19
      ...
      Pr(Algorithm k) = 0.12
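
A small sketch of this second experiment: the classifier is trained only on algorithm classes, and the annotated communities appear only at test time. The k-NN classifier, the synthetic feature vectors, and the names `knn_proba` and `mean_proba` are assumptions for illustration, not the authors' setup.

```python
from collections import Counter


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_proba(train_X, train_y, x, k=3):
    """Pr(class | x) as the share of each class among the k nearest neighbours."""
    nearest = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return {label: votes[label] / len(nearest) for label in sorted(set(train_y))}


def mean_proba(train_X, train_y, test_X, k=3):
    """Average the class probabilities assigned to a set of unseen examples."""
    totals = {label: 0.0 for label in sorted(set(train_y))}
    for x in test_X:
        for label, p in knn_proba(train_X, train_y, x, k).items():
            totals[label] += p
    return {label: total / len(test_X) for label, total in totals.items()}


# Training set: algorithm classes only (synthetic feature vectors).
train_X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
train_y = ["Algorithm 1"] * 3 + ["Algorithm 2"] * 3

# Test set: annotated communities, a class the classifier has never seen.
annotated_X = [[1, 1], [2, 1], [1, 2], [9, 9]]

resemblance = mean_proba(train_X, train_y, annotated_X)
print(resemblance)
```

Because the annotated class is absent from training, the classifier is forced to say which algorithm's output each annotated community most resembles; the averaged probabilities summarize that resemblance per algorithm.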




                                  23
Classification reveals that annotated resemble unstructured methods



  [Figure: rows are networks (grad, Ugrad, SC, HS, Fly, Amazon, DBLP, LJ1, LJ2); legend lists structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis); horizontal axis from 0.0 to 1.0]

  Probabilistic-SVM classification of annotated communities into 11
  structural classes for 9 different networks.

                                        24
Improving the quality of the space




  What classes should we consider?




                             25
The classifier confuses the two types of RW communities

  [Figure: rows are structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.); legend lists the same classes; horizontal axis from 0.0 to 1.0]

         Probabilistic-SVM cross-validation outcome with 11 structural classes.
         Data: DBLP network.
                                                    26
Fisher’s discriminant ratio




 A Separability Framework for Analyzing Community Structure, Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg,
To appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013
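
The formula on this slide did not survive extraction. As a hedged illustration only, the sketch below uses the textbook two-class, single-feature form of Fisher's discriminant ratio, (mean_a - mean_b)^2 / (var_a + var_b); whether the authors use this form or a multi-class variant is an assumption, and the cited paper is the authority. The data and the names `fisher_ratio` and `rank_features` are hypothetical.

```python
from statistics import mean, pvariance


def fisher_ratio(a, b):
    """(mean_a - mean_b)^2 / (var_a + var_b) for one feature and two classes."""
    spread = pvariance(a) + pvariance(b)
    gap = (mean(a) - mean(b)) ** 2
    if spread == 0:
        return float("inf") if gap else 0.0
    return gap / spread


def rank_features(class_a, class_b):
    """Feature indices sorted from most to least discriminative, plus the scores."""
    n_features = len(class_a[0])
    scores = [
        fisher_ratio([row[j] for row in class_a], [row[j] for row in class_b])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Synthetic examples: feature 0 separates the classes, feature 1 does not.
class_a = [[1.0, 1.0], [2.0, 5.0], [3.0, 9.0]]
class_b = [[7.0, 2.0], [8.0, 5.0], [9.0, 8.0]]

order, scores = rank_features(class_a, class_b)
print(order, scores)
```

A low ratio on every feature for a pair of classes would suggest the pair is hard to tell apart, while a high ratio on a few features points to the features that carry the discriminative power.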




                                                                         27
Cross-validation performance indicates class separability

  [Figure: rows are structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.); legend lists the same classes; horizontal axis from 0.0 to 1.0]

         Probabilistic-SVM cross-validation outcome with 11 structural classes.
         Data: DBLP network.
                                                    28
Classification reveals that annotated resemble unstructured methods



[Figure: rows are networks (grad, Ugrad, SC, HS, Fly, Amazon, DBLP, LJ1, LJ2); legend lists structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis); horizontal axis from 0.0 to 1.0]

Probabilistic-SVM classification of annotated communities into 11
structural classes for 9 different networks.
                                     30
Can we reveal latent similarities among
community detection algorithms?

Our framework enables one to cluster algorithms that behave
similarly




                              32
Step 1: identifying the most important features




 7 features out of 36 retain the discriminative power of the full set




                                     33
Grouping algorithms by their tendencies
with respect to most discriminative features

                                               High




                                               Medium




                                               Low



                                          34
Conclusion of methodology

• We present a methodology to address the complexity of
  analyzing community structure, which simultaneously considers
  – large number of algorithms

  – multiple domains of application

  – a broad spectrum of metrics to characterize community structure

• A scalable framework that enables
  – researchers to compare and understand biases of new and existing community
    detection algorithms

  – practitioners to decide on the most suitable algorithm for particular purpose and
    network




                                            35
Conclusion of experimental analysis

• Our experimental analysis, which includes 10 community
  detection algorithms and 9 different networks analyzed with
  36 properties, reveals

  – High variability among the outputs of community detection methods
  – Annotated communities have a structure distinct from what we
    expect
     • their structure is closer to the output of baseline procedures than to that of popular
       algorithms
  – A small set of features explains the biases produced by different
    algorithms
  – We can organize the tapestry of available community detection
    algorithms by grouping them with respect to similarities in behavior

                                                  36
Final remarks on future directions

• Traditional methods are unsupervised
  – they find a particular type of community
  – little sensitivity to different purposes, structures of interest and domains of
    application



• Our approach suggests a supervised approach to
  community detection
  – the user specifies what they intend to find through examples (real or synthetic)
  – algorithm learns from those examples and retrieves similar structures in the
    network




                                              37
Thank you!


Más contenido relacionado

Similar a On the Separability of Structural Classes of Communities

Icpc2011 syer
Icpc2011 syerIcpc2011 syer
Icpc2011 syer
SAIL_QU
 
Using content and interactions for discovering communities in
Using content and interactions for discovering communities inUsing content and interactions for discovering communities in
Using content and interactions for discovering communities in
moresmile
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
Xavier Llorà
 
Cs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_miningCs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_mining
hari91
 
Robust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labelsRobust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labels
Kimin Lee
 

Similar a On the Separability of Structural Classes of Communities (20)

Modeling and mining complex networks with feature-rich nodes.
Modeling and mining complex networks with feature-rich nodes.Modeling and mining complex networks with feature-rich nodes.
Modeling and mining complex networks with feature-rich nodes.
 
CS6010 Social Network Analysis Unit III
CS6010 Social Network Analysis   Unit IIICS6010 Social Network Analysis   Unit III
CS6010 Social Network Analysis Unit III
 
The Object Oriented Database System Manifesto
The Object Oriented Database System ManifestoThe Object Oriented Database System Manifesto
The Object Oriented Database System Manifesto
 
MARS presentation
MARS presentationMARS presentation
MARS presentation
 
ESWC 2011 BLOOMS+
ESWC 2011 BLOOMS+ ESWC 2011 BLOOMS+
ESWC 2011 BLOOMS+
 
MobiCom CHANTS
MobiCom CHANTSMobiCom CHANTS
MobiCom CHANTS
 
Jürgens diata12-communities
Jürgens diata12-communitiesJürgens diata12-communities
Jürgens diata12-communities
 
Community detection in social networks[1]
Community detection in social networks[1]Community detection in social networks[1]
Community detection in social networks[1]
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
Does JavaScript Software Embrace Classes? (Talk at SANER 2015 Conference)
Does JavaScript Software Embrace Classes? (Talk at SANER 2015 Conference)Does JavaScript Software Embrace Classes? (Talk at SANER 2015 Conference)
Does JavaScript Software Embrace Classes? (Talk at SANER 2015 Conference)
 
Icpc2011 syer
Icpc2011 syerIcpc2011 syer
Icpc2011 syer
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting Tools
 
2016 Cytoscape 3.3 Tutorial
2016 Cytoscape 3.3 Tutorial2016 Cytoscape 3.3 Tutorial
2016 Cytoscape 3.3 Tutorial
 
Group and Community Detection in Social Networks
Group and Community Detection in Social NetworksGroup and Community Detection in Social Networks
Group and Community Detection in Social Networks
 
Using content and interactions for discovering communities in
Using content and interactions for discovering communities inUsing content and interactions for discovering communities in
Using content and interactions for discovering communities in
 
Detailed syllabus
Detailed syllabusDetailed syllabus
Detailed syllabus
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Cs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_miningCs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_mining
 
Similarity on DBpedia
Similarity on DBpediaSimilarity on DBpedia
Similarity on DBpedia
 
Robust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labelsRobust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labels
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 

Último (20)

2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 

On the Separability of Structural Classes of Communities

  • 1. Text On the Separability of Structural Classes of Communities Bruno Abrahao Sucheta Soundarajan John Hopcroft Robert Kleinberg Cornell University 1
  • 2. The idea of community structure as distinctive relationships [Newman-Girvan, 2004] 2
  • 5. Which community is real? Metis “Real” Community Random Walk Infomap Newman-Modularity Louvain 3
  • 6. How do their structures differ? Metis “Real” Community Random Walk Infomap Newman-Modularity Louvain 3
  • 7. The definition of community structure • Community structure is not well defined – different people have different notions of community structure • Traditional strategy 1. start with an expectation of what a community should look like • e.g., a set of nodes that interact more within the set than with the outside 2. define an optimization problem 3. design heuristic 4. the solution gives the desired communities 4
  • 8. Two research questions that we address here • A multitude of different algorithms – different objective functions – different heuristics How dissimilar are their outputs? • Communities may differ from the proposed mathematical constructs – e.g., preponderance of links to the outside, as in this figure, contrary to widely accepted notions Which algorithms extract communities that most closely resemble the structure of real communities?
  • 10. Obstacles to answering the preceding questions • We don't know what properties communities possess • We can't characterize communities in the absence of negative examples – Look at real communities and determine their structure – do other sets that are not communities have these properties? – every other connected set could be a negative example - intractable – sets that are not annotated could also be communities • We don't know what metrics we should use – modularity, conductance, clustering coefficient... 6
  • 11. Our plan to address the preceding questions • Propose a methodology to analyze community structure by using different notions of communities as references – key idea: analyze community structure without requiring negative examples of communities • Scalable and comprehensive, simultaneously considering – multiple notions of communities – diverse domains of application – a broad spectrum of community metrics • Assess the structural dissimilarities between – the output of different community detection algorithms – the output of algorithms and real communities 7
  • 12. Building structural classes by extracting examples: apply an algorithm to the network, then extract community examples
  • 15. Building structural classes by extracting examples: Algorithm 1 → Class 1, Algorithm 2 → Class 2, Algorithm 3 → Class 3, Algorithm 4 → Class 4, ..., Algorithm k → Class k
  • 17. Building a feature space by characterizing examples: Labeled Example → Feature Vector
  • 19. Building a feature space by characterizing examples: the feature vectors of all labeled examples populate the Feature Space
  • 21. Measure inter-class separability from feature space: Feature Space → Class Separability Measure. Are the classes separable? Separability = Distinct structures
  • 24. We test our methods on large-scale network datasets • Social + Rice University • Commercial • Biological. Facebook + Rice with permission of Mislove et al. Other datasets publicly available.
  • 25. We furnish our framework with 10 community detection algorithms • BFS (Random connected subgraphs) • Random-Walk-based (with and without restart) • (α,β)-communities • InfoMap • Markov Clustering • Metis • Louvain • Newman-Clauset-Moore • Link Communities
  • 26. Annotated communities identify exemplar communities Metadata included in the datasets + Rice University
  • 32. To what extent are the classes separable? 16
  • 33. First separability measure: Scatter Matrices • Traditional methods for measuring class separability give a single score, e.g., scatter matrices Network Reference: the same data with shuffled labels • This is a global measure. We need more fine-grained separability information of each class! 17
  • 34. Idea: use the performance of probabilistic multi-class classifiers. Train a probabilistic k-way classifier (SVM, k-NN) on the classes Algorithm 1, Algorithm 2 and Annotated communities
  • 36. Idea: use the performance of probabilistic multi-class classifiers. Classify (cross-validation) with the probabilistic k-way classifier (SVM, k-NN), e.g., Pr(Algorithm 1) = 0.05, Pr(Algorithm 2) = 0.08, ..., Pr(Annotated) = 0.48
  • 39. Cross-validation performance indicates class separability [Figure: probabilistic-SVM cross-validation outcome with 11 structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.), on a scale from 0.0 to 1.0. Data: DBLP network.]
  • 40. Matching annotated communities Which algorithms extract communities that most closely resemble the structure of annotated communities? 21
  • 41. Repeat the preceding experiment, leaving out the class of annotated communities. Learn a probabilistic k-way classifier from Algorithm 1, Algorithm 2, ..., Algorithm N
  • 43. Introduce the class of annotated communities in the test set. Classify with the probabilistic k-way classifier, e.g., Pr(Algorithm 1) = 0.02, Pr(Algorithm 2) = 0.19, ..., Pr(Algorithm k) = 0.12
  • 46. Classification reveals that annotated communities resemble unstructured methods [Figure: probabilistic-SVM classification of annotated communities into structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis) for 9 different networks (grad, Ugrad, SC, HS, Fly, Amazon, DBLP, LJ1, LJ2), on a scale from 0.0 to 1.0.]
  • 47. Improving the quality of the space What classes should we consider? 25
  • 48. The classifier confuses the two types of RW communities [Figure: probabilistic-SVM cross-validation outcome with 11 structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.), on a scale from 0.0 to 1.0. Data: DBLP network.]
  • 49. Fisher’s discriminant ratio. Reference: Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg, "A Separability Framework for Analyzing Community Structure", to appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013
  • 51. Cross-validation performance indicates class separability [Figure: probabilistic-SVM cross-validation outcome with 11 structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis, Ann.), on a scale from 0.0 to 1.0. Data: DBLP network.]
  • 52. Cross-validation performance indicates class separability [Figure: outcome per structural class.]
  • 53. Classification reveals that annotated communities resemble unstructured methods [Figure: probabilistic-SVM classification of annotated communities into structural classes (BFS, RW0, RW15, AB, IM, LC, Louv., Newm., MCL, Metis) for 9 different networks (grad, Ugrad, SC, HS, Fly, Amazon, DBLP, LJ1, LJ2), on a scale from 0.0 to 1.0.]
  • 54. Classification reveals that annotated communities resemble unstructured methods [Figure: outcome per network.]
  • 55. Can we reveal latent similarities among community detection algorithms? Our framework enables one to cluster algorithms that behave similarly 32
  • 56. Step 1: identifying the most important features 7 features out of 36 retain the discriminative power of the full set 33
  • 57. Grouping algorithms by their tendencies with respect to the most discriminative features [Figure legend: High, Medium, Low]
  • 59. Conclusion of methodology • We present a methodology to address the complexity of analyzing community structure, which simultaneously considers – a large number of algorithms – multiple domains of application – a broad spectrum of metrics to characterize community structure • A scalable framework that enables – researchers to compare and understand biases of new and existing community detection algorithms – practitioners to decide on the most suitable algorithm for a particular purpose and network
  • 60. Conclusion of experimental analysis • Our experimental analysis, which includes 10 community detection algorithms and 9 different networks analyzed with 36 properties, reveals – High variability among the outputs of community detection methods – Annotated communities have a distinct structure from what we expect • their structure is closer to the output of baseline procedures than to that of popular algorithms – A small set of features explains the biases produced by different algorithms – We can organize the tapestry of available community detection algorithms by grouping them with respect to similarities in behavior
  • 61. Final remarks on future directions • Traditional methods are unsupervised – they find a particular type of community – little sensitivity to different purposes, structures of interest and domains of application • Our approach suggests a supervised approach to community detection – the user specifies what they intend to find through examples (real or synthetic) – the algorithm learns from those examples and retrieves similar structures in the network
  • 62. Thank you! On the Separability of Structural Classes of Communities Bruno Abrahao Sucheta Soundarajan John Hopcroft Robert Kleinberg Cornell University 38
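
The classifier-based separability measure from slides 34 to 39 can be illustrated with a small sketch. It is not the authors' implementation: it swaps the probabilistic SVM for a plain k-NN vote and uses leave-one-out cross-validation. The function names, the two-dimensional feature vectors and the class labels are all hypothetical, chosen only to show how the average probability assigned to an example's own class serves as a per-class separability score.

```python
import math
from collections import Counter, defaultdict


def knn_class_probabilities(train, query, k=3):
    """Estimate Pr(class | query) as the label share among the k nearest neighbours."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


def leave_one_out_separability(examples, k=3):
    """For each class, average the probability that its held-out examples
    receive for their own class. Values near 1.0 suggest a separable class."""
    own_prob = defaultdict(list)
    for i, (features, label) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]  # hold out example i
        probs = knn_class_probabilities(train, features, k)
        own_prob[label].append(probs.get(label, 0.0))
    return {label: sum(p) / len(p) for label, p in own_prob.items()}


# Hypothetical feature vectors (two made-up structural features per community).
examples = [
    ((0.10, 0.90), "Algorithm 1"),
    ((0.12, 0.85), "Algorithm 1"),
    ((0.08, 0.88), "Algorithm 1"),
    ((0.11, 0.92), "Algorithm 1"),
    ((0.80, 0.20), "Annotated"),
    ((0.78, 0.25), "Annotated"),
    ((0.82, 0.18), "Annotated"),
    ((0.79, 0.22), "Annotated"),
]

scores = leave_one_out_separability(examples, k=3)
print(scores)
```

In this toy data the two classes sit far apart, so every held-out example is surrounded by members of its own class and both scores come out at 1.0. Classes whose examples interleave in feature space, as the two random-walk classes appear to on slide 48, would score lower.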

Editor's notes

  2. Community structure captures the tendency of entities in a network to group together in meaningful subsets whose members have a distinctive relationship to one another. Identifying these subsets allows networks to be analyzed at different levels of detail, which is instrumental in illuminating the structure underlying large-scale systems.
  3. In our KDD-madness video we showed you six communities from a given network, one real and the other five extracted using community detection algorithms. We asked whether you could tell which one is real. Now I’m going to give you the answer: this is the real one, and here are the names of the procedures that generated the other five. The second question we asked is “What is the difference among the structures of these communities?” This is the question that motivates our investigation.
  13. Given the diverse nature of networks, the notion of meaningful communities is necessarily context dependent, involving interpretations and expectations of domain experts. Therefore, many attempts to define communities are grounded in the notion of mathematical optimization. Starting with an a priori expectation about what a community should look like, researchers specify an objective function for a search method, whose solution for a given instance provides the desired communities. This process has given rise to a large collection of community detection algorithms, which aim at optimizing various objective functions.
  14. Communities in real networks often emerge as a result of multiple driving forces that make up the underlying complex system. Therefore, the attempt to capture community structure by maximizing a given objective function may represent an unrealistic expectation. As a consequence, communities identified by methods that reflect mathematical constructs may differ structurally from real communities that arise in practice.
  15. There is no established consensus on the question of what properties distinguish subgraphs that are communities from those that are not. While we can examine examples of community structure, e.g., by asking experts to identify communities in a given domain, finding negative examples of community structure is a challenging task. Any other subset of nodes in the network could potentially be a negative example. In large networks, exhaustively enumerating all forms of negative examples is computationally intractable. Moreover, even if we could enumerate all possible negative examples, we would still not know whether these seemingly negative sets are in fact valid communities that the expert simply did not identify.
  16. In this paper, we present a framework to tackle these challenges through a comprehensive analysis of community properties. By using different notions of communities as references, our methodology enables the characterization of community structure without requiring the identification of negative examples. Our method presents a scalable framework that enables researchers to understand biases and to assess the structural dissimilarity among the outputs of new and existing community detection algorithms, and between the output of algorithms and communities that arise in practice. In addition, the framework serves as a tool for practitioners to decide on the most suitable algorithm for a particular purpose and network. Our method also provides us with a way to organize the tapestry of community structure. Given the numerous algorithms available in the literature, we can group those that produce similar structures and separate those that produce fundamentally different ones. Finally, we are able to identify which graph-theoretical properties of a subgraph are the most discriminative of community signature and which properties the different community detection algorithms load their biases on.
  17. We frame our approach as a class separability problem, which simultaneously handles many classes of communities and a diverse set of structural properties. To this end, we specify a learning problem in which we map the distinct communities into a feature space, where the dimensions represent measures that characterize a community's link structure. The separability of classes provides information on the extent to which different communities come from the same (or fundamentally different) distributions of feature values.
  31. The separability of these classes demonstrates the extent to which different algorithms output structurally distinguishable subgraphs. A feature selection analysis can then be employed to highlight the properties that exhibit the highest degree of inter-class variability, thereby making explicit the structural bias produced by different algorithms. The separability of the class comprising annotated communities from the classes of intrinsically defined communities determines the extent to which community detection algorithms succeed in extracting subgraphs that are structurally comparable to the communities formed by nodes sharing extrinsic properties.
  34. We also analyze community structure from a diverse collection of large-scale real networks whose domains span biology, online shopping, and social systems.
  35. We furnish our framework with a large set of structural properties and ten different community detection procedures to produce examples of different structural classes. Our selection is representative of categories of popular algorithms available in the literature. We define the first set of communities by properties intrinsic to their link structure. For our purposes, these are the sets that community detection algorithms may output. Each class of intrinsically defined communities comprises a set of examples that a specific algorithm extracts.
  36. We also define communities by the context, the dynamics, or the function associated with the networks, but extrinsic to the link structure. We identify these communities through meaningful annotations provided with the datasets, such as explicit declaration of membership, product categories, grouping by protein function, and so on. In this fashion, for each network, we form a class of extrinsically defined communities, henceforth called annotated communities. These communities enable a large-scale rigorous analysis of community detection methods.
  43. Use the performance of existing probabilistic classifiers as a measure of separability.
  58. Finally, our method provides us with a way to organize the tapestry of community structure. Given the numerous algorithms available in the literature, we can group those that produce similar structures and separate those that produce fundamentally different ones.
  59. The first step is to identify which properties of a subgraph are the most discriminative of community signature and which properties the different community detection algorithms load their biases on most heavily.
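
The feature-ranking step described in note 59 can be sketched with Fisher's discriminant ratio, which slide 49 names without giving a formula. The snippet below assumes the textbook two-class form for a single feature, (μ_a − μ_b)² / (σ_a² + σ_b²), which may differ from the exact formulation in the TKDD paper. The function names, feature names and values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Two-class Fisher discriminant ratio for one feature (textbook form,
    assumed here): squared mean difference over the sum of class variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    diff = mean(values_a) - mean(values_b)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if diff != 0 else 0.0
    return diff ** 2 / spread


def rank_features(class_a, class_b, names):
    """Return (feature name, ratio) pairs sorted by decreasing Fisher ratio."""
    ratios = {
        name: fisher_ratio([row[j] for row in class_a],
                           [row[j] for row in class_b])
        for j, name in enumerate(names)
    }
    return sorted(ratios.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical feature vectors for two structural classes.
names = ["conductance", "density"]
class_a = [(0.1, 0.5), (0.2, 0.6), (0.3, 0.4)]
class_b = [(0.8, 0.5), (0.9, 0.4), (0.7, 0.6)]

ranked = rank_features(class_a, class_b, names)
print(ranked)
```

Here the first feature separates the two classes well and the second does not separate them at all, so the ranking puts the first feature on top. Keeping only the highest-ranked features mirrors the reduction from 36 features to 7 reported on slide 56, although the selection procedure actually used there is not specified in the slides.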