SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
The Noise Cluster Model:
            a Greedy Solution to the Network Communities
                         Extraction Problem

                                            Etienne Cˆme,
                                                      o
                                            come@inrets.fr
                                                  &
                                           Eustache Diemert,
                                       ediemert@bestofmedia.com


                                            6 octobre 2010



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                          The Noise Cluster Model   6 octobre 2010   1 / 32
Outline




  1   Introduction

  2   Existing solutions for the community extraction problem

  3   Background on Erd¨s-R´nyi mixture
                       o e

  4   The noise cluster model

  5   Community extraction using the noise cluster model

  6   Preliminary experiments : Blogs communities extraction

  7   Conclusion & future works



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                     The Noise Cluster Model   6 octobre 2010   2 / 32
Introduction


 Introduction

  Motivations
         Extract one community using seeds nodes from the community
         On-line algorithm (do not store the full graph)

  Solution : Community extraction
         extract one community
         semi-supervised method : some community members are known

  Solution : Noise cluster model
         simple generative model
         one community surrounded by noise



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   3 / 32
Introduction


 Introduction, (toy example)




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   4 / 32
Introduction


 Introduction, (graph clustering)




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   5 / 32
Introduction


 Introduction, (seeds)




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   6 / 32
Introduction


 Introduction, (community extraction)




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   7 / 32
Introduction


 Introduction, (community extraction)



                 Usefull



                                          Useless




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   8 / 32
Introduction


 Advantages




         seeds give a focus to process the graph
         better complexity
         the exploration of the full graph can be avoided
         no problem of balance between communities size




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                      The Noise Cluster Model   6 octobre 2010   9 / 32
Existing solutions for the community extraction problem


 Existing solutions for the community extraction problem


  Bagrow & al [BB05]
         growing a breadth first tree outward from one seed node ;
         until the rate of expansion falls below an arbitrary threshold. (i.e. the
         proportion of edges found at the current level which lead to nodes
         which are yet unknown)

  Problems
         can only deal with one seed
         all node of a level are included




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                 The Noise Cluster Model   6 octobre 2010   10 / 32
Existing solutions for the community extraction problem


 Existing solutions for the community extraction problem
  Clauset [Cla05]
         greedy optimization of a quantity called local modularity Lmod ;
         boundary B : the subset of known nodes that have at least one
         neighbour in the set of yet unknown nodes ;
         local modularity : number of edges between this set and the set of
         known nodes C over the total number of edges with one extremity in
         this set.
                                 i∈C,j∈B Bij +       i∈B,j∈C Bij
                       Lmod =                                    ,          (1)
                                             i,j Bij

  with Bij = 1 if i and j are connected and either vertex is in B.

  Problems
         can only deal with one seed
         stopping criteria tuning
Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                 The Noise Cluster Model   6 octobre 2010   11 / 32
Existing solutions for the community extraction problem


 Existing solutions for the community extraction problem



  Other solutions
         [AL06] random walks and conductances
         [SG10] combinatorial algorithms

  Problems
         complexity scales with the graph size




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                 The Noise Cluster Model   6 octobre 2010   12 / 32
Background on Erd¨s-R´nyi mixture
                                         o e


 Graph clustering

  Generative setting (Erd¨s-R´nyi mixture, block-model)
                         o e
  Variables definition :
         Xij are binary variables defining presence // absence of link from node
         i to node j :

                                         1, if there is a link from i to j
                              xij =                                                           (2)
                                         0, otherwise.

         Zjk are dummy variables encoding cluster membership, they take their
         values zjk :
                                1, if j belongs to cluster k
                       zjk =                                              (3)
                                0, otherwise.


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model       6 octobre 2010   13 / 32
Background on Erd¨s-R´nyi mixture
                                         o e


 Erd¨s-R´nyi mixture
    o e




  Model definition [DPS08]

                                               i.i.d
                                       Zjk      ∼       M(1, γ),           ∀i ∈ {1, . . . , N}             (4)
                                               i.i.d
                  Xij |Zik × Zjl = 1            ∼       B(πkl ),         ∀i, j ∈ {1, . . . , N}            (5)




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model                    6 octobre 2010   14 / 32
Background on Erd¨s-R´nyi mixture
                                         o e


 Erd¨s-R´nyi mixture
    o e




           Figure: Adjacency matrix simulated using an Erd¨s-R´nyi mixture
                                                          o e


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model   6 octobre 2010   15 / 32
The noise cluster model


 The noise cluster model



  Model definition

                                              i.i.d
                                       Zi      ∼       B(γ),       ∀i ∈ {1, . . . , N},                (6)
                                              i.i.d
                   Xij |Zi × Zj = 1            ∼       B(α),       ∀i, j ∈ {1, . . . , N},             (7)
                                              i.i.d
                   Xij |Zi × Zj = 0            ∼       B(β),       ∀i, j ∈ {1, . . . , N},             (8)

  with zi = 1, if i belongs to the community and 0 otherwise.




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                            The Noise Cluster Model                 6 octobre 2010   16 / 32
The noise cluster model


 The noise cluster model




           Figure: Adjacency matrix simulated using the noise cluster model.


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                            The Noise Cluster Model   6 octobre 2010   17 / 32
The noise cluster model




  Basics quantities
         Community size :
                                                     Nc =               zi
                                                                  i

         Nodes degrees :

                   djin =              xij ,   djout =                xji ,   dj =         (xij + xji )
                            i:zi =1                         i:zi =1                  i:zi =1

         Posteriors probabilities :

                pjin = P(Zj = 1|Xij = xij , Zi = zi , ∀i ∈ {1, . . . , N}),
              pjout    = P(Zj = 1|Xji = xji , Zi = zi , ∀i ∈ {1, . . . , N}),
           pjin,out    = P(Zj = 1|Xij = xij , Xji = xji , Zi = zi , ∀i ∈ {1, . . . , N}),



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model                         6 octobre 2010   18 / 32
The noise cluster model




  Simplifications :
  Community membership posterior probabilities are the quantities of
  interest to determine if a node must be added to the community. They
  depend uniquely on :
         parameters (α, β, γ) ;
         links with community members (djin , djout , djin,out respectively) ;
         community size (Nc) ;

  Example for pjin

                                                 in                     in
                                            αdj × (1 − α)(Nc−dj ) × γ
     pjin =           in                        in                in          in
                  αdj × (1 − α)(Nc−dj ) × γ + β dj × (1 − β)(Nc−dj ) × (1 − γ)



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                            The Noise Cluster Model        6 octobre 2010   19 / 32
The noise cluster model




  Community membership test
  Community membership test equivalent to threshold the number of shared
  links with community members.

                                       {pjin > s} ⇔ {djin > dmin },                      (9)

  with

                log s × (1 − β)Nc × (1 − γ) − log (1 − s) × (1 − α)Nc × γ
   dmin =
                            log (α × (1 − β)) − log ((1 − α) × β)




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                            The Noise Cluster Model   6 octobre 2010   20 / 32
The noise cluster model


                                                        alpha=0.1,beta=0.001,gamma=0.05,Nc=200

                                                       qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq




                                  0.8
                           pc

                                  0.4
                                                   q




                                  0.0
                                            qqqq


                                        0                 10           20          30           40   50

                                                                             din



                                                               alpha=0.1,beta=0.001,gamma=0.05
                                  10
                                  8
                           dmin

                                  6
                                  4
                                  2




                                            0                   100          200          300        400

                                                                             Nc




  Figure: (top) values of pjin with respect to djin with α = 0.1,
  β = 0.001, γ = 0.05 and Nc = 200 ; (bottom) dmin evolution with respect to the
  community size Nc with α = 0.1, β = 0.001, γ = 0.05 and s = 0.5.

Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                              The Noise Cluster Model                    6 octobre 2010   21 / 32
Community extraction using the noise cluster model


 Online learning [ZAM08]

  Classification likelihood
  In the case of a full adjacency matrix, the classification log-likelihood is
  defined as :

     Lc (X, Z, θ) =                zi log(γ) +              (1 − zi ) log(1 − γ)
                               i                       i

            +               zi × zj × xij log(α) +                     zi × zj (1 − ×xij ) log(1 − α)
                  i,j:i=j                                    i,j:i=j

   +             (1 − zi × zj ) × xij log(β) +                       (1 − zi × zj ) × (1 − xij ) log(1 − β)
       i,j:i=j                                             i,j:i=j

  with Z = {z1 , . . . , zN }, X = {xij : i = j, i, j ∈ {1, . . . , N}}, and
  θ = (γ, α, β) the parameters vector.


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                The Noise Cluster Model                   6 octobre 2010   22 / 32
Community extraction using the noise cluster model


 Online learning [ZAM08]
  Maximisation for known partition
  If the partition Z = {z1 , . . . , zN } is known and with a square adjacency
  matrix of size N × N, the parameter vector maximizing the Classification
  likelihood is given by :
                                    Nc
                       γ =
                       ˆ               ,                                                                       (10)
                                    N
                                               N
                                     1
                       α =
                       ˆ              2
                                                       (zi × zj )xij ,                                         (11)
                                    Nc
                                          i,j=1, i=j
                                                                   N
                       ˆ                  1
                       β =                                                   (1 − zi × zj )xij ,               (12)
                                    Nc × (N + Nc )
                                     ¯
                                                               i,j=1, i=j

  with Nc the number of nodes belonging to the community and Nc the
                                                              ¯
  number of nodes that do not belong to the community.
Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                 The Noise Cluster Model                    6 octobre 2010    23 / 32
Community extraction using the noise cluster model


 Proposed community extraction procedure



  Algorithm
  Use a breadth first algorithm to explore the graph starting from the seeds,
  for each traversed vertex :
     1   use community membership test (9) to add it or not to the
         community
     2   update parameters (using 10, 11, 12), taking into account the current
         partition
  until no more vertex can be added to the community.




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                               The Noise Cluster Model   6 octobre 2010   24 / 32
Preliminary experiments : Blogs communities extraction


 Preliminary experiments : Blogs communities extraction



  Settings
         multi-threaded web crawler coupled with the proposed community
         extraction procedure ;
         seeds URLs taken from Wikio (http ://www.wikio.com) which
         proposes several rankings of blogs for several topics ;
         theses ranking were used to provide 100 or 50 seeds to the algorithm
         for 4 test communities.




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                The Noise Cluster Model   6 octobre 2010   25 / 32
Preliminary experiments : Blogs communities extraction


 Blogs communities extraction


                                 Comics (Fr)        Scrapbooking (Fr)       Food (U.S.)   Politics (U.S.)
       Nb seed                   100                100                     50            50
       Nc                        1 263              1 130                   1 681         1 884
       Nb edges                  20 434             24 248                  100 597       74 219
       α                         0.01821            0.01899                 0.03560       0.02091
       β                         0.00093            0.00147                 0.00091       0.00065
       γ                         0.03048            0.05579                 0.03060       0.01808
       Biggest S.C.C.            1 251              1 129                   1 667         1 877
       Max Level                 3                  2                       5             4
       Diameter                  6                  7                       7             8
       Radius                    4                  4                       4             3
       Clustering Coeff.          0.287              0.265                   0.381         0.320
       Transitivity              0.198              0.2                     0.290         0.223

            Table: Global statistics and model parameters for 4 communities.




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                The Noise Cluster Model                 6 octobre 2010    26 / 32
Preliminary experiments : Blogs communities extraction


 Blogs community extraction

                                    names                                   level
                             1      www.bouletcorp.com                      0
                             2      louromano.blogspot.com                  2
                             3      www.cartoonbrew.com                     2
                             4      yacinfields.blogspot.com                 1
                             5      polyminthe.blogspot.com                 1
                             6      marnette.canalblog.com                  1
                             7      blackwingdiaries.blogspot.com           2
                             8      bastienvives.blogspot.com               1
                             9      donshank.blogspot.com                   2
                            10      john-nevarez.blogspot.com               2
     Table: Best site according to local page rank for the Comics (fr) community



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                The Noise Cluster Model           6 octobre 2010   27 / 32
Preliminary experiments : Blogs communities extraction




  Figure: Word clouds for Politics (us). The first 50 words in descending order of
  their Kullback-Leibler divergence are kept(between word document frequency in
  the community and in a negative class of 10000 random blogs, texts have been
  first preprocessed using a stop list and stemming). Words size are proportional to
  the word document frequencies in the community.


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                The Noise Cluster Model   6 octobre 2010   28 / 32
Preliminary experiments : Blogs communities extraction




  Figure: Word clouds for Food (us). The first 50 words in descending order of
  their Kullback-Leibler divergence are kept(between word document frequency in
  the community and in a negative class of 10000 random blogs, texts have been
  first preprocessed using a stop list and stemming). Words size are proportional to
  the word document frequencies in the community.


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                                The Noise Cluster Model   6 octobre 2010   29 / 32
Conclusion & future works


 Conclusion & future works

  Conclusion
         simple, greedy approach ;
         complexity scales with the community size not the graph size ;
         blog community extraction was performed using such a tool with
         success.

  Future works
  More work is needed to better understand and evaluate the approach :
         test the robustness of the methods to noise in the seeds set ;
         test with other application domains (with ground truth) ;
         test using graph generation algorithms.



Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model   6 octobre 2010   30 / 32
Conclusion & future works



        R. Andersen and K. Lang.
        Communities from seed sets.
        In Proceedings of the 15th International Conference on World Wide Web, pages 223–232.
        ACM Press, 2006.
        J.P. Bagrow and E.M. Bollt.
        A local method for detecting communities.
        Phys Rev E Stat Nonlin Soft Matter Phys, 72(4) :046108, 2005.

        A. Clauset.
        Finding local community structure in networks.
        Phys Rev E Stat Nonlin Soft Matter Phys, 72(2) :026132, 2005.

        J. Daudin, F. Picard, and Robin S.
        A mixture model for random graph.
        Statistics and computing, 18 :1–36, 2008.

        M. Sozio and A. Gionis.
        The community-search problem and how to plan a successful cocktail party.
        In Proceedings of the 16th ACM SIGKDD Conference On Knowledge Discovery and Data
        Mining (KDD), pages –, 2010.

        H. Zanghi, C. Ambroise, and V. Miele.
        Fast online graph clustering via erdos-renyi mixture.
        Pattern Recognition, 41(12) :3592–3599, December 2008.


Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model     6 octobre 2010   31 / 32
Conclusion & future works




              Thanks for your attention !




Cˆme & Diemert (INRETS, BestOfMedia)
 o                                             The Noise Cluster Model   6 octobre 2010   32 / 32

Más contenido relacionado

Destacado

Marne IFSTTAR
Marne IFSTTARMarne IFSTTAR
Marne IFSTTARticien
 
animatics
animaticsanimatics
animaticsticien
 
Seminaire Ercim
Seminaire ErcimSeminaire Ercim
Seminaire Ercimticien
 
Presentation egc
Presentation egcPresentation egc
Presentation egcticien
 
Presentation Tisic 2011
Presentation Tisic 2011Presentation Tisic 2011
Presentation Tisic 2011ticien
 

Destacado (6)

Tisyc
TisycTisyc
Tisyc
 
Marne IFSTTAR
Marne IFSTTARMarne IFSTTAR
Marne IFSTTAR
 
animatics
animaticsanimatics
animatics
 
Seminaire Ercim
Seminaire ErcimSeminaire Ercim
Seminaire Ercim
 
Presentation egc
Presentation egcPresentation egc
Presentation egc
 
Presentation Tisic 2011
Presentation Tisic 2011Presentation Tisic 2011
Presentation Tisic 2011
 

Marami 2010

  • 1. The Noise Cluster Model: a Greedy Solution to the Network Communities Extraction Problem Etienne Cˆme, o come@inrets.fr & Eustache Diemert, ediemert@bestofmedia.com 6 octobre 2010 Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 1 / 32
  • 2. Outline 1 Introduction 2 Existing solutions for the community extraction problem 3 Background on Erd¨s-R´nyi mixture o e 4 The noise cluster model 5 Community extraction using the noise cluster model 6 Preliminary experiments : Blogs communities extraction 7 Conclusion & future works Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 2 / 32
  • 3. Introduction Introduction Motivations Extract one community using seeds nodes from the community On-line algorithm (do not store the full graph) Solution : Community extraction extract one community semi-supervised method : some community members are known Solution : Noise cluster model simple generative model one community surrounded by noise Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 3 / 32
  • 4. Introduction Introduction, (toy example) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 4 / 32
  • 5. Introduction Introduction, (graph clustering) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 5 / 32
  • 6. Introduction Introduction, (seeds) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 6 / 32
  • 7. Introduction Introduction, (community extraction) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 7 / 32
  • 8. Introduction Introduction, (community extraction) Usefull Useless Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 8 / 32
  • 9. Introduction Advantages seeds give a focus to process the graph better complexity the exploration of the full graph can be avoided no problem of balance between communities size Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 9 / 32
  • 10. Existing solutions for the community extraction problem Existing solutions for the community extraction problem Bagrow & al [BB05] growing a breadth first tree outward from one seed node ; until the rate of expansion falls below an arbitrary threshold. (i.e. the proportion of edges found at the current level which lead to nodes which are yet unknown) Problems can only deal with one seed all node of a level are included Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 10 / 32
  • 11. Existing solutions for the community extraction problem Existing solutions for the community extraction problem Clauset [Cla05] greedy optimization of a quantity called local modularity Lmod ; boundary B : the subset of known nodes that have at least one neighbour in the set of yet unknown nodes ; local modularity : number of edges between this set and the set of known nodes C over the total number of edges with one extremity in this set. i∈C,j∈B Bij + i∈B,j∈C Bij Lmod = , (1) i,j Bij with Bij = 1 if i and j are connected and either vertex is in B. Problems can only deal with one seed stopping criteria tuning Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 11 / 32
  • 12. Existing solutions for the community extraction problem Existing solutions for the community extraction problem Other solutions [AL06] random walks and conductances [SG10] combinatorial algorithms Problems complexity scales with the graph size Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 12 / 32
  • 13. Background on Erd¨s-R´nyi mixture o e Graph clustering Generative setting (Erd¨s-R´nyi mixture, block-model) o e Variables definition : Xij are binary variables defining presence // absence of link from node i to node j : 1, if there is a link from i to j xij = (2) 0, otherwise. Zjk are dummy variables encoding cluster membership, they take their values zjk : 1, if j belongs to cluster k zjk = (3) 0, otherwise. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 13 / 32
  • 14. Background on Erd¨s-R´nyi mixture o e Erd¨s-R´nyi mixture o e Model definition [DPS08] i.i.d Zjk ∼ M(1, γ), ∀i ∈ {1, . . . , N} (4) i.i.d Xij |Zik × Zjl = 1 ∼ B(πkl ), ∀i, j ∈ {1, . . . , N} (5) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 14 / 32
  • 15. Background on Erd¨s-R´nyi mixture o e Erd¨s-R´nyi mixture o e Figure: Adjacency matrix simulated using an Erd¨s-R´nyi mixture o e Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 15 / 32
  • 16. The noise cluster model The noise cluster model Model definition i.i.d Zi ∼ B(γ), ∀i ∈ {1, . . . , N}, (6) i.i.d Xij |Zi × Zj = 1 ∼ B(α), ∀i, j ∈ {1, . . . , N}, (7) i.i.d Xij |Zi × Zj = 0 ∼ B(β), ∀i, j ∈ {1, . . . , N}, (8) with zi = 1, if i belongs to the community and 0 otherwise. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 16 / 32
  • 17. The noise cluster model The noise cluster model Figure: Adjacency matrix simulated using the noise cluster model. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 17 / 32
  • 18. The noise cluster model Basics quantities Community size : Nc = zi i Nodes degrees : djin = xij , djout = xji , dj = (xij + xji ) i:zi =1 i:zi =1 i:zi =1 Posteriors probabilities : pjin = P(Zj = 1|Xij = xij , Zi = zi , ∀i ∈ {1, . . . , N}), pjout = P(Zj = 1|Xji = xji , Zi = zi , ∀i ∈ {1, . . . , N}), pjin,out = P(Zj = 1|Xij = xij , Xji = xji , Zi = zi , ∀i ∈ {1, . . . , N}), Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 18 / 32
  • 19. The noise cluster model Simplifications : Community membership posterior probabilities are the quantities of interest to determine if a node must be added to the community. They depend uniquely on : parameters (α, β, γ) ; links with community members (djin , djout , djin,out respectively) ; community size (Nc) ; Example for pjin in in αdj × (1 − α)(Nc−dj ) × γ pjin = in in in in αdj × (1 − α)(Nc−dj ) × γ + β dj × (1 − β)(Nc−dj ) × (1 − γ) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 19 / 32
  • 20. The noise cluster model Community membership test Community membership test equivalent to threshold the number of shared links with community members. {pjin > s} ⇔ {djin > dmin }, (9) with log s × (1 − β)Nc × (1 − γ) − log (1 − s) × (1 − α)Nc × γ dmin = log (α × (1 − β)) − log ((1 − α) × β) Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 20 / 32
  • 21. The noise cluster model alpha=0.1,beta=0.001,gamma=0.05,Nc=200 qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq 0.8 pc 0.4 q 0.0 qqqq 0 10 20 30 40 50 din alpha=0.1,beta=0.001,gamma=0.05 10 8 dmin 6 4 2 0 100 200 300 400 Nc Figure: (top) values of pjin with respect to djin with α = 0.1, β = 0.001, γ = 0.05 and Nc = 200 ; (bottom) dmin evolution with respect to the community size Nc with α = 0.1, β = 0.001, γ = 0.05 and s = 0.5. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 21 / 32
  • 22. Community extraction using the noise cluster model Online learning [ZAM08] Classification likelihood In the case of a full adjacency matrix, the classification log-likelihood is defined as : Lc (X, Z, θ) = zi log(γ) + (1 − zi ) log(1 − γ) i i + zi × zj × xij log(α) + zi × zj (1 − ×xij ) log(1 − α) i,j:i=j i,j:i=j + (1 − zi × zj ) × xij log(β) + (1 − zi × zj ) × (1 − xij ) log(1 − β) i,j:i=j i,j:i=j with Z = {z1 , . . . , zN }, X = {xij : i = j, i, j ∈ {1, . . . , N}}, and θ = (γ, α, β) the parameters vector. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 22 / 32
  • 23. Community extraction using the noise cluster model Online learning [ZAM08] Maximisation for known partition If the partition Z = {z1 , . . . , zN } is known and with a square adjacency matrix of size N × N, the parameter vector maximizing the Classification likelihood is given by : Nc γ = ˆ , (10) N N 1 α = ˆ 2 (zi × zj )xij , (11) Nc i,j=1, i=j N ˆ 1 β = (1 − zi × zj )xij , (12) Nc × (N + Nc ) ¯ i,j=1, i=j with Nc the number of nodes belonging to the community and Nc the ¯ number of nodes that do not belong to the community. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 23 / 32
  • 24. Community extraction using the noise cluster model Proposed community extraction procedure Algorithm Use a breadth first algorithm to explore the graph starting from the seeds, for each traversed vertex : 1 use community membership test (9) to add it or not to the community 2 update parameters (using 10, 11, 12), taking into account the current partition until no more vertex can be added to the community. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 24 / 32
  • 25. Preliminary experiments : Blogs communities extraction Preliminary experiments : Blogs communities extraction Settings multi-threaded web crawler coupled with the proposed community extraction procedure ; seeds URLs taken from Wikio (http ://www.wikio.com) which proposes several rankings of blogs for several topics ; theses ranking were used to provide 100 or 50 seeds to the algorithm for 4 test communities. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 25 / 32
  • 26. Preliminary experiments : Blogs communities extraction Blogs communities extraction Comics (Fr) Scrapbooking (Fr) Food (U.S.) Politics (U.S.) Nb seed 100 100 50 50 Nc 1 263 1 130 1 681 1 884 Nb edges 20 434 24 248 100 597 74 219 α 0.01821 0.01899 0.03560 0.02091 β 0.00093 0.00147 0.00091 0.00065 γ 0.03048 0.05579 0.03060 0.01808 Biggest S.C.C. 1 251 1 129 1 667 1 877 Max Level 3 2 5 4 Diameter 6 7 7 8 Radius 4 4 4 3 Clustering Coeff. 0.287 0.265 0.381 0.320 Transitivity 0.198 0.2 0.290 0.223 Table: Global statistics and model parameters for 4 communities. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 26 / 32
  • 27. Preliminary experiments : Blogs communities extraction Blogs community extraction names level 1 www.bouletcorp.com 0 2 louromano.blogspot.com 2 3 www.cartoonbrew.com 2 4 yacinfields.blogspot.com 1 5 polyminthe.blogspot.com 1 6 marnette.canalblog.com 1 7 blackwingdiaries.blogspot.com 2 8 bastienvives.blogspot.com 1 9 donshank.blogspot.com 2 10 john-nevarez.blogspot.com 2 Table: Best site according to local page rank for the Comics (fr) community Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 27 / 32
  • 28. Preliminary experiments : Blogs communities extraction Figure: Word clouds for Politics (us). The first 50 words in descending order of their Kullback-Leibler divergence are kept(between word document frequency in the community and in a negative class of 10000 random blogs, texts have been first preprocessed using a stop list and stemming). Words size are proportional to the word document frequencies in the community. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 28 / 32
  • 29. Preliminary experiments : Blogs communities extraction Figure: Word clouds for Food (us). The first 50 words in descending order of their Kullback-Leibler divergence are kept(between word document frequency in the community and in a negative class of 10000 random blogs, texts have been first preprocessed using a stop list and stemming). Words size are proportional to the word document frequencies in the community. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 29 / 32
  • 30. Conclusion & future works Conclusion & future works Conclusion simple, greedy approach ; complexity scales with the community size not the graph size ; blog community extraction was performed using such a tool with success. Future works More work is needed to better understand and evaluate the approach : test the robustness of the methods to noise in the seeds set ; test with other application domains (with ground truth) ; test using graph generation algorithms. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 30 / 32
  • 31. Conclusion & future works R. Andersen and K. Lang. Communities from seed sets. In Proceedings of the 15th International Conference on World Wide Web, pages 223–232. ACM Press, 2006. J.P. Bagrow and E.M. Bollt. A local method for detecting communities. Phys Rev E Stat Nonlin Soft Matter Phys, 72(4) :046108, 2005. A. Clauset. Finding local community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys, 72(2) :026132, 2005. J. Daudin, F. Picard, and Robin S. A mixture model for random graph. Statistics and computing, 18 :1–36, 2008. M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD Conference On Knowledge Discovery and Data Mining (KDD), pages –, 2010. H. Zanghi, C. Ambroise, and V. Miele. Fast online graph clustering via erdos-renyi mixture. Pattern Recognition, 41(12) :3592–3599, December 2008. Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 31 / 32
  • 32. Conclusion & future works Thanks for your attention ! Cˆme & Diemert (INRETS, BestOfMedia) o The Noise Cluster Model 6 octobre 2010 32 / 32