UNIT – IV_PCI Complaints: Complaints and evaluation of complaints, Handling o...
Personalized PageRank based community detection
1. Personalized !
PageRank based
Community Detection
David F. Gleich!
Purdue University!
Joint work with C. Seshadhri,
Joyce Jiyoung Whang, and
Inderjit S. Dhillon, supported by
NSF CAREER 1149756-CCF
Code bit.ly/dgleich-codes!
2. Today’s talk
1. Personalized PageRank"
based community detection
2. Conductance, Egonets, and "
Network Community Profiles
3. Egonet seeding
4. Improved seeding
David Gleich · Purdue
2
MLG2013
3. A community is a set of
vertices that is denser inside
than out.
David Gleich · Purdue
3
MLG2013
6. We can find communities using
Personalized PageRank (PPR)
[Andersen et al. 2006]
PPR is a Markov chain on nodes
1. with probability 𝛼, ", "
follow a random edge
2. with probability 1-𝛼, ", "
restart at a seed
aka random surfer
aka random walk with restart
unique stationary distribution
David Gleich · Purdue
6
MLG2013
7. Personalized PageRank
community detection
1. Given a seed, approximate the
stationary distribution.
2. Extract the community.
Both are local operations.
David Gleich · Purdue
7
MLG2013
9. Conductance communities
Conductance is one of the most
important community scores [Schaeffer07]
The conductance of a set of vertices is
the ratio of edges leaving to total edges:
Equivalently, it’s the probability that a
random edge leaves the set.
Small conductance ó Good community
(S) =
cut(S)
min vol(S), vol( ¯S)
(edges leaving the set)
(total edges
in the set)
David Gleich · Purdue
cut(S) = 7
vol(S) = 33
vol( ¯S) = 11
(S) = 7/11
9
MLG2013
11. # G is graph as dictionary-of-sets!
alpha=0.99!
tol=1e-4!
!
x = {} # Store x, r as dictionaries!
r = {} # initialize residual!
Q = collections.deque() # initialize queue!
for s in seed: !
r(s) = 1/len(seed)!
Q.append(s)!
while len(Q) > 0:!
v = Q.popleft() # v has r[v] > tol*deg(v)!
if v not in x: x[v] = 0.!
x[v] += (1-alpha)*r[v]!
mass = alpha*r[v]/(2*len(G[v])) !
for u in G[v]: # for neighbors of u!
if u not in r: r[u] = 0.!
if r[u] < len(G[u])*tol and !
r[u] + mass >= len(G[u])*tol:!
Q.append(u) # add u to queue if large!
r[u] = r[u] + mass!
r[v] = mass*len(G[v]) !
David Gleich · Purdue
11
MLG2013
16. Gleich-Seshadhri, KDD 2012!
Egonets and Conductance
David Gleich · Purdue
Neighborhoods are good communities
^
conductance
^
Vertex
Egonets?
… in graphs that look like social
and information networks
16
MLG2013
17. Vertex neighborhoods or!
Egonets
The induced subgraph of"
set a vertex its neighbors
Prior research on egonets of social networks from
the “structural holes” perspective [Burt95,Kleinberg08].
Used for anomaly detection [Akoglu10], "
community seeds [Huang11,Schaeffer11], "
overlapping communities [Schaeffer07,Rees10].
David Gleich · Purdue
17
MLG2013
18. Simple version of theorem
If global clustering coefficient = 1, then "
the graph is a disjoint union of cliques.
Vertex neighborhoods are optimal communities!
David Gleich · Purdue
18
MLG2013
19. Theorem
Condition Let graph G have
clustering coefficient 𝜅 and "
have vertex degrees bounded "
by a power-law function with
exponent 𝛾 less than 3.
Theorem Then there exists a vertex
neighborhood with conductance
log degree
logprobability
↵1n/d
↵2n/d
4(1 )/(3 2)
David Gleich · Purdue
19
MLG2013
21. We view this theory as "
“intuition for the truth”
David Gleich · Purdue
21
MLG2013
22. Empirical Evaluation using
Network Community Profiles
la25 [27]) or as
304 [10]). The
the nodes.
and the edges
].
OD
ure to compute
he conductance
ost of the work
erformed when
We can express
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
1
10
−4
10
−3
10
0
10
1
1
10
−4
10
−3
fb-A-oneyear
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
soc-LiveJournal1 ca
10
0
10
0
10
0
10
0
Community Size
Minimum
conductance for
any community of
the given size
Approximate
canonical shape
found by
Leskovec, Lang,
Dasgupta, and
Mahoney
Holds for a variety
of approximations
to conductance.
David Gleich · Purdue
22
MLG2013
23. Empirical Evaluation using
Network Community Profiles
la25 [27]) or as
304 [10]). The
the nodes.
and the edges
].
OD
ure to compute
he conductance
ost of the work
erformed when
We can express
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
1
10
−4
10
−3
10
0
10
1
1
10
−4
10
−3
fb-A-oneyear
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
soc-LiveJournal1 ca
10
0
10
0
10
0
10
0
Community Size"
(Degree + 1)
Minimum
conductance for
any community
neighborhood of
the given size
“Egonet
community
profile” shows
the same
shape, 3 secs
to compute.
1.1M verts, 4M edges
The Fiedler
community
computed from
the normalized
Laplacian is a
neighborhood!
David Gleich · Purdue
Facebook data
from Wilson et
al. 2009
23
MLG2013
24. Not just one graph
10
5
10
5
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
max
deg
ver t s
2
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
max
deg
ver t s
2
arxiv
10
5
10
5
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
ver t s
2
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
ver t s
2
ca-AstroPh
10
0
10
0
t any procedure to compute
o compute the conductance
he graph. Most of the work
e cient is performed when
the vertex. We can express
s:
{v})/2
ighbors produces a triangle
double-counts). Note also
dges(N1(v))/2 dv. Then
ges(N1(v)). And so, given
compute the cut given the
well. This is easy to do with
tly stores the degrees.
hood communities
network community plot to
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
max
deg
1
10
−4
10
−3
1
10
−4
10
−3
soc-LiveJournal1
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
1
10
−2
10
−1
10
0
1
10
−2
10
−1
10
0
Number of vertices in cluster N
Figure 2: The best neighbo
ductance at each size (black
arXiv – 86k verts, 500k edges
soc-LiveJournal – 5M verts, 42M edges
15 more graphs available
www.cs.purdue.edu/~dgleich/codes/neighborhoods
David Gleich · Purdue
24
MLG2013
25. Filling in the !
Network Community Profile
la25 [27]) or as
304 [10]). The
the nodes.
and the edges
].
OD
ure to compute
he conductance
ost of the work
erformed when
We can express
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
1
10
−4
10
−3
10
0
10
1
1
10
−4
10
−3
fb-A-oneyear
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
soc-LiveJournal1 ca
10
0
10
0
10
0
10
0
Minimum
conductance for
any community
neighborhood of
the given size
We are missing
a region of the
NCP when we
just look at
neighborhoods
David Gleich · Purdue
Community Size"
(Degree + 1)
25
MLG2013
Facebook Sample - 1.1M verts, 4M edges
26. 10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
Filling in the !
Network Community Profile
la25 [27]) or as
304 [10]). The
the nodes.
and the edges
].
OD
ure to compute
he conductance
ost of the work
erformed when
We can express
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
1
10
−4
10
−3
10
0
10
1
1
10
−4
10
−3
fb-A-oneyear
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
soc-LiveJournal1 ca
10
0
10
0
10
0
10
0
Minimum
conductance for
any community of
the given size
7807 seconds
This region
fills when
using the
PPR method
(like now!)
David Gleich · Purdue
Community Size"
26
MLG2013
Facebook Sample - 1.1M verts, 4M edges
27. Am I a good seed?!
Locally Minimal Communities
“My conductance is the best locally.”
(N(v)) (N(w))
for all w adjacent to v
In Zachary’s Karate Club
network, there are four locally
minimal communities, the two
leaders and two peripheral
nodes.
David Gleich · Purdue
27
MLG2013
28. 10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
Locally minimal communities
capture extremal neighborhoods
la25 [27]) or as
304 [10]). The
the nodes.
and the edges
].
OD
ure to compute
he conductance
ost of the work
erformed when
We can express
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
1
10
−4
10
−3
10
0
10
1
1
10
−4
10
−3
fb-A-oneyear
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
soc-LiveJournal1 ca
10
0
10
0
10
0
10
0
Red dots are
conductance "
and size of a "
locally minimal
community
Usually about 1%
of # of vertices.
The red
circles – the
best local
mins – find
the extremes
in the egonet
profile.
David Gleich · Purdue
Community Size"
28
MLG2013
Facebook Sample - 1.1M verts, 4M edges
29. 10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
Filling in the NCP!
Growing locally minimal comm.
la25 [27]) or as
304 [10]). The
the nodes.
and the edges
].
OD
ure to compute
he conductance
ost of the work
erformed when
We can express
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
max
deg
10
0
10
1
1
10
−4
10
−3
10
0
10
1
1
10
−4
10
−3
fb-A-oneyear
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
2
10
3
10
4
10
5
10
−4
10
−3
10
−2
10
−1
10
0
max
deg
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
10
0
10
1
10
−4
10
−3
10
−2
10
−1
10
0
soc-LiveJournal1 ca
10
0
10
0
10
0
10
0
PPR growing
only locally min
communities,
seeded from
entire egonet
3 seconds
283 seconds
7807 seconds
Full NCP
Locally min
NCP
Original
Egonet
David Gleich · Purdue
Community Size"
29
MLG2013
30. But there’s a small problem.
Most people want to cover a
network with communities!
We just looked at the best.
David Gleich · Purdue
30
MLG2013
31. The coverage of egonet-grown
communities is really bad.
David Gleich · Purdue
10
−3
10
−2
10
−1
10
0
10
−2
10
−1
10
0
0.7%
=0.10
2.2%
=0.25
29.8%
=0.88
10
−3
10
−2
10
−1
10
0
10
−2
10
−1
10
0
0.7%
=0.10
2.2%
=0.25
29.8%
=0.88
Coverage
MaxConductance
Facebook network
With a conductance
of 0.1 (not so good)
we only cover 1% of
the vertices in the
network.
Log-Log scale!
31
MLG2013
32. Whang-Gleich-Dhillon,
CIKM2013 [upcoming…]
1. Extract part of the graph that might have
overlapping communities.
2. Compute a partitioning of the network into
many pieces (think sqrt(n)) using Graclus.
3. Find the center of these partitions.
4. Use PPR to grow egonets of these centers.
David Gleich · Purdue
32
MLG2013
33. Student Version of MATLAB
(a) AstroPh
0 10 20 30 40 50 60 70 80 90 100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Coverage (percentage)
MaximumConductance
egonet
graclus centers
spread hubs
random
bigclam
(d) Flickr
Flickr social
network
2M vertices"
22M edges
We can cover
95% of network
with communities
of cond. ~0.15.
David Gleich · Purdue
A good partitioning helps!
33
MLG2013
flickr sample - 2M verts, 22M edges
34. F1 F2
0.1
0.12
0.14
0.16
0.18
0.2
0.22
0.24
DBLP
demon
bigclam
graclus centers
spread hubs
random
egonet
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
Figure 3: F1 and F2 measures comparing our algorithmic co
indicates better communities.
Run time Our seed
Using datasets from "
Yang and Leskovec
(WDSM 2013) with
known overlapping
community structure
Our method outperform
current state of the art
overlapping community
detection methods. "
Even randomly seeded!
David Gleich · Purdue
And helps to find real-world
overlapping communities too.
34
MLG2013
35. Conclusion & Discussion &
PPR community detection is fast "
[Andersen et al. FOCS06]
PPR communities look real "
[Abrahao et al. KDD2012; Zhu et al. ICML2013]
Egonet analysis reveals basis of NCP
Partitioning for seeding yields "
high coverage & real communities.
“Caveman” communities?!
!
!
!
MLG2013
David Gleich · Purdue
35
Gleich & Seshadhri
KDD2012
Whang, Gleich & Dhillon
CIKM2013
PPR Sample !
bit.ly/18khzO5!
!
Egonet seeding
bit.ly/dgleich-code!
References
Best conductance cut
at intersection of
communities?
36. Proof Sketch
1) Large clustering coefficient "
⇒ many wedges are closed
2) Heavy tailed degree dist "
⇒ a few vertices have a very large degree
3) Large degree ⇒ O(d 2) wedges ⇒ “most” of wedges
Thus, there must exist a vertex with a high edge density ⇒
“good” conductance
Use the probabilistic method to formalize
10
0
10
1
10
2
10
3
10
4
0
0.2
0.4
0.6
0.8
1
CDFofNumberofWedges
Degree
David Gleich · Purdue
36
MLG2013