2. Abstract
Numerical Linear Algebra for Data and Link Analysis
Modern information retrieval and data mining systems must operate on extremely large datasets and require efficient, robust and
scalable algorithms. Numerical linear algebra provides a solid foundation for the development of such algorithms and analysis
of their behavior.
In this talk I will discuss several linear algebra based methods and their practical applications:
i) Spectral graph partitioning. I will describe a recursive spectral algorithm for bi-partite graph partitioning and its application
to simultaneous clustering of bidded terms and advertisers in pay-for-performance market data. I will also present a new local
refinement strategy that allows us to improve cluster quality.
ii) Web graph link analysis. I will discuss a linear system formulation of the PageRank algorithm and the use of Krylov
subspace methods for an efficient solution. I will also describe our scalable parallel implementation and present results of
numerical experiments for the convergence of iterative methods on multiple graphs with various parameter settings.
In conclusion I will outline some difficulties encountered while developing these applications and address possible solutions
and future research directions.
3. Outline
• Introduction
– Computational science and information retrieval
• Spectral clustering and graph partitioning
– Spectral clustering
– Flow refinement
– Bi-partite spectral and advertiser-term clustering
• Web graph link analysis
– PageRank as linear system
– Krylov subspace methods
– Numerical experiments
• Parallel implementation
– Distributed matrices
– MPI, PETSc, etc
• Conclusion and future work
4. 1. Introduction
1.1. Computational science for information retrieval
• Multiple applications of numerical methods, no specialized algorithms
• Large scale problems
• Practical applications
Scientific Computing                        | Information Retrieval
--------------------------------------------|--------------------------------
Problem in continuum, governed by PDE       | Discrete data is given
Discretization for numerical solution,      | No control over problem size
control over resolution                     |
2D or 3D geometry                           | High-dimensional spaces
Uniform distribution of node degrees        | Power-law degree distribution
5. 1.2. Scientific Computing vs Information Retrieval graphs
FEM mesh for CFD simulations Artist-Artist similarity graph
7. 2.1. Graph partitioning
• Bisecting the graph: edge separator
• Good and balanced cut:
– Balanced partition
– “Natural” boundaries: partition = clustering
8. 2.2. Metrics - good cut
• Partitioning:
cut(V1, V2) = Σ_{i∈V1, j∈V2} e_ij ;  assoc(V1, V) = Σ_{i∈V1} d(v_i)
• Objective functions:
– Minimal cut:
MCut(V1, V2) = cut(V1, V2)
– Normalized cut:
NCut(V1, V2) = cut(V1, V2)/assoc(V1, V) + cut(V1, V2)/assoc(V2, V)
– Quotient cut:
QCut(V1, V2) = cut(V1, V2) / min(assoc(V1, V), assoc(V2, V))
9. 2.3. Graph cuts
• Let G = (V, E) be a graph, A(G) its adjacency matrix
• Let V = V+ ∪ V− be a partitioning of the nodes
• Let v = (+1, −1, +1, ..., −1, +1)^T be the indicator vector:
v(i) = +1 if node i ∈ V+;  v(i) = −1 if node i ∈ V−
• Compute the number of edges connecting V+ and V−:
cut(V+, V−) = (1/4) Σ_{e(i,j)} (v(i) − v(j))² = (1/4) v^T L v
• L = D − A (graph Laplacian)
• Minimal cut partitioning: remove the smallest number of edges
• Exact solution is NP-hard!
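The identity cut(V+, V−) = (1/4) v^T L v can be checked on a toy example (the 5-node path graph and the partition {0,1} vs {2,3,4} below are illustrative choices, not from the talk):

```python
import numpy as np

# Toy check of cut(V+, V-) = (1/4) v^T L v on a 5-node path graph.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1       # path edges 0-1-2-3-4

L = np.diag(A.sum(axis=1)) - A          # Laplacian L = D - A
v = np.array([-1, -1, 1, 1, 1])         # indicator vector

cut_quadratic = 0.25 * v @ L @ v
cut_direct = sum(A[i, j] for i in range(5) for j in range(i + 1, 5)
                 if v[i] != v[j])       # count crossing edges directly

print(cut_quadratic, cut_direct)        # both are 1.0: the single edge 1-2
```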
10. 2.4. Spectral method - motivation (from Physics)
• Linear graph - 5 nodes:
1   2   3   4   5
x---x---x---x---x
• Energy of the system:
E = (1/2) m Σ_i ẋ(i)² + (1/2) k Σ_{i,j} (x(i) − x(j))²
• Equations of motion:
M d²x/dt² = −k L x
• Laplacian matrix (5×5):
        [  1  −1             ]
        [ −1   2  −1         ]
L   =   [     −1   2  −1     ]
        [         −1   2  −1 ]
        [             −1   1 ]
11. 2.5. Spectral method - motivation (from Physics)
• Eigenproblem:
L x = λ x
• The second lowest mode, λ2 = ω2², bisects the string into two equal-sized
components
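This can be checked numerically with the 5×5 string Laplacian above (a minimal sketch):

```python
import numpy as np

# The second eigenvector of the 5-node string Laplacian is monotone along
# the string and crosses zero at the middle node, bisecting it.
L = np.array([[ 1, -1,  0,  0,  0],
              [-1,  2, -1,  0,  0],
              [ 0, -1,  2, -1,  0],
              [ 0,  0, -1,  2, -1],
              [ 0,  0,  0, -1,  1]], dtype=float)

w, V = np.linalg.eigh(L)     # eigenvalues in ascending order
fiedler = V[:, 1]            # eigenvector of the second smallest eigenvalue

print(w[1], fiedler)
```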
12. 2.6. Spectral method - relaxation
• Discrete problem → continuous problem
• Discrete problem: find
min (1/4) v^T L v
subject to v(i) = ±1, Σ_i v(i) = 0
• Relaxation - continuous problem: find
min (1/4) x^T L x
subject to Σ_i x(i)² = N, Σ_i x(i) = 0
• Exact constraints satisfy the relaxed ones, but not the other way around!
• Given x(i), round by v(i) = sign(x(i))
13. 2.7. Spectral method - computations
• Constrained optimization problem (Lagrangian):
Q(x) = (1/4) x^T L x − λ(x^T x − N)
• Additional constraint: x ⊥ x1 = e = (1, 1, ..., 1)^T
• Minimization:
min_{x ⊥ x1} (1/4) (x^T L x)/(x^T x)
• Courant-Fischer minimax theorem:
L x = λ x
Look for the second smallest eigenvalue λ2 and its eigenvector x2
14. 2.8. Family of spectral methods
• Ratio cut:
RCut(V1, V2) = cut(V1, V2)/|V1| + cut(V1, V2)/|V2|
(D − A) x = λ x
• Normalized cut:
NCut(V1, V2) = cut(V1, V2)/assoc(V1, V) + cut(V1, V2)/assoc(V2, V)
NCut(V1, V2) = 2 − ( assoc(V1, V1)/assoc(V1, V) + assoc(V2, V2)/assoc(V2, V) )
(D − A) x = λ D x
15. 2.9. Spectral partitioning algorithm
Algorithm 1
Compute the eigenvector v2 corresponding to λ2 of L(G)
for all nodes n in G do
  if v2(n) < 0 then
    put node n in partition V−
  else
    put node n in partition V+
  end if
end for
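Algorithm 1 can be sketched in a few lines of dense-matrix Python (the two-triangles-plus-bridge graph is an illustrative example; real implementations use sparse eigensolvers such as Lanczos):

```python
import numpy as np

# Spectral bisection: split nodes by the sign of the Fiedler vector of L.
def spectral_bisect(A):
    L = np.diag(A.sum(axis=1)) - A        # Laplacian L = D - A
    w, V = np.linalg.eigh(L)
    v2 = V[:, 1]                          # Fiedler vector (second smallest)
    return np.where(v2 < 0)[0], np.where(v2 >= 0)[0]

# Two 3-node triangles joined by a single bridge edge (illustrative graph)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2),      # triangle 1
             (3, 4), (4, 5), (3, 5),      # triangle 2
             (2, 3)]:                     # bridge edge
    A[i, j] = A[j, i] = 1

Vm, Vp = spectral_bisect(A)
print(sorted(Vm), sorted(Vp))   # the two triangles, split at the bridge
```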
16. 2.10. Spectral ordering algorithm
Algorithm 2
Compute the eigenvector v2 corresponding to λ2 of L(G)
Sort nodes n according to v2(n)
• Permute rows and columns of A according to the “new” ordering
• Since Σ_{e(i,j)} (v(i) − v(j))² is minimized ⇒
there are few edges connecting distant v(i) and v(j)
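A minimal sketch of this ordering effect (graph size and seed are illustrative): a randomly relabeled path graph becomes tridiagonal again after sorting nodes by their Fiedler-vector values.

```python
import numpy as np

# Spectral ordering: sort nodes on the Fiedler vector, permute A accordingly.
rng = np.random.default_rng(0)
n = 8
relabel = rng.permutation(n)

A = np.zeros((n, n))
for i in range(n - 1):                   # path graph with scrambled labels
    a, b = relabel[i], relabel[i + 1]
    A[a, b] = A[b, a] = 1

L = np.diag(A.sum(axis=1)) - A
w, V = np.linalg.eigh(L)
order = np.argsort(V[:, 1])              # sort nodes by v2(n)
A_perm = A[np.ix_(order, order)]         # permute rows and columns

# For a path, the reordered matrix is tridiagonal: no distant edges remain.
bandwidth = max(abs(i - j) for i in range(n) for j in range(n) if A_perm[i, j])
print(bandwidth)   # 1
```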
20. 2.14. Flow refinement
Set up and solve a minimum S-T cut problem:
• Divide nodes into 3 sets according to the embedding ordering
• Set up an s-t max-flow problem with one set of nodes pinned to the source
and another to the sink via infinite-capacity links
• Solve to obtain the S-T min cut (max-flow min-cut theorem: find the
saturated frontier)
• Move the partition
21. 2.15. Flow refinement
cut(A,B)=171 cut(A,B)=70
QCut=0.0108 QCut=0.0053
NCut=0.0206 NCut=0.0088
part size=1433 part size=1195
22. 2.16. Flow refinement
cut(A,B)=11605 cut(A,B)=36688
QCut=0.242 QCut=0.160
NCut=0.267 NCut=0.296
part size=266 part size=1103
23. 2.17. Recursive spectral
• tree → flat clusters
25. 2.19. Data: Advertiser - bidded term data
[Figure: advertiser-term matrix A - bidded terms t_i vs advertisers a_j]
• Simultaneous clustering of advertisers and bidded terms (co-clustering)
• Bi-partite graph partitioning problem
26. 2.20. Bi-partite graph case
• Adjacency matrix for the bipartite graph:
        [ 0    A ]
Â   =   [ A^T  0 ]
• Eigensystem:
[ D1    −A ] [x]        [ D1  0  ] [x]
[ −A^T  D2 ] [y]  =  λ  [ 0   D2 ] [y]
• Normalization:
An = D1^{−1/2} A D2^{−1/2}
An v = (1 − λ) u
An^T u = (1 − λ) v
• SVD decomposition:
An = u σ v^T,  σ = 1 − λ
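A sketch of this co-clustering on a toy matrix (the block values and the 0.1 coupling entries are illustrative, not talk data):

```python
import numpy as np

# Co-cluster a toy 4x4 advertiser-term matrix: two dense blocks, weakly
# coupled. Normalize An = D1^{-1/2} A D2^{-1/2}, take the second singular
# vectors, split both sides by sign.
A = np.array([[3.0, 2.0, 0.1, 0.0],
              [2.0, 3.0, 0.0, 0.1],
              [0.1, 0.0, 3.0, 2.0],
              [0.0, 0.1, 2.0, 3.0]])

D1 = np.diag(1.0 / np.sqrt(A.sum(axis=1)))   # D1^{-1/2}, advertiser degrees
D2 = np.diag(1.0 / np.sqrt(A.sum(axis=0)))   # D2^{-1/2}, term degrees
An = D1 @ A @ D2                             # normalized matrix

U, s, Vt = np.linalg.svd(An)
u2, v2 = U[:, 1], Vt[1, :]    # second singular pair

adv_cluster = u2 < 0          # sign splits advertisers...
term_cluster = v2 < 0         # ...and terms simultaneously
print(s[:2], adv_cluster, term_cluster)
```

The sign split recovers the two blocks on both sides of the bipartite graph at once, which is the co-clustering property used for the advertiser-term data.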
29. 2.23. Computational considerations
• Large and very sparse matrices
• Only top few eigenvectors needed
• Precision requirements low
• Iterative Krylov subspace methods, Lanczos and Arnoldi algorithms
• Only matrix-vector multiply
31. 3.1. PageRank model
• Random walk on the graph
• Markov process: memoryless, homogeneous
• Stationary distribution: existence, uniqueness, convergence
• Perron-Frobenius theorem: irreducible (every state is reachable from every
other) and aperiodic (no cycles)
32. 3.2. PageRank model
• Construct probability matrix:
P = D^{−1} A,  D = diagonal matrix of out-degrees
• Construct transition matrix for the Markov process (row-stochastic):
P′ = P + (d v^T)   (d: dangling-node indicator, v: teleportation vector)
• Correct reducibility (make the chain irreducible):
P′′ = c P′ + (1 − c)(e v^T)
• The Markov chain stationary distribution exists and is unique
(Perron-Frobenius):
P′′^T p = λ p
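The construction above in a few lines of Python (the 4-node graph, c = 0.85, and the uniform teleportation vector are illustrative choices):

```python
import numpy as np

# PageRank transition-matrix construction and power iteration on a toy graph.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)   # node 3 is dangling
n = len(A)
c = 0.85
v = np.full(n, 1.0 / n)                      # teleportation vector

row_sums = A.sum(axis=1)
d = (row_sums == 0).astype(float)            # dangling-node indicator
P = A / np.where(row_sums > 0, row_sums, 1.0)[:, None]   # P = D^{-1} A
P1 = P + np.outer(d, v)                      # fix dangling rows: row-stochastic
P2 = c * P1 + (1 - c) * np.outer(np.ones(n), v)          # irreducible

p = np.full(n, 1.0 / n)                      # power iteration on P2^T p = p
for _ in range(200):
    p = P2.T @ p
print(p)
```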
33. 3.3. Linear system formulation
• PageRank equation:
(cP + c(d v^T) + (1 − c)(e v^T))^T x = λ x
• Normalization:
e^T x = x^T e = ||x||_1,  λ1 = 1
• Identity:
d^T x = ||x||_1 − ||P^T x||_1
• Linear system:
(I − c P^T) x = v (||x||_1 − c ||P^T x||_1)
34. 3.4. Linear System vs Eigensystem
Eigensystem                              | Linear system
-----------------------------------------|---------------------------------
P′′^T p = λ p                            | (I − c P^T) x = k(x) v
P′′ = cP + c(d v^T) + (1 − c)(e v^T)     | k(x) = ||x||_1 − c ||P^T x||_1
λ = 1                                    | p = x / ||x||_1
• Iteration matrices P′′ and I − cP^T have different rates of convergence
• Vector v: on the rhs or inside the matrix
• More methods available for a linear system
• Solution is linear with respect to v
38. 3.8. Krylov subspace methods
• Linear system:
A x = b,  A = I − c P^T,  b = k v
• Residual:
r = b − A x
• Krylov subspace:
K_m = span{r, A r, A² r, ..., A^{m−1} r}
• x_m is built from x_0 + K_m:  x_m = x_0 + q_{m−1}(A) r_0
• Only matrix-vector products
• Explicit minimization in the subspace, extra information for the next step
39. 3.9. Krylov subspace methods
• Generalized Minimal Residual (GMRES):
pick x_n ∈ K_n such that ||b − A x_n|| is minimal, i.e. r_n ⊥ A K_n
• Biconjugate Gradient (BiCG):
pick x_n ∈ K_n such that r_n ⊥ span{w, A^T w, ..., (A^T)^{n−1} w}
• Biconjugate Gradient Stabilized (BiCGSTAB)
• Quasi-Minimal Residual (QMR)
• Conjugate Gradient Squared (CGS)
• Chebyshev iterations
Preconditioners
• Convergence depends on cond(A) = λmax/λmin
• Preconditioner M: solve M^{−1}A x = M^{−1}b
• Iterating with M^{−1}A gives a better condition number
• Diagonal (Jacobi) preconditioner: M = D
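The whole pipeline can be sketched with SciPy's Krylov solvers (the random sparse graph, c = 0.85, uniform v, and the Jacobi preconditioner are illustrative; self-loops are added so there are no dangling nodes, in which case the right-hand side reduces to (1 − c)v):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Solve the PageRank linear system (I - c P^T) x = (1 - c) v with BiCGSTAB
# and a diagonal (Jacobi) preconditioner.
n = 200
A = sp.random(n, n, density=0.02, format="csr", random_state=1) + sp.eye(n)
out = np.asarray(A.sum(axis=1)).ravel()
P = sp.diags(1.0 / out) @ A                   # row-stochastic P = D^{-1} A

c = 0.85
v = np.full(n, 1.0 / n)
A_sys = (sp.eye(n) - c * P.T).tocsr()         # system matrix I - c P^T
b = (1 - c) * v

diag = A_sys.diagonal()                       # Jacobi preconditioner M = D
M = spla.LinearOperator((n, n), matvec=lambda y: y / diag)

x, info = spla.bicgstab(A_sys, b, M=M)        # info == 0 on convergence
p = x / np.abs(x).sum()                       # PageRank vector p = x / ||x||_1
print(info, p[:3])
```

Only matrix-vector products with A_sys (and the cheap diagonal solve) are needed, which is what makes these methods attractive for large sparse web graphs.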
50. 4.1. Matrix-Vector multiply
• Iterative process
Ax→x
• Every process “owns” several rows of the matrix
• Every process “owns” corresponding part of the vector
• Communications required for multiplication
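A serial simulation of this layout (sizes and process count are illustrative; a real implementation would exchange vector pieces with MPI/PETSc):

```python
import numpy as np

# Each "process" owns a block of consecutive rows of A; multiplying its rows
# by the full (communicated) vector yields its slice of the product.
rng = np.random.default_rng(2)
n, nprocs = 12, 3
A = rng.random((n, n))
x = rng.random(n)

rows = n // nprocs
blocks = [slice(p * rows, (p + 1) * rows) for p in range(nprocs)]

y_parts = [A[blk, :] @ x for blk in blocks]   # local multiply per "process"
y = np.concatenate(y_parts)                   # gather the result

print(np.allclose(y, A @ x))   # True
```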
51. 4.2. Distributed matrices
• Computing:
– Load balancing: equal number of non-zeros per processor
– Minimize communications: smallest number of off-processor elements
• Storage:
– Number of non-zeros per processor
– Number of rows per processor
52. 4.3. Practical data distribution
• Balanced graph partitioning
– Exact - NP hard
– Approximate: multi-resolution, spectral, geometric
• Practical solution
– Sort graph in lexicographic order
– Fill processors consecutively by rows, adding rows until
w_rows n_p + w_nnz nnz_p > (w_rows n + w_nnz nnz)/p
with w_rows : w_nnz = 1/1, 2/1, 4/1
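The fill rule can be sketched as follows (the skewed nnz-per-row profile and the 1/1 weights are illustrative):

```python
# Greedy row distribution: fill each processor with consecutive rows until
# its weighted load exceeds the per-processor share of the total.

def distribute_rows(nnz_per_row, nprocs, w_rows=1.0, w_nnz=1.0):
    n = len(nnz_per_row)
    share = (w_rows * n + w_nnz * sum(nnz_per_row)) / nprocs
    parts, current, load = [], [], 0.0
    for row, nnz in enumerate(nnz_per_row):
        current.append(row)
        load += w_rows + w_nnz * nnz
        if load > share and len(parts) < nprocs - 1:
            parts.append(current)      # this processor is full
            current, load = [], 0.0
    parts.append(current)              # remaining rows to the last processor
    return parts

# A few heavy (power-law-like) rows followed by many light ones
nnz = [50, 30, 20, 5, 4, 3, 3, 2, 2, 1, 1, 1]
print(distribute_rows(nnz, 3))
```

Note how the heavy head rows occupy whole processors while the long light tail is packed together, which balances non-zeros rather than row counts.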
53. 4.4. Data distribution schemes
[Plot: time (s) vs. # of processors (5-30) for the data distribution schemes
"std parallelization and distribution", "nrows", and "smart"]
60. 5. Conclusions
• Eigenvalues everywhere! Linear algebra methods provide provably good
solutions to many problems, and the methods are very general.
• Power-law graphs with high variance in node degrees present challenges to
high-performance parallel computing
• Skewed distributions, chains, a central core, and singletons make clustering
of power-law data a difficult problem
• Embedding in 1D is probably not sufficient for this type of data; higher
dimensions are needed
61. 5.1. References
• Collaborators:
– Kevin Lang, Pavel Berkhin
– David Gleich and Matt Rasmussen
• Publications:
– “Fast Parallel PageRank: A Linear System Approach”, 2004
– “Spectral Clustering of Large Advertiser Datasets”, 2003
– “Clustering of bipartite advertiser-keyword graph”, 2002
• References:
– Spectral graph partitioning:
M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995),
D. Spielman (1996), F. Chang (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000),
I. Dhillon (2001), A. Ng (2001), H. Zha (2001), C. Ding (2001)
– PageRank computing:
S. Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala
(2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)