The Web as a Graph
Pages as graph nodes, hyperlinks as edges.
– Sometimes sites are taken as the nodes
Some natural questions:
1. Distribution of the number of in-links to a page.
2. Distribution of the number of out-links from a page.
3. Distribution of the number of pages in a site.
4. Connectivity: is it possible to reach most pages from most
pages?
5. Is there a theoretical model that fits the graph?
Mathematical Background:
Power-Law Distributions
• A non-negative random variable X is said to have a Power-Law
distribution if, for some constants c>0 and α>0:
$\Pr[X > x] \sim c\,x^{-\alpha}$, or equivalently $f(x) \sim \alpha c\,x^{-(\alpha+1)}$
• Taking logs of both sides, we have:
$\log \Pr[X > x] \approx -\alpha \log x + \log c$
• Power-Law distributions have “heavy/long tails”, i.e. a significant
share of the probability mass lies on values far from the
mean or median of the distribution
– Unlike Normal or Geometric/Exponential distributions, where the probability
mass of the tail decreases at least exponentially in x, in Power-Law distributions
the mass of the tail decreases only polynomially, as $x^{-\alpha}$
– Another point of view: in an Exponential distribution, f(x)/f(x+k) does not depend on x,
whereas in a Power-Law distribution, f(x)/f(kx) does not depend on x.
– The “average” quantity in a Power-Law distribution is not “typical”
• Examples of Power-Law distributions are Pareto and Zipf
distributions (see next slides)
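The shift-versus-scale contrast above can be checked numerically. The following is a minimal sketch, not part of the original slides; the rate `lam`, the exponent `alpha`, the factor `k`, and the sample points are arbitrary illustrative choices, and the power-law density is left unnormalized.

```python
import math

def exp_density(x, lam=1.0):
    """Density of an Exponential(lam) variable at x >= 0."""
    return lam * math.exp(-lam * x)

def power_law_density(x, alpha=1.5):
    """Unnormalized power-law density x^-(alpha+1) at x > 0."""
    return x ** -(alpha + 1)

k = 2.0                  # illustrative shift / scale factor (assumption)
xs = [1.0, 10.0, 100.0]  # illustrative sample points (assumption)

# Exponential: shifting x by k changes the density by the same factor, e^(lam*k),
# wherever x is.
exp_shift_ratios = [exp_density(x) / exp_density(x + k) for x in xs]

# Power law: multiplying x by k changes the density by the same factor,
# k^(alpha+1), wherever x is.
pl_scale_ratios = [power_law_density(x) / power_law_density(k * x) for x in xs]

print(exp_shift_ratios)
print(pl_scale_ratios)
```

Each list should contain (up to rounding) one repeated value, which is the sense in which the exponential is invariant to shifts and the power law to rescaling.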
Mathematical Background:
The Pareto Distribution
• A continuous, positive random variable X in the range
[L, ∞) is said to be distributed Pareto(L,k) if its probability
density function is:
$f(x; k, L) = k L^{k} / x^{k+1}$ for $x \ge L$
• This implies that $\Pr(X > x) = (L/x)^{k}$ for $x \ge L$
– Has a finite expectation, $kL/(k-1)$, only for k>1
– Has finite variance only for k>2
• Named after the Italian economist Vilfredo Pareto (1848-1923),
who used it to model the distribution of wealth in
society
– Most people have little income; 20% of society holds 80% of the
wealth
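As a sanity check on the Pareto formulas above, here is a small sketch that draws samples by inverse-transform sampling and compares the empirical mean and tail with the closed forms. The function names, the seed, and the parameter values (L=1, k=3) are illustrative assumptions, not taken from the slides.

```python
import math
import random

def pareto_survival(x, L, k):
    """Prob(X > x) for X ~ Pareto(L, k)."""
    return 1.0 if x < L else (L / x) ** k

def pareto_mean(L, k):
    """Expectation of Pareto(L, k); infinite when k <= 1."""
    return math.inf if k <= 1 else k * L / (k - 1)

def sample_pareto(L, k, rng):
    """Inverse-transform sample: solve (L/x)^k = u for x, with u uniform on (0, 1]."""
    u = 1.0 - rng.random()
    return L / u ** (1.0 / k)

rng = random.Random(0)           # fixed seed so the run is repeatable
L, k, n = 1.0, 3.0, 200_000      # k > 2, so both mean and variance are finite
samples = [sample_pareto(L, k, rng) for _ in range(n)]

empirical_mean = sum(samples) / n
empirical_tail = sum(1 for s in samples if s > 2.0) / n

print(empirical_mean, pareto_mean(L, k))           # both should be near 1.5
print(empirical_tail, pareto_survival(2.0, L, k))  # both should be near 0.125
```

Rerunning with k closer to 1 makes the sample mean far less stable, which illustrates why the “average” is not “typical” for heavy-tailed data.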
Mathematical Background:
Zipf’s Law
• A random variable X follows Zipf’s Law (is “Zipfian”) with
parameter α when the j’th most popular value of X occurs
with probability proportional to $j^{-\alpha}$
– Essentially the distribution is over the discrete ranks
• Whenever α>1, X may take an infinite number of values (i.e.
have infinitely many different value popularities), since $\sum_j j^{-\alpha}$
then converges; for α≤1 the number of values must be finite
• Named after the American linguist George Kingsley Zipf
(1902-1950), who observed it on the frequencies of words
in the English language
– On a large corpus of English text, the 135 most frequently occurring
words accounted for half of the text
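To get a feel for how concentrated a Zipfian distribution is, the sketch below builds the probabilities over a finite set of ranks and counts how many top ranks are needed to cover half of the mass. The vocabulary size of 100,000 and α=1 are assumptions chosen for illustration; they are not the corpus behind the 135-word figure above.

```python
def zipf_probabilities(n_values, alpha):
    """Probability of the j-th most popular value, proportional to j^-alpha."""
    weights = [j ** -alpha for j in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def ranks_covering(probs, fraction):
    """Smallest number of top ranks whose total probability reaches `fraction`."""
    cumulative = 0.0
    for rank, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= fraction:
            return rank
    return len(probs)

# Hypothetical vocabulary size and exponent, for illustration only.
probs = zipf_probabilities(n_values=100_000, alpha=1.0)
top = ranks_covering(probs, 0.5)

print(top, "of", len(probs), "ranks cover half of the probability mass")
```

With these settings only a small fraction of a percent of the ranks carries half of the probability, the same qualitative effect Zipf observed for word frequencies.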
Mathematical Background: An Observed
Zipfian Sample Implies a Power-Law
The following analysis is due to Lada Adamic:
• Assume that N units of wealth (coins) are distributed to M
individuals
– There are N observations of a random variable Y that can take on
the discrete values 1,2,…,M
• $Y_k = j$ (k=1,…,N, j=1,…,M) means that person j got coin k
– Denote by $X_1$ [$X_M$] the number of coins of the richest [poorest]
individual at the end of the process
• For simplicity, assume that N>>M and the $X_j$’s are all distinct
• Assume that a perfect Zipfian behavior is observed, i.e. $X_r/N \sim r^{-b}$
for all r=1,…,M
– This trivially implies $X_r \sim r^{-b}$
Mathematical Background: An Observed
Zipfian Sample Implies a Power-Law (cont.)
• Recap: we distributed N coins to M individuals, and
denoted by $X_1$ [$X_M$] the number of coins of the
richest [poorest] individual at the end of the process
• By assuming Zipfian wealth: $X_r \sim r^{-b}$, or $X_r = c\,r^{-b}$
• Let Z be the random variable of a person’s wealth, i.e. the
number of coins a person gets by this process
• Observation: if the r’th richest person got $X_r$ coins, then
exactly r people out of M got $X_r$ coins or more
• $\Pr[Z \ge X_r] = \Pr[Z \ge c\,r^{-b}] = r/M$
• Define $y = c\,r^{-b}$, so that $r = (y/c)^{-1/b}$, and therefore
$\Pr[Z \ge y] = y^{-1/b}\, c^{1/b} / M$
• Hence $\Pr[Z \ge y] \sim y^{-1/b}$, and Z obeys a Power-Law
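The rank-to-tail argument can be verified numerically. The sketch below assigns the r’th richest of M people a wealth of exactly c·r^(-b) and compares the fraction of people holding at least that much with the closed form y^(-1/b)·c^(1/b)/M. The values of M, c, and b are arbitrary assumptions, and wealth is treated as a real number rather than a whole count of coins.

```python
import math

def zipfian_wealth(M, c, b):
    """Wealth of the r-th richest person, X_r = c * r^(-b), for r = 1..M."""
    return [c * r ** -b for r in range(1, M + 1)]

def empirical_tail(wealth, y):
    """Fraction of people whose wealth is at least y, i.e. Pr[Z >= y]."""
    return sum(1 for w in wealth if w >= y) / len(wealth)

def predicted_tail(y, M, c, b):
    """Closed form from the derivation: y^(-1/b) * c^(1/b) / M."""
    return y ** (-1.0 / b) * c ** (1.0 / b) / M

M, c, b = 1000, 1.0e6, 0.8   # illustrative values (assumptions)
wealth = zipfian_wealth(M, c, b)

for r in (1, 10, 100, 1000):
    y = wealth[r - 1]
    print(r, empirical_tail(wealth, y), predicted_tail(y, M, c, b))
```

Both columns should agree with r/M, and on a log-log plot of the tail against y the points fall on a line of slope -1/b.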
Distribution of Inlinks
* Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.
A plot of the number of nodes having
each value of in-degree
Both axes are in log-scale
Denoting the size of the sample crawl
by N (over 200M here), we have:
$\log(N \cdot \Pr[\text{node has in-degree } x]) \approx -a \log x + c$
$\log(\Pr[\text{node has in-degree } x]) \approx -a \log x + c'$, where $c' = c - \log N$
Which indicates the Power-Law
$\Pr[\text{node has in-degree } x] \sim x^{-a}$
Note that the number of nodes with small in-degree is over-estimated while the
number of nodes with very high in-degree is under-estimated
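Reading an exponent off a log-log plot amounts to fitting a straight line to (log x, log count). The sketch below builds an in-degree histogram from an edge list and fits such a line by ordinary least squares. The toy edge list and the idealized counts (exactly proportional to x^(-2.1)) are synthetic, made up for illustration, and are not the crawl data from the figure. Least squares on a log-log histogram is a quick visual check rather than a rigorous estimator.

```python
import math
from collections import Counter

def in_degree_histogram(edges):
    """Map each in-degree value to the number of nodes that have it.

    Nodes with in-degree 0 never appear as a destination and are not counted;
    they could not be placed on a log-scale axis anyway.
    """
    in_degree = Counter(dst for _src, dst in edges)
    return Counter(in_degree.values())

def fit_loglog(points):
    """Least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x, _ in points]
    ly = [math.log(y) for _, y in points]
    n = len(points)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy edge list (source page, destination page), purely illustrative.
toy_edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "d")]
toy_hist = in_degree_histogram(toy_edges)
print(dict(toy_hist))

# Idealized synthetic counts lying exactly on a power law with exponent 2.1.
ideal = [(x, 1.0e6 * x ** -2.1) for x in range(1, 101)]
slope, intercept = fit_loglog(ideal)
print(slope, intercept)   # slope plays the role of -a, intercept the role of c
```

On real crawl data the points scatter around the line, especially at very high degrees where few nodes are observed.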
More Power-Laws on the Web
We’ve seen that the in-degree of pages exhibits a Power-Law.
Furthermore:
• Out-degree (somewhat surprising)
• Degrees of the inter-host graph
• Number of pages in Web sites
• Number of visits to Web sites/pages
• PageRank scores
– With an exponent very close to that of the in-degree distribution (about 2.1)
– Curiously, degrees in the telephone call graph have the same exponent
of 2.1
• Frequencies of words (as observed by Zipf)
• Popularities of queries submitted to search engines (will be discussed
later in the course)
The Web as a Graph
Connectivity: is it possible to reach most pages from
most pages?
The Web is a bow-tie!
The Web graph is also scale-free and fractal: many slices and subgraphs
exhibit similar properties.
Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.
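The bow-tie picture can be reproduced on a small directed graph with two breadth-first searches from a node in the central core: pages reachable from it and able to reach it form the strongly connected core (SCC), pages that can only reach the core form IN, and pages only reachable from the core form OUT. The sketch below assumes the caller already knows a `seed` node inside the core; the edge list and node names are a hypothetical toy example.

```python
from collections import defaultdict, deque

def reachable(adj, start):
    """Set of nodes reachable from `start` by following edges in `adj` (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges, seed):
    """Split nodes into SCC / IN / OUT / OTHER relative to the SCC containing `seed`."""
    fwd, bwd, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
        nodes.update((u, v))
    forward = reachable(fwd, seed)    # pages the seed can reach
    backward = reachable(bwd, seed)   # pages that can reach the seed
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,  # tendrils and disconnected pieces
    }

# Hypothetical toy Web: a 3-page core, one IN page, one OUT page, and leftovers.
toy_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),   # core cycle
    ("i", "a"),                           # IN page linking into the core
    ("c", "o"),                           # OUT page linked from the core
    ("i", "t"),                           # tendril hanging off IN
    ("d1", "d2"),                         # disconnected component
]
parts = bow_tie(toy_edges, seed="a")
for name, members in parts.items():
    print(name, sorted(members))
```

On a real crawl the seed would be picked from the largest strongly connected component, and the relative sizes of the four groups give the proportions of the bow-tie.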
Self-Similarity on the Web
Dill et al., ACM TOIT 2002
• Created large Thematically Unified Clusters (TUCs), e.g.
– Pages containing a certain keyword
– Pages of large Web sites/Intranets
– Pages containing a geographical reference in the Western US
– The host graph
• In general, the TUCs display very similar graph properties,
e.g.
– In/out degree distributions
– Bow-tie structure (relative sizes of the components)
• The SCCs (strongly connected components) of the different TUCs
were also found to be strongly connected to one another, i.e. it is
possible to browse between the TUCs