Web Graph Characteristics
Kira Radinsky
The Web as a Graph
Pages as graph nodes, hyperlinks as edges.
– Sometimes sites are taken as the nodes
Some natural questions:
1. Distribution of the number of in-links to a page.
2. Distribution of the number of out-links from a page.
3. Distribution of the number of pages in a site.
4. Connectivity: is it possible to reach most pages from most
pages?
5. Is there a theoretical model that fits the graph?
Mathematical Background:
Power-Law Distributions
• A non-negative random variable X is said to have a Power-Law
distribution if, for some constants c>0 and α>0:
$\Pr[X > x] \sim c\,x^{-\alpha}$, or equivalently $f(x) \sim x^{-(\alpha+1)}$
• Taking logs of both sides, we have:
$\log \Pr[X > x] \approx -\alpha \log(x) + \log c$
• Power-Law distributions have “heavy/long tails”, i.e. the
probability mass of events whose value is far from the
mean or median of the distribution is significant
– Unlike Normal or Geometric/Exponential distributions, where the probability
mass of the tail decreases at least exponentially in x, in Power-Law distributions the mass
of the tail decreases only polynomially, as $x^{-\alpha}$
– Another point of view: in an Exponential distribution, f(x)/f(x+k) is constant,
whereas in a Power-Law distribution, f(x)/f(kx) is constant.
– The “average” quantity in a Power-Law distribution is not “typical”
• Examples of Power-Law distributions are Pareto and Zipf
distributions (see next slides)
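
The "another point of view" bullet can be checked numerically. The following is a minimal sketch, assuming an unnormalized density $x^{-(\alpha+1)}$ and illustrative parameter values (α = 1.1, rate 0.5, k = 2) that do not come from the slides; the function names are hypothetical.

```python
import math

def power_law_density(x, alpha):
    """Unnormalized power-law density, proportional to x^-(alpha+1)."""
    return x ** -(alpha + 1)

def exponential_density(x, rate):
    """Exponential density rate * exp(-rate * x)."""
    return rate * math.exp(-rate * x)

def scale_ratio(x, k, alpha):
    """f(x) / f(k*x) for the power law; it should not depend on x."""
    return power_law_density(x, alpha) / power_law_density(k * x, alpha)

def shift_ratio(x, k, rate):
    """f(x) / f(x+k) for the exponential; it should not depend on x."""
    return exponential_density(x, rate) / exponential_density(x + k, rate)

if __name__ == "__main__":
    # Illustrative values only: alpha = 1.1, rate = 0.5, k = 2.
    for x in (1.0, 10.0, 100.0):
        print(x, scale_ratio(x, 2.0, 1.1), shift_ratio(x, 2.0, 0.5))
```

The first ratio equals $k^{\alpha+1}$ and the second equals $e^{\text{rate}\cdot k}$ at every x, which is the sense in which a power law is invariant to rescaling while an exponential is invariant to shifting.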
Mathematical Background:
The Pareto Distribution
• A continuous, positive random variable X in the range
$[L, \infty)$ is said to be distributed Pareto(L,k) if its probability
density function is:
$f(x; k, L) = k L^{k} / x^{k+1}$
• This implies that $\Pr(X > x) = (L/x)^{k}$ for $x \ge L$
– Has a finite expectation of $Lk/(k-1)$ only for k>1
– Has finite variance only for k>2
• Named after the Italian economist Vilfredo Pareto (1848-1923),
who used it to model the distribution of wealth in
society
– Most people have little income; 20% of society holds 80% of the
wealth
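
The Pareto formulas above can be sanity-checked by simulation. This is a sketch under stated assumptions: samples are drawn by inverse-transform sampling (a standard technique, not something the slides prescribe), the helper names are hypothetical, and `top_share` uses the Lorenz-curve result that the richest fraction p holds a share $p^{1-1/k}$ of the total when k>1.

```python
import math
import random

def sample_pareto(L, k, rng):
    """Inverse-transform sample: if U is uniform on (0, 1], L / U**(1/k) is Pareto(L, k)."""
    u = 1.0 - rng.random()  # rng.random() lies in [0, 1), so u lies in (0, 1]
    return L / u ** (1.0 / k)

def pareto_mean(L, k):
    """Expectation L*k/(k-1); infinite when k <= 1."""
    return L * k / (k - 1) if k > 1 else math.inf

def top_share(p, k):
    """Share of total wealth held by the richest fraction p (requires k > 1)."""
    return p ** (1.0 - 1.0 / k)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    samples = [sample_pareto(1.0, 3.0, rng) for _ in range(200_000)]
    print("sample mean:", sum(samples) / len(samples), "theory:", pareto_mean(1.0, 3.0))
    print("tail Pr(X > 2):", sum(s > 2.0 for s in samples) / len(samples), "theory:", 0.5 ** 3)
    # The 80/20 rule corresponds to k = log(5) / log(4), roughly 1.16.
    print("top 20% share:", top_share(0.2, math.log(5) / math.log(4)))
```

With k between 1 and 2, as in the 80/20 case, the mean exists but the variance does not, so sample averages converge slowly; the demo uses k = 3 for that reason.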
Mathematical Background:
Zipf’s Law
• A random variable X follows Zipf’s Law (is “Zipfian”) with
parameter α when the j’th most popular value of X occurs
with probability that is proportional to $j^{-\alpha}$
– Essentially the distribution is over the discrete ranks
• Whenever α>1, X may take an infinite number of values (i.e.
have infinitely many different value popularities), because the
normalizing sum $\sum_j j^{-\alpha}$ then converges
• Named after the American Linguist George Kingsley Zipf
(1902-1950), who observed it on the frequencies of words
in the English language
– On a large corpus of English text, the 135 most frequently occurring
words accounted for half of the text
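
A small sketch of a Zipfian distribution over a finite number of ranks may help here. The function names and the parameter choices (1000 ranks, α = 1 and α = 2) are assumptions made for illustration and are not taken from Zipf’s data.

```python
import math

def zipf_probabilities(n, alpha):
    """Probabilities over ranks 1..n, with rank j proportional to j^-alpha."""
    weights = [j ** -alpha for j in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def head_share(probs, m):
    """Total probability of the m most popular ranks."""
    return sum(probs[:m])

def partial_normalizer(n, alpha):
    """Partial sum of j^-alpha for j = 1..n; it stays bounded as n grows when alpha > 1."""
    return sum(j ** -alpha for j in range(1, n + 1))

if __name__ == "__main__":
    probs = zipf_probabilities(1000, 1.0)
    print("share of the top 10 of 1000 ranks:", head_share(probs, 10))
    # For alpha = 2 the partial sums approach pi^2 / 6.
    print(partial_normalizer(100_000, 2.0), math.pi ** 2 / 6)
```

The second print illustrates why α>1 permits infinitely many values: the normalizer stays finite however many ranks are added.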
Mathematical Background: An Observed
Zipfian Sample Implies a Power-Law
The following analysis is due to Lada Adamic:
• Assume that N units of wealth (coins) are distributed to M
individuals
– There are N observations of a random variable Y that can take on
the discrete values 1,2,…,M
• $Y_k = j$ (k=1,…,N, j=1,…,M) means that person j got coin k
– Denote by $X_1$ [$X_M$] the number of coins of the richest [poorest]
individual at the end of the process
• For simplicity, assume that N>>M and the $X_j$’s are all distinct
• Assume that a perfect Zipfian behavior is observed, i.e. $X_r/N \sim r^{-b}$
for all r=1,…,M
– This trivially implies $X_r \sim r^{-b}$
Mathematical Background: An Observed
Zipfian Sample Implies a Power-Law (cont.)
• Recap: we distributed N coins to M individuals, and
denoted by $X_1$ [$X_M$] the number of coins of the
richest [poorest] individual at the end of the process
• By assuming Zipfian wealth: $X_r \sim r^{-b}$, or $X_r = c\,r^{-b}$
• Let Z be the random variable of a person’s wealth, i.e. the
number of coins a person gets by this process
• Observation: if the r’th richest person got $X_r$ coins, then
exactly r people out of M got $X_r$ coins or more
• $\Pr[Z \ge X_r] = \Pr[Z \ge c\,r^{-b}] = r/M$
• Define $y = c\,r^{-b}$, and so $r = (y/c)^{-1/b}$, and so
$\Pr[Z \ge y] = y^{-1/b}\, c^{1/b} / M$
• Hence $\Pr[Z \ge y] \sim y^{-1/b}$, and Z obeys a Power-Law
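
The derivation can be verified numerically: build a perfectly Zipfian wealth profile, compute the tail probability r/M at each wealth level, and confirm that the log-log slope is -1/b. This is a sketch with hypothetical function names and illustrative values of c, b and M.

```python
import math

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def tail_exponent_from_ranks(c, b, M):
    """Log-log slope of Pr[Z >= y] against y when the r'th richest holds c * r^-b."""
    ranks = range(1, M + 1)
    log_wealth = [math.log(c * r ** -b) for r in ranks]
    # Exactly r of the M people hold at least as much as the r'th richest.
    log_tail = [math.log(r / M) for r in ranks]
    return least_squares_slope(log_wealth, log_tail)

if __name__ == "__main__":
    b = 0.8  # illustrative Zipf exponent
    print(tail_exponent_from_ranks(1e6, b, 1000), -1.0 / b)
```

Because the wealth profile is exactly $c\,r^{-b}$, the points lie on a straight line in log-log space and the fitted slope matches $-1/b$ up to floating-point error; real data would only approximate this.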
Distribution of Inlinks
* Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.
A plot of the number of nodes having
each value of in-degree
Both axes are in log-scale
Denoting the size of the sample crawl
by N (over 200M here), we have:
$\log(N \cdot \Pr[\text{node has in-degree } x]) \approx -a \log(x) + c$
$\log(\Pr[\text{node has in-degree } x]) \approx -a \log(x) + c'$
Which indicates the Power-Law
$\Pr[\text{node has in-degree } x] \sim x^{-a}$
Note that the number of nodes with small in-degree is over-estimated while the
number of nodes with very high in-degree is under-estimated
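
The plot described above can be reproduced from an edge list in two steps: count how many nodes have each in-degree, then fit a line to the counts on log-log axes. The sketch below uses hypothetical function names and a toy graph; a least-squares fit on a log-log plot is a crude estimator of the exponent and is used here only because it mirrors the slide.

```python
import math
from collections import Counter

def in_degree_histogram(edges, nodes):
    """Map each in-degree value to the number of nodes that have it."""
    in_degree = Counter(dst for _src, dst in edges)
    return Counter(in_degree[v] for v in nodes)

def log_log_slope(histogram):
    """Least-squares slope of log(count) against log(degree), skipping degree 0."""
    points = [(math.log(d), math.log(c)) for d, c in histogram.items() if d > 0 and c > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

if __name__ == "__main__":
    # Toy graph: node "c" has three in-links, "b" and "d" have one, "a" has none.
    edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "d")]
    print(dict(in_degree_histogram(edges, ["a", "b", "c", "d"])))
    # Synthetic counts that fall exactly on a line of slope -2 in log-log space.
    print(log_log_slope({1: 10000, 10: 100, 100: 1}))
```

The negated slope plays the role of the exponent a in the slide’s formula.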
More Power-Laws on the Web
We’ve seen that the in-degree of pages exhibits a Power-Law.
Furthermore:
• Out-degree (somewhat surprising)
• Degrees of the inter-host graph
• Number of pages in Web sites
• Number of visits to Web sites/pages
• PageRank scores
– With an exponent very close to that of the in-degree distribution
– Curiously, degrees in the telephone call graph have the same 2.1
exponent
• Frequencies of words (as observed by Zipf)
• Popularities of queries submitted to search engines (will be discussed
later in the course)
The Web as a Graph
Connectivity: is it possible to reach most pages from
most pages?
The Web is a bow-tie!
The Web graph is also scale-free, fractal: many slices and subgraphs exhibit similar properties.
Image taken from “Graph Structure in the Web”,
Broder et al., WWW’2000.
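
A bow-tie decomposition can be sketched for a small directed graph as follows. The component names SCC, IN and OUT follow the cited Broder et al. paper; lumping tendrils and disconnected pieces into a single "OTHER" bucket is a simplification made here, and the quadratic search for the largest SCC is only suitable for toy inputs.

```python
from collections import defaultdict, deque

def reachable(start, adjacency):
    """Set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Split nodes into the largest SCC, IN, OUT and everything else."""
    forward, backward = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
        nodes.update((u, v))
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "OTHER": set()}
    # The SCC of s is the set of nodes that s reaches and that also reach s.
    unassigned, core = set(nodes), set()
    while unassigned:
        s = next(iter(unassigned))
        scc = reachable(s, forward) & reachable(s, backward)
        unassigned -= scc
        if len(scc) > len(core):
            core = scc
    seed = next(iter(core))
    out_part = reachable(seed, forward) - core   # reachable from the core
    in_part = reachable(seed, backward) - core   # can reach the core
    other = nodes - core - in_part - out_part    # tendrils and disconnected pieces
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}

if __name__ == "__main__":
    toy = [(1, 2), (2, 3), (3, 1), (0, 1), (3, 4), (5, 6), (0, 7)]
    print(bow_tie(toy))
```

In the toy graph, nodes 1 to 3 form the core, node 0 links into it, node 4 is linked from it, and nodes 5, 6 and 7 fall outside the three main parts.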
Self-Similarity on the Web
Dill et al., ACM TOIT 2002
• Created large Thematically Unified Clusters (TUCs)
• Pages containing a certain keyword
• Pages of large Web sites/Intranets
• Pages containing a geographical reference in the Western US
• The host graph
• In general, the TUCs display very similar graph properties,
e.g.
• In/out degree distributions
• Bow-tie structure (relative sizes of the components)
• Also discovered that the SCCs of the different TUCs are
strongly connected to one another, i.e. it is possible to browse between
the TUCs

Tutorial 6 (web graph attributes)
