The Graph Structure of the Web - Aggregated by Pay-Level Domain
1. The Graph Structure of the Web - Aggregated by Pay-Level Domain
Oliver Lehmberg, Robert Meusel, Christian Bizer
Research Group Data and Web Science
2. General Knowledge about the Web Graph
• Broder et al.* in 2000:
– In- and Outdegree follow power laws
– There is a directed path between two pages in 25% of all cases
– The Web Graph has the bow-tie structure
Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain
Lehmberg/Meusel/Bizer
*A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web.
In WWW’00, pages 309–320. North-Holland Publishing Co, 2000.
3. Our Contributions
• R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure
in the web – revisited. WWW ’14, 2014.
– Analysis of the 2012 Web Graph on page level
• This presentation:
– Analysis of the same graph, aggregated by pay-level domain (PLD)
– Focus on inter-website connections
– No intra-website links
• Additionally:
– Interconnections between topical groups of websites
– Public Suffix aggregation
4. DATA SET
5. Web Data Commons Hyperlink Graph
• Page level: the largest hyperlink graph available to the public
– extracted from Common Crawl
– 3.5 billion nodes (web pages)
– 128 billion arcs (hyperlinks)
• Aggregated by pay-level domain
– 43 million nodes (websites)
– 623 million arcs (aggregated hyperlinks)
– Covers about 18% of the 240 million domains registered in the Web in 2012*
• Pay-level domain:
– dws.informatik.uni-mannheim.de → uni-mannheim.de
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
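The pay-level domain example and the aggregation step described above can be sketched roughly as follows. This is only an illustration, not the published extraction code: `PUBLIC_SUFFIXES` is a tiny hand-picked set standing in for the full Public Suffix List, and the function names `pay_level_domain` and `aggregate_by_pld` are hypothetical.

```python
from collections import Counter

# Assumption: a tiny hand-picked subset of public suffixes, for illustration only.
# A real run would load the complete Public Suffix List instead.
PUBLIC_SUFFIXES = {"com", "de", "org", "net", "uk", "co.uk"}


def pay_level_domain(host):
    """Return the public suffix plus one label, or None if no suffix matches."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so that "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def aggregate_by_pld(host_arcs):
    """Collapse host-level arcs into PLD-level arcs, dropping intra-website links."""
    counts = Counter()
    for src, dst in host_arcs:
        s, t = pay_level_domain(src), pay_level_domain(dst)
        if s is None or t is None or s == t:
            continue  # unknown suffix or a link inside one website
        counts[(s, t)] += 1
    return counts
```

Under these assumptions, `pay_level_domain("dws.informatik.uni-mannheim.de")` yields `"uni-mannheim.de"`, and each key of the returned counter is one aggregated arc between two websites.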
6. Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
• 4 aggregation levels:
Graph                      #Nodes        #Arcs           Size (zipped)
Page graph                 3.56 billion  128.73 billion  376 GB
Subdomain graph            101 million   2,043 million   10 GB
1st level subdomain graph  95 million    1,937 million   9.5 GB
PLD graph                  43 million    623 million     3.1 GB
• Extraction code is published under Apache License
– Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
12. In- and Outdegree – Power-Laws?
Power law:
$y \propto x^{-\gamma}$
Methodology:
• Clauset et al.*: maximum-likelihood fitting (plfit*²)
• Goodness-of-fit test
Indegree results:
• $x_0 = 3{,}062$
• $\gamma = 2.40$
• Cannot reject power law hypothesis
* Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
*² https://github.com/ntamas/plfit
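To make the fitting step more concrete, here is a minimal sketch of the maximum-likelihood estimate of the exponent for a fixed lower cutoff, together with the Kolmogorov-Smirnov distance that a goodness-of-fit test builds on. It uses the continuous estimator for simplicity and does not reproduce plfit: the scan over candidate cutoffs and the bootstrap that produces a p-value are omitted, and all function names are hypothetical.

```python
import math
import random


def fit_gamma(values, x_min):
    """Continuous maximum-likelihood estimate of the exponent for the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def ks_distance(values, x_min, gamma):
    """Largest gap between the empirical tail CDF and the fitted power-law CDF."""
    tail = sorted(x for x in values if x >= x_min)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / x_min) ** (1.0 - gamma)
        # Compare against the empirical CDF just before and just after x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def sample_power_law(n, x_min, gamma, rng):
    """Draw n synthetic values by inverse-transform sampling (test data only)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]
```

On synthetic data drawn with `sample_power_law`, `fit_gamma` should recover the exponent closely and `ks_distance` should be small. On real degree data, a large distance relative to what synthetic samples produce is the kind of evidence that leads to rejecting the power-law hypothesis.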
13. In- and Outdegree – Power-Laws?
Outdegree results:
• $x_0 = 496$
• $\gamma = 2.39$
• Must reject power law hypothesis
• Yet unclear which distribution fits
14. Bow-Tie Structure
Observations:
• Small IN component
• Large OUT component
• TENDRILS and TUBES almost non-existent
Compared to Broder et al.:
• Unbalanced
• LSCC much larger
Compared to our page graph*:
• Proportions of IN and OUT exchanged
• A large fraction of IN pages was merged into the LSCC (ca. 1 billion pages)
* R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
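The bow-tie components named above can be derived from the largest strongly connected component (LSCC) with two reachability passes. The sketch below assumes the graph fits in memory as a list of arcs and uses Kosaraju's algorithm to find the LSCC; the slides do not say which algorithm the authors used, and the function names are hypothetical. Tendrils, tubes and disconnected nodes are lumped together as "other".

```python
from collections import defaultdict


def _reachable(adj, starts):
    """Return every node reachable from `starts` by following `adj`."""
    seen, stack = set(starts), list(starts)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def _finish_order(nodes, fwd):
    """Iterative depth-first search returning nodes in order of completion."""
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd.get(root, ())))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    return order


def bow_tie(arcs):
    """Split the nodes of a directed graph into LSCC, IN, OUT and the rest."""
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in arcs:
        fwd[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))
    # Kosaraju: sweep the reversed graph in decreasing finish order.
    assigned, lscc = set(), set()
    for root in reversed(_finish_order(list(nodes), fwd)):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = {root}, [root]
        while stack:
            u = stack.pop()
            for v in rev.get(u, ()):
                if v not in assigned:
                    assigned.add(v)
                    comp.add(v)
                    stack.append(v)
        if len(comp) > len(lscc):
            lscc = comp
    out_part = _reachable(fwd, lscc) - lscc   # reachable from the LSCC
    in_part = _reachable(rev, lscc) - lscc    # can reach the LSCC
    other = nodes - lscc - in_part - out_part
    return {"LSCC": lscc, "IN": in_part, "OUT": out_part, "OTHER": other}
```

Dividing the size of each returned set by the number of nodes gives the proportions that the slide compares against Broder et al. and against the page graph.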
15. Distance Distribution
Methodology:
• Approximate the distribution several times (using HyperBall*)
Connected pairs: 42.42 (±3.59)%
Avg. distance: 4.27 (±0.085)
Diameter (at least): 48
*P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall:
A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013
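For intuition about the three quantities reported above, the following sketch computes them exactly with one breadth-first search per node. This is a deliberate substitute for HyperBall, which approximates the same distribution with probabilistic counters and scales to graphs where exact search is infeasible. The function name `distance_stats` is hypothetical.

```python
from collections import Counter, deque


def distance_stats(arcs):
    """Return (fraction of connected ordered pairs, average distance, largest distance)."""
    adj, nodes = {}, set()
    for u, v in arcs:
        adj.setdefault(u, []).append(v)
        nodes.update((u, v))
    hist = Counter()  # hist[d] = number of ordered pairs at shortest-path distance d
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    hist[dist[v]] += 1
                    queue.append(v)
    pairs = sum(hist.values())
    if pairs == 0:
        return 0.0, 0.0, 0  # no pair of distinct nodes is connected
    n = len(nodes)
    connected = pairs / (n * (n - 1))
    average = sum(d * c for d, c in hist.items()) / pairs
    return connected, average, max(hist)
```

The largest finite distance found this way is exact for the input graph. An approximate method only observes a lower bound, which is why the slide reports the diameter as "at least" 48.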
16. High connectivity based on Hubs?
• LSCC of 51.9%, 42% connected pairs & avg. distance of 4.27
– How important are hubs in this graph?
• Approach:
– A) Remove links to Hubs (i.e. high indegree)
– B) Keep only links to Hubs
– Repeat this for different indegree values as thresholds and then
measure largest remaining WCC/SCC
• Results
– Removing links to nodes with high indegree: no large SCC once all links
to nodes with indegree 10 or higher are removed
– Removing links to nodes with low indegree: the more links we remove,
the more likely the remaining nodes are to be part of the largest SCC
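The experiment described above can be sketched on a small in-memory graph as follows. The component search is quadratic and meant only for illustration, it assumes each arc appears once in the list, and the names `split_by_indegree` and `largest_components` are hypothetical rather than taken from the authors' code.

```python
from collections import Counter


def _reach(adj, start):
    """Return all nodes reachable from `start` by following `adj`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def largest_components(nodes, arcs):
    """Return (largest WCC size, largest SCC size); for small graphs only."""
    fwd, rev, und = {}, {}, {}
    for u, v in arcs:
        fwd.setdefault(u, []).append(v)
        rev.setdefault(v, []).append(u)
        und.setdefault(u, []).append(v)
        und.setdefault(v, []).append(u)
    best_wcc = best_scc = 0
    seen_w, seen_s = set(), set()
    for n in nodes:
        if n not in seen_w:
            comp = _reach(und, n)
            seen_w |= comp
            best_wcc = max(best_wcc, len(comp))
        if n not in seen_s:
            # The SCC of n is what n can reach intersected with what can reach n.
            comp = _reach(fwd, n) & _reach(rev, n)
            seen_s |= comp
            best_scc = max(best_scc, len(comp))
    return best_wcc, best_scc


def split_by_indegree(arcs, threshold):
    """Return (arcs into nodes below the threshold, arcs into nodes at or above it)."""
    indegree = Counter(v for _, v in arcs)
    to_low = [(u, v) for u, v in arcs if indegree[v] < threshold]
    to_hubs = [(u, v) for u, v in arcs if indegree[v] >= threshold]
    return to_low, to_hubs
```

Calling `split_by_indegree` for a range of thresholds and passing each arc list, together with the full node set, to `largest_components` reproduces the shape of approaches A and B: `to_low` is the graph with links to hubs removed, `to_hubs` is the graph with only links to hubs kept.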
17. Two Layer Model
Approach:
• Remove incoming links from the graph and measure the sizes of the largest SCC/WCC
Subgraph with indegree < 10 (Low Degree Layer):
• 73.7% of all nodes weakly connected
• No large strongly connected component
Subgraph with indegree ≥ 10 (High Degree Layer):
• Removed incoming links of 79.2% of all nodes
• 16.1% of all nodes strongly connected
18. PLD Topic Graph
Approach:
• Use topical categories from the Open Directory Project* to categorise our websites
• 15 topical categories
Results:
• “computers”: 6th largest, but largest number of links
• “shopping”: many more incoming than outgoing links, few internal links
Conclusion:
• No obvious patterns, more properties needed
[Figure: PLD topic graph; surviving labels: health, kids and teens, news]
*http://dmoz.org
19. Public Suffix (PS) Graph
Approach:
• Top ten PSs from our PLD graph + “others”
• Generally agrees with Verisign Domain Industry Brief*
gTLDs: more external than internal links
ccTLDs: more internal than external links
Extreme cases:
• .com does not follow this rule
• .de: half of all links are from a single spammer
[Figure: link graph between public suffixes; labels: com, de, info, it, net, nl, org, ru, co.uk, others]
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
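The internal versus external comparison above amounts to counting arcs by the group of their endpoints. A minimal sketch follows, assuming a precomputed mapping `group_of` from each website to its public suffix; the same counting would work for the topical categories of the previous slide. The mapping and the function name `link_profile` are hypothetical.

```python
from collections import defaultdict


def link_profile(arcs, group_of):
    """Count internal, outgoing and incoming arcs for every group."""
    profile = defaultdict(lambda: {"internal": 0, "outgoing": 0, "incoming": 0})
    for src, dst in arcs:
        g_src, g_dst = group_of[src], group_of[dst]
        if g_src == g_dst:
            profile[g_src]["internal"] += 1
        else:
            # A cross-group arc is outgoing for its source and incoming for its target.
            profile[g_src]["outgoing"] += 1
            profile[g_dst]["incoming"] += 1
    return dict(profile)
```

A group with more internal than external arcs follows the pattern reported for ccTLDs, and the opposite balance matches the pattern reported for gTLDs.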
20. WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables
Thank you for your attention!