Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 1
Graph Structure in the Web
Revisited
Robert Meusel, Sebastiano Vigna,
Oliver Lehmberg, Christian Bizer
Textbook Knowledge about the Web Graph
Broder et al.: Graph structure in the Web. WWW2000.
Used two AltaVista crawls (200 million pages, 1.5 billion links)
Results: Power Laws, Bow-Tie
This talk will:
1. Show that the textbook knowledge might be wrong or dependent on the crawling process.
2. Provide you with a large, recent Web graph for further research.
Outline
1. Public Web Crawls
2. The Web Data Commons Hyperlink Graph
3. Analysis of the Graph
   1. In-degree & Out-degree Distributions
   2. Node Centrality
   3. Strong Components
   4. Bow Tie
   5. Reachability and Average Shortest Path
4. Conclusion
Public Web Crawls
1. AltaVista Crawl distributed by Yahoo! WebScope 2002
   • Size: 1.4 billion pages
   • Problem: largest strongly connected component covers only 4% of the graph
2. ClueWeb 2009
   • Size: 1 billion pages
   • Problem: largest strongly connected component covers only 3% of the graph
3. ClueWeb 2012
   • Size: 733 million pages
   • Largest strongly connected component covers 76% of the graph
   • Problem: only English pages
The Common Crawl
The Common Crawl Foundation
Regularly publishes Web crawls on Amazon S3.
Five crawls available so far:
Date          # Pages
2010          2.5 billion
Spring 2012   3.5 billion
Spring 2013   2.0 billion
Winter 2013   2.0 billion
Spring 2014   2.5 billion
Crawling Strategy (Spring 2012)
• breadth-first visiting strategy
• at least 71 million seeds from previous crawls and from Wikipedia
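The breadth-first visiting strategy named on this slide can be illustrated with a small sketch. It is not the Common Crawl crawler: `breadth_first_crawl`, `get_links`, and the `toy_web` dictionary are hypothetical names, and a real crawler would also handle fetching, politeness, and deduplication of URL variants.

```python
from collections import deque

def breadth_first_crawl(seeds, get_links, max_pages):
    """Visit pages in breadth-first order, starting from the seed URLs."""
    visited = []                 # pages in the order they were visited
    seen = set(seeds)            # pages already queued or visited
    frontier = deque(seeds)      # FIFO queue gives breadth-first order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for target in get_links(url):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited

# Toy link structure standing in for the real Web (hypothetical data).
toy_web = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["a"],
    "e": [],
}
order = breadth_first_crawl(["a"], lambda url: toy_web.get(url, []), max_pages=10)
print(order)
```

Because the frontier is a first-in, first-out queue, pages close to the seeds are visited before pages that are many clicks away, which is one way the choice of seeds and page budget can shape the resulting graph.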
Web Data Commons – Hyperlink Graph
Extracted from the Spring 2012 version of the Common Crawl
Size:
• 3.5 billion nodes
• 128 billion arcs
Pages originate from 43 million pay-level domains (PLDs)
• about 18% of the 240 million PLDs registered in 2012 *
World-wide coverage
* http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
4 aggregation levels:
Graph                       #Nodes         #Arcs            Size (zipped)
Page graph                  3.56 billion   128.73 billion   376 GB
Subdomain graph             101 million    2,043 million    10 GB
1st level subdomain graph   95 million     1,937 million    9.5 GB
PLD graph                   43 million     623 million      3.1 GB
Extraction code is published under Apache License
• Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
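To make the idea of aggregation levels concrete, here is a minimal sketch that collapses page-level arcs into host-level arcs. It is not the published extraction code: `aggregate_by_host` is a hypothetical helper, and dropping arcs that stay within one host is an assumption made for this illustration. Aggregating to pay-level domains would additionally need a public suffix list, which the Python standard library does not provide.

```python
from urllib.parse import urlsplit

def aggregate_by_host(page_arcs):
    """Collapse page-level arcs into the set of distinct host-level arcs."""
    host_arcs = set()
    for source_url, target_url in page_arcs:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Assumption for this sketch: links within one host are ignored.
        if source_host and target_host and source_host != target_host:
            host_arcs.add((source_host, target_host))
    return host_arcs

# Hypothetical page-level arcs.
page_arcs = [
    ("http://a.example.org/x", "http://b.example.com/y"),
    ("http://a.example.org/z", "http://b.example.com/"),
    ("http://a.example.org/x", "http://a.example.org/z"),
]
print(aggregate_by_host(page_arcs))
```

Many page-level arcs map onto the same host pair, which is why the aggregated graphs in the table have far fewer arcs than the page graph.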
Analysis of the Graph
In-Degree Distribution
Broder et al. (2000)
Power law with exponent 2.1
WDC Hyperlink Graph (2012)
Best power law exponent 2.24
In-Degree Distribution
Power law fitted using the plfit tool (maximum likelihood fitting).
Starting degree: 1129
Best power law exponent: 2.24
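As a rough illustration of maximum likelihood fitting, the sketch below uses the closed-form approximation for discrete data given by Clauset et al., alpha ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)). It is not claimed to be what the plfit tool does internally; `fit_power_law_exponent` is a hypothetical name, and the data are synthetic draws that reuse the slide's starting degree and exponent only as sample parameters.

```python
import math
import random

def fit_power_law_exponent(degrees, x_min):
    """Approximate maximum-likelihood exponent for the tail with degree >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum

# Synthetic degrees: continuous power-law draws rounded to integers.
rng = random.Random(42)
x_min, true_alpha = 1129, 2.24
synthetic = [
    round((x_min - 0.5) * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0)))
    for _ in range(20000)
]
estimate = fit_power_law_exponent(synthetic, x_min)
print(f"estimated exponent: {estimate:.2f}")
```

The estimator always returns some exponent, even for data that are not power-law distributed, so a fitted value on its own says nothing about whether the power law is a plausible model.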
Goodness of Fit Test
Method
• Clauset et al.:
Power-Law Distributions in Empirical Data. SIAM Review 2009.
• p-value < 0.1 → power law not a plausible hypothesis
Goodness of fit result
• p-value = 0
Conclusions:
• in-degree does not follow power law
• in-degree has non-fat heavy-tailed distribution
• maybe log-normal?
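The goodness-of-fit idea can be sketched as a bootstrap over Kolmogorov-Smirnov distances: fit the exponent, measure the distance between data and fit, then check how often genuinely power-law samples of the same size fit worse. This is a simplified sketch, not the authors' code: it keeps x_min fixed and treats the data as continuous, whereas the full procedure of Clauset et al. also re-estimates x_min for every synthetic sample. All function names are hypothetical.

```python
import math
import random

def fit_alpha(samples, x_min):
    """Continuous maximum-likelihood estimate of the power-law exponent."""
    return 1.0 + len(samples) / sum(math.log(x / x_min) for x in samples)

def ks_distance(samples, x_min, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst

def draw_power_law(n, x_min, alpha, rng):
    """Inverse-transform sampling from a continuous power law."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def goodness_of_fit_p_value(samples, x_min, rounds=200, seed=0):
    """Fraction of synthetic power-law samples that fit worse than the data."""
    rng = random.Random(seed)
    tail = [x for x in samples if x >= x_min]
    alpha = fit_alpha(tail, x_min)
    observed = ks_distance(tail, x_min, alpha)
    worse = 0
    for _ in range(rounds):
        synthetic = draw_power_law(len(tail), x_min, alpha, rng)
        if ks_distance(synthetic, x_min, fit_alpha(synthetic, x_min)) >= observed:
            worse += 1
    return worse / rounds

# Data that clearly do not follow a power law: uniform values between 1 and 2.
data_rng = random.Random(1)
uniform_data = [data_rng.uniform(1.0, 2.0) for _ in range(1000)]
p_value = goodness_of_fit_p_value(uniform_data, x_min=1.0)
print(p_value)
```

A small p-value means that almost no true power-law sample is as far from its own fit as the data are, which is the sense in which the slide rejects the power-law hypothesis.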
Out-Degree Distribution
Broder et al.: power law exponent 2.78
WDC: best power law exponent 2.77, p-value = 0
Node Centrality
http://wwwranking.webdatacommons.org
Average Degree
Broder et al. 2000: 7.5
WDC 2012: 36.8
→ Factor 4.9 larger
Possible explanation: HTML templates of content management systems (CMS)
Strongly Connected Components
Calculated using WebGraph framework on a machine with 1 TB RAM.
Largest SCC
Broder: 27.7%
WDC: 51.3%
→ Factor 1.8 larger
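For readers unfamiliar with strongly connected components, the following sketch computes them on a toy graph with Kosaraju's algorithm and reports the share of the largest one. It only illustrates the concept: the slide's computation used the WebGraph framework, and this in-memory Python version (with the hypothetical name `strongly_connected_components`) would not scale to billions of nodes.

```python
from collections import defaultdict

def strongly_connected_components(nodes, arcs):
    """Kosaraju's algorithm, written iteratively to avoid recursion limits."""
    forward = defaultdict(list)
    backward = defaultdict(list)
    for source, target in arcs:
        forward[source].append(target)
        backward[target].append(source)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    finished = []
    seen = set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All successors explored: this node is finished.
                stack.pop()
                finished.append(node)

    # Pass 2: sweep the reversed graph in decreasing finish order.
    components = []
    assigned = set()
    for start in reversed(finished):
        if start in assigned:
            continue
        assigned.add(start)
        component = [start]
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.append(prev)
                    stack.append(prev)
        components.append(component)
    return components

# Toy graph (hypothetical): a 3-cycle, a 2-cycle fed by it, and one extra page.
nodes = [1, 2, 3, 4, 5, 6]
arcs = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1)]
components = strongly_connected_components(nodes, arcs)
largest = max(components, key=len)
print(sorted(largest), len(largest) / len(nodes))
```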
The Bow-Tie Structure of Broder et al. 2000
Balanced size of IN and OUT: 21%
Size of LSCC: 27%
The Bow-Tie Structure of the WDC Hyperlink Graph 2012
IN much larger than OUT: 31% vs. 6%
LSCC much larger: 51%
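The bow-tie components can be derived from two reachability sweeps, sketched below on a toy graph. The sketch assumes the caller already knows one node inside the largest strongly connected component (`core_node`); the function names and the catch-all "OTHER" bucket (tendrils, tubes, and disconnected pages lumped together) are simplifications made for this illustration.

```python
from collections import defaultdict, deque

def reachable(start, adjacency):
    """Set of nodes reachable from start by following adjacency lists."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(nodes, arcs, core_node):
    """Split nodes into LSCC, IN, OUT and everything else."""
    forward = defaultdict(list)
    backward = defaultdict(list)
    for source, target in arcs:
        forward[source].append(target)
        backward[target].append(source)
    downstream = reachable(core_node, forward)    # core plus OUT
    upstream = reachable(core_node, backward)     # core plus IN
    core = downstream & upstream
    return {
        "LSCC": core,
        "IN": upstream - core,
        "OUT": downstream - core,
        "OTHER": set(nodes) - downstream - upstream,
    }

# Toy graph (hypothetical): one page links into a 2-cycle, which links out.
nodes = ["in1", "a", "b", "out1", "t", "x"]
arcs = [("in1", "a"), ("a", "b"), ("b", "a"), ("b", "out1"), ("t", "x")]
parts = bow_tie(nodes, arcs, core_node="a")
for name, members in parts.items():
    print(name, sorted(members), f"{len(members) / len(nodes):.0%}")
```

IN pages can reach the core but cannot be reached from it, and OUT pages are the reverse, so a crawl that starts from well-linked seeds and stops early may record fewer OUT pages than exist.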
Zhu et al. WWW2008
The Chinese web looks like a tea-pot.
Reachability and Average Shortest Path
Broder et al. 2000
• Pairs of pages connected by a path: 25%
• Average shortest path: 16.12
WDC Webgraph 2012
• Pairs of pages connected by a path: 48%
• Average shortest path: 12.84
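Both quantities on this slide can be defined through breadth-first search: the share of ordered page pairs joined by a directed path, and the mean path length over those connected pairs. The sketch below computes them exactly on a toy graph; the names are hypothetical, and on a graph with billions of nodes one would have to sample sources or use an approximation rather than run a search from every page.

```python
from collections import defaultdict, deque

def bfs_distances(source, adjacency):
    """Shortest-path length from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

def reachability_stats(nodes, arcs, sources):
    """Return (share of connected ordered pairs, average shortest path)."""
    adjacency = defaultdict(list)
    for source, target in arcs:
        adjacency[source].append(target)
    nodes = list(nodes)
    sources = list(sources)
    connected_pairs = 0
    total_distance = 0
    for source in sources:
        distance = bfs_distances(source, adjacency)
        connected_pairs += len(distance) - 1      # exclude the source itself
        total_distance += sum(distance.values())
    possible_pairs = len(sources) * (len(nodes) - 1)
    share = connected_pairs / possible_pairs if possible_pairs else 0.0
    average = total_distance / connected_pairs if connected_pairs else None
    return share, average

# Toy graph (hypothetical): a directed path 1 -> 2 -> 3 and an isolated page 4.
nodes = [1, 2, 3, 4]
arcs = [(1, 2), (2, 3)]
share, average = reachability_stats(nodes, arcs, sources=nodes)
print(share, average)
```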
Conclusions
1. The Web has become denser and more connected
   • Average degree has grown significantly in the last 13 years (factor 5)
   • Connectivity between pairs of pages has doubled
2. Macroscopic structure
   • There is a large SCC of growing size.
   • The shape of the bow-tie seems to depend on the crawl.
3. In- and out-degree distributions do not follow power laws.
Questions?
Advertisement
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables

Graph Structure in the Web - Revisited. WWW2014 Web Science Track
