SlideShare una empresa de Scribd logo
1 de 45
Descargar para leer sin conexión
Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group




AWS Summit 2012 | Berlin
Our Starting Point




        2
Our Starting Point
•   Websites now embed structured data in HTML




                             2
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...




                                 2
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   μFormats, RDFa, Microdata



                                 2
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   μFormats, RDFa, Microdata


Question: How are Vocabularies and Formats used?
                                 2
Web Indices

•   To answer our question, we need to access to raw Web data.




                               3
Web Indices

•   To answer our question, we need to access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)




                                 3
Web Indices

•   To answer our question, we need to access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)

•   Google and Bing have indices, but do not let outsiders in



                                 3
•   Non-Profit Organization




                              4
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps




                              4
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02-12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)



                                  4
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02-12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)

•   Available on AWS Public Data Sets

                                  4
Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)




                               5
Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

•   Preliminary analysis: 1 GB / hour / CPU possible

    •   8-CPU Desktop: 8 months

    •   64-CPU Server: 1 month

    •   100 8-CPU EC2-Instances: ~ 3 days

                                 5
Common Crawl
 Dataset Size
Common Crawl
              Dataset Size
1 CPU, 1 h
Common Crawl
                   Dataset Size
     1 CPU, 1 h

1000 € PC, 1 h
Common Crawl
                         Dataset Size
           1 CPU, 1 h

      1000 € PC, 1 h

5000 € Server, 1 h
Common Crawl
                               Dataset Size
                 1 CPU, 1 h

           1000 € PC, 1 h

     5000 € Server, 1 h




17 € EC2 Instances, 1 h
AWS Setup
•   Data Input: Read Index Splits from S3




                               7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue




                               7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)




                               7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3




                                 7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3

•   Logging: SDB


                                 7
SQS                         •   Each input file queued in SQS

                            •   EC2 Workers take tasks from SQS

                            •   Workers read and write S3 buckets

                      42



                                            ...

                                      EC2



      42   43   ...                   R42         R43   ...
                       CC                                     WDC
S3
SQS                         •   Each input file queued in SQS

                            •   EC2 Workers take tasks from SQS

                            •   Workers read and write S3 buckets

                      42



                                            ...

                                      EC2



      42   43   ...                   R42         R43   ...
                       CC                                     WDC
S3
SQS                         •   Each input file queued in SQS

                            •   EC2 Workers take tasks from SQS

                            •   Workers read and write S3 buckets

                      42



                                            ...

                                      EC2



      42   43   ...                   R42         R43   ...
                       CC                                     WDC
S3
Results - Types of Data
                                                     Microdata 02/2012
                                                     RDFa 02/2012           Website Structure                23 %
                     5e+06




                                                     RDFa 2009/2010
                                                     Microdata 2009/2010
                                                                            Products, Reviews                19 %
Entity Count (log)

                     5e+05




                                                                             Movies, Music, ...              15 %
                     5e+04




                                                                                 Geodata                     8 %
                     5e+03




                                                                           People, Organizations             7 %
                             0   50     100    150                  200           2012 Microdata Breakdown
                                        Type




                                                                   9
Results - Types of Data
                                                            Microdata 02/2012
                                                            RDFa 02/2012           Website Structure                23 %
                     5e+06




                                                            RDFa 2009/2010
                                                            Microdata 2009/2010
                                                                                   Products, Reviews                19 %
Entity Count (log)

                     5e+05




                                                                                    Movies, Music, ...              15 %
                     5e+04




                                                                                        Geodata                     8 %
                     5e+03




                                                                                  People, Organizations             7 %
                             0       50      100      150                  200           2012 Microdata Breakdown
                                             Type




                                 •   Available data largely determined by major player support


                                                                          9
Results - Types of Data
                                                             Microdata 02/2012
                                                             RDFa 02/2012           Website Structure                23 %
                     5e+06




                                                             RDFa 2009/2010
                                                             Microdata 2009/2010
                                                                                    Products, Reviews                19 %
Entity Count (log)

                     5e+05




                                                                                     Movies, Music, ...              15 %
                     5e+04




                                                                                         Geodata                     8 %
                     5e+03




                                                                                   People, Organizations             7 %
                             0       50       100      150                  200           2012 Microdata Breakdown
                                             Type




                                 •   Available data largely determined by major player support

                                 •   “If Google consumes it, we will publish it”
                                                                           9
Results - Formats

                                                                                                                    2009/2010


•




                                                         4
                                                                                                                    02−2012
    URLs with embedded Data: +6%




                                    Percentage of URLs

                                                         3
                                                         2
                                                         1
                                                         0
                                                             RDFa   Microdata   geo   hcalendar   hcard   hreview     XFN

                                                                                       Format




                                   10
Results - Formats

                                                                                                                    2009/2010


•




                                                         4
                                                                                                                    02−2012
    URLs with embedded Data: +6%




                                    Percentage of URLs

                                                         3
•   Microdata +14% (schema.org?)




                                                         2
                                                         1
                                                         0
                                                             RDFa   Microdata   geo   hcalendar   hcard   hreview     XFN

                                                                                       Format




                                   10
Results - Formats

                                                                                                                    2009/2010


•




                                                         4
                                                                                                                    02−2012
    URLs with embedded Data: +6%




                                    Percentage of URLs

                                                         3
•   Microdata +14% (schema.org?)




                                                         2
•

                                                         1
    RDFa +26% (Facebook?)




                                                         0
                                                             RDFa   Microdata   geo   hcalendar   hcard   hreview     XFN

                                                                                       Format




                                   10
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org




                                11
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)




                                11
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

•   Have a look!



                                11
AWS Costs

•   Ca. 5500 Machine-Hours were required

    •   1100 € billed by AWS for that




                                 12
AWS Costs

•   Ca. 5500 Machine-Hours were required

    •   1100 € billed by AWS for that

•   Cost for other services negligible *




                                 12
AWS Costs

•   Ca. 5500 Machine-Hours were required

    •   1100 € billed by AWS for that

•   Cost for other services negligible *

•   * At first, we underestimated SDB cost



                                 12
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available




                             13
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets




                             13
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction




                             13
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction

•   Choose your architecture wisely, test by experiment, for us
    EMR was too expensive.

                                13
Thank You!
              Questions?
            Want to hire me?


Web Resources: http://webdatacommons.org
     http://hannes.muehleisen.org

Más contenido relacionado

La actualidad más candente

Aioug connection poolsizingconcepts
Aioug connection poolsizingconceptsAioug connection poolsizingconcepts
Aioug connection poolsizingconceptsToon Koppelaars
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQLKonstantin Gredeskoul
 
An Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media PlatformAn Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media PlatformMongoDB
 
New lessons in connection management
New lessons in connection managementNew lessons in connection management
New lessons in connection managementToon Koppelaars
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Malin Weiss
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterMat Keep
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of MillionsErik Onnen
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierKellyn Pot'Vin-Gorman
 
Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!PGConf APAC
 
Scalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and ApproachesScalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and Approachesadunne
 
Azure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars Platzdasch
Azure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars PlatzdaschAzure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars Platzdasch
Azure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars PlatzdaschLars Platzdasch
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017Severalnines
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 

La actualidad más candente (20)

Aioug connection poolsizingconcepts
Aioug connection poolsizingconceptsAioug connection poolsizingconcepts
Aioug connection poolsizingconcepts
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
 
An Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media PlatformAn Elastic Metadata Store for eBay’s Media Platform
An Elastic Metadata Store for eBay’s Media Platform
 
New lessons in connection management
New lessons in connection managementNew lessons in connection management
New lessons in connection management
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
 
Queues, Pools, Caches
Queues, Pools, CachesQueues, Pools, Caches
Queues, Pools, Caches
 
Hazelcast
HazelcastHazelcast
Hazelcast
 
XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...
XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...
XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...
 
Wt unit 1 ppts web development process
Wt unit 1 ppts web development processWt unit 1 ppts web development process
Wt unit 1 ppts web development process
 
Drupal In The Cloud
Drupal In The CloudDrupal In The Cloud
Drupal In The Cloud
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next Frontier
 
Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!
 
Scalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and ApproachesScalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and Approaches
 
Wt unit 3 server side technology
Wt unit 3 server side technologyWt unit 3 server side technology
Wt unit 3 server side technology
 
Azure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars Platzdasch
Azure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars PlatzdaschAzure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars Platzdasch
Azure Boot Camp 21.04.2018 SQL Server in Azure Iaas PaaS on-prem Lars Platzdasch
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 

Destacado

2013 11 mobile eating the world
2013 11 mobile eating the world2013 11 mobile eating the world
2013 11 mobile eating the worldBenedict Evans
 
An introduction to mixi Graph API
An introduction to mixi Graph APIAn introduction to mixi Graph API
An introduction to mixi Graph APImixiplatform
 
Information Architecture On A Large Scale
Information Architecture On A Large ScaleInformation Architecture On A Large Scale
Information Architecture On A Large Scaledglazkov
 
Web Development for Mobile: GTUG Talk at Google
Web Development for Mobile: GTUG Talk at GoogleWeb Development for Mobile: GTUG Talk at Google
Web Development for Mobile: GTUG Talk at GoogleEstelle Weyl
 
iPhone Web Applications: HTML5, CSS3 & dev tips for iPhone development
iPhone Web Applications: HTML5, CSS3 & dev tips for iPhone developmentiPhone Web Applications: HTML5, CSS3 & dev tips for iPhone development
iPhone Web Applications: HTML5, CSS3 & dev tips for iPhone developmentEstelle Weyl
 
Competitive Intelligence on a Startup Budget
Competitive Intelligence on a Startup BudgetCompetitive Intelligence on a Startup Budget
Competitive Intelligence on a Startup BudgetWendy Scherer
 
SeriesC Startup Marketing Budget Survey
SeriesC Startup Marketing Budget SurveySeriesC Startup Marketing Budget Survey
SeriesC Startup Marketing Budget SurveyRick Dolezalek
 
[EN] 7 steps to a successful International PR Campaign
[EN] 7 steps to a successful International PR Campaign[EN] 7 steps to a successful International PR Campaign
[EN] 7 steps to a successful International PR CampaignBecomewide
 
Supporting The Open Web - OSCON 2008
Supporting The Open Web - OSCON 2008Supporting The Open Web - OSCON 2008
Supporting The Open Web - OSCON 2008David Recordon
 
Brilliant PR on a Startup Budget
Brilliant PR on a Startup BudgetBrilliant PR on a Startup Budget
Brilliant PR on a Startup BudgetCourtney Myers
 
Performance Implications of Mobile Design
Performance Implications of Mobile DesignPerformance Implications of Mobile Design
Performance Implications of Mobile DesignGuy Podjarny
 
Startup - Finance and funding 1
Startup - Finance and funding 1Startup - Finance and funding 1
Startup - Finance and funding 1NowMaster Academy
 
Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)
Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)
Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)MaRS Discovery District
 
Start Up Finance
Start Up FinanceStart Up Finance
Start Up FinanceGina Evans
 
Basics of Startup Financial Planning
Basics of Startup Financial PlanningBasics of Startup Financial Planning
Basics of Startup Financial PlanningeCornell
 
The Ultimate Guide to Startup Marketing
The Ultimate Guide to Startup MarketingThe Ultimate Guide to Startup Marketing
The Ultimate Guide to Startup MarketingOnboardly
 
Business Plan Powerpoint 1
Business Plan Powerpoint 1Business Plan Powerpoint 1
Business Plan Powerpoint 1haleydawn
 
SEOmoz Pitch Deck July 2011
SEOmoz Pitch Deck July 2011SEOmoz Pitch Deck July 2011
SEOmoz Pitch Deck July 2011Rand Fishkin
 

Destacado (19)

2013 11 mobile eating the world
2013 11 mobile eating the world2013 11 mobile eating the world
2013 11 mobile eating the world
 
An introduction to mixi Graph API
An introduction to mixi Graph APIAn introduction to mixi Graph API
An introduction to mixi Graph API
 
Information Architecture On A Large Scale
Information Architecture On A Large ScaleInformation Architecture On A Large Scale
Information Architecture On A Large Scale
 
Web Development for Mobile: GTUG Talk at Google
Web Development for Mobile: GTUG Talk at GoogleWeb Development for Mobile: GTUG Talk at Google
Web Development for Mobile: GTUG Talk at Google
 
iPhone Web Applications: HTML5, CSS3 & dev tips for iPhone development
iPhone Web Applications: HTML5, CSS3 & dev tips for iPhone developmentiPhone Web Applications: HTML5, CSS3 & dev tips for iPhone development
iPhone Web Applications: HTML5, CSS3 & dev tips for iPhone development
 
Competitive Intelligence on a Startup Budget
Competitive Intelligence on a Startup BudgetCompetitive Intelligence on a Startup Budget
Competitive Intelligence on a Startup Budget
 
SeriesC Startup Marketing Budget Survey
SeriesC Startup Marketing Budget SurveySeriesC Startup Marketing Budget Survey
SeriesC Startup Marketing Budget Survey
 
[EN] 7 steps to a successful International PR Campaign
[EN] 7 steps to a successful International PR Campaign[EN] 7 steps to a successful International PR Campaign
[EN] 7 steps to a successful International PR Campaign
 
Supporting The Open Web - OSCON 2008
Supporting The Open Web - OSCON 2008Supporting The Open Web - OSCON 2008
Supporting The Open Web - OSCON 2008
 
HTML5 Web Forms
HTML5 Web FormsHTML5 Web Forms
HTML5 Web Forms
 
Brilliant PR on a Startup Budget
Brilliant PR on a Startup BudgetBrilliant PR on a Startup Budget
Brilliant PR on a Startup Budget
 
Performance Implications of Mobile Design
Performance Implications of Mobile DesignPerformance Implications of Mobile Design
Performance Implications of Mobile Design
 
Startup - Finance and funding 1
Startup - Finance and funding 1Startup - Finance and funding 1
Startup - Finance and funding 1
 
Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)
Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)
Financial Planning for the startup CEO - Entrepreneurship 101 (2012/2013)
 
Start Up Finance
Start Up FinanceStart Up Finance
Start Up Finance
 
Basics of Startup Financial Planning
Basics of Startup Financial PlanningBasics of Startup Financial Planning
Basics of Startup Financial Planning
 
The Ultimate Guide to Startup Marketing
The Ultimate Guide to Startup MarketingThe Ultimate Guide to Startup Marketing
The Ultimate Guide to Startup Marketing
 
Business Plan Powerpoint 1
Business Plan Powerpoint 1Business Plan Powerpoint 1
Business Plan Powerpoint 1
 
SEOmoz Pitch Deck July 2011
SEOmoz Pitch Deck July 2011SEOmoz Pitch Deck July 2011
SEOmoz Pitch Deck July 2011
 

Similar a AWS Summit Berlin 2012 Talk on Web Data Commons

Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AITorsten Steinbach
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMark Kromer
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMinsk MongoDB User Group
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Kristi Lewandowski
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017SingleStore
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Kristi Lewandowski
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Laure Vergeron
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FSMongoDB
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsMongoDB
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformMaris Elsins
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveTorsten Steinbach
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceRod Boothby
 

Similar a AWS Summit Berlin 2012 Talk on Web Data Commons (20)

Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup Presentation
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebService
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FS
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and Results
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage Service
 

Último

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

AWS Summit Berlin 2012 Talk on Web Data Commons

  • 1. Large-Scale Analysis of Web Pages − on a Startup Budget? Hannes Mühleisen, Web-Based Systems Group AWS Summit 2012 | Berlin
  • 3. Our Starting Point • Websites now embed structured data in HTML 2
  • 4. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible • schema.org, Open Graph protocol, ... 2
  • 5. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible • schema.org, Open Graph protocol, ... • Various Encoding Formats possible • μFormats, RDFa, Microdata 2
  • 6. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible • schema.org, Open Graph protocol, ... • Various Encoding Formats possible • μFormats, RDFa, Microdata Question: How are Vocabularies and Formats used? 2
  • 7. Web Indices • To answer our question, we need to access to raw Web data. 3
  • 8. Web Indices • To answer our question, we need to access to raw Web data. • However, maintaining Web indices is insanely expensive • Re-Crawling, Storage, currently ~50 B pages (Google) 3
  • 9. Web Indices • To answer our question, we need to access to raw Web data. • However, maintaining Web indices is insanely expensive • Re-Crawling, Storage, currently ~50 B pages (Google) • Google and Bing have indices, but do not let outsiders in 3
  • 10. Non-Profit Organization 4
  • 11. Non-Profit Organization • Runs crawler and provides HTML dumps 4
  • 12. Non-Profit Organization • Runs crawler and provides HTML dumps • Available data: • Index 02-12: 1.7 B URLs (21 TB) • Index 09/12: 2.8 B URLs (29 TB) 4
  • 13. Non-Profit Organization • Runs crawler and provides HTML dumps • Available data: • Index 02-12: 1.7 B URLs (21 TB) • Index 09/12: 2.8 B URLs (29 TB) • Available on AWS Public Data Sets 4
  • 14. Why AWS? • Now that we have a web crawl, how do we run our analysis? • Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!) 5
  • 15. Why AWS? • Now that we have a web crawl, how do we run our analysis? • Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!) • Preliminary analysis: 1 GB / hour / CPU possible • 8-CPU Desktop: 8 months • 64-CPU Server: 1 month • 100 8-CPU EC2-Instances: ~ 3 days 5
  • 17. Common Crawl Dataset Size 1 CPU, 1 h
  • 18. Common Crawl Dataset Size 1 CPU, 1 h 1000 € PC, 1 h
  • 19. Common Crawl Dataset Size 1 CPU, 1 h 1000 € PC, 1 h 5000 € Server, 1 h
  • 20. Common Crawl Dataset Size 1 CPU, 1 h 1000 € PC, 1 h 5000 € Server, 1 h 17 € EC2 Instances, 1 h
  • 21. AWS Setup • Data Input: Read Index Splits from S3 7
  • 22. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue 7
  • 23. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) 7
  • 24. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) • Result Output: Write to S3 7
  • 25. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) • Result Output: Write to S3 • Logging: SDB 7
  • 26. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets 42 ... EC2 42 43 ... R42 R43 ... CC WDC S3
  • 27. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets 42 ... EC2 42 43 ... R42 R43 ... CC WDC S3
  • 28. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets 42 ... EC2 42 43 ... R42 R43 ... CC WDC S3
  • 29. Results - Types of Data Microdata 02/2012 RDFa 02/2012 Website Structure 23 % 5e+06 RDFa 2009/2010 Microdata 2009/2010 Products, Reviews 19 % Entity Count (log) 5e+05 Movies, Music, ... 15 % 5e+04 Geodata 8 % 5e+03 People, Organizations 7 % 0 50 100 150 200 2012 Microdata Breakdown Type 9
  • 30. Results - Types of Data Microdata 02/2012 RDFa 02/2012 Website Structure 23 % 5e+06 RDFa 2009/2010 Microdata 2009/2010 Products, Reviews 19 % Entity Count (log) 5e+05 Movies, Music, ... 15 % 5e+04 Geodata 8 % 5e+03 People, Organizations 7 % 0 50 100 150 200 2012 Microdata Breakdown Type • Available data largely determined by major player support 9
  • 31. Results - Types of Data Microdata 02/2012 RDFa 02/2012 Website Structure 23 % 5e+06 RDFa 2009/2010 Microdata 2009/2010 Products, Reviews 19 % Entity Count (log) 5e+05 Movies, Music, ... 15 % 5e+04 Geodata 8 % 5e+03 People, Organizations 7 % 0 50 100 150 200 2012 Microdata Breakdown Type • Available data largely determined by major player support • “If Google consumes it, we will publish it” 9
  • 32. Results - Formats 2009/2010 • 4 02−2012 URLs with embedded Data: +6% Percentage of URLs 3 2 1 0 RDFa Microdata geo hcalendar hcard hreview XFN Format 10
  • 33. Results - Formats 2009/2010 • 4 02−2012 URLs with embedded Data: +6% Percentage of URLs 3 • Microdata +14% (schema.org?) 2 1 0 RDFa Microdata geo hcalendar hcard hreview XFN Format 10
  • 34. Results - Formats 2009/2010 • 4 02−2012 URLs with embedded Data: +6% Percentage of URLs 3 • Microdata +14% (schema.org?) 2 • 1 RDFa +26% (Facebook?) 0 RDFa Microdata geo hcalendar hcard hreview XFN Format 10
  • 35. Results - Extracted Data • Extracted data available for download at • www.webdatacommons.org 11
  • 36. Results - Extracted Data • Extracted data available for download at • www.webdatacommons.org • Formats: RDF (~90 GB) and CSV Tables for Microformats (!) 11
  • 37. Results - Extracted Data • Extracted data available for download at • www.webdatacommons.org • Formats: RDF (~90 GB) and CSV Tables for Microformats (!) • Have a look! 11
  • 38. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that 12
  • 39. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that • Cost for other services negligible * 12
  • 40. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that • Cost for other services negligible * • * At first, we underestimated SDB cost 12
  • 41. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available 13
  • 42. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets 13
  • 43. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets • AWS great for massive ad-hoc computing power and complexity reduction 13
  • 44. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets • AWS great for massive ad-hoc computing power and complexity reduction • Choose your architecture wisely, test by experiment, for us EMR was too expensive. 13
  • 45. Thank You! Questions? Want to hire me? Web Resources: http://webdatacommons.org http://hannes.muehleisen.org