SlideShare una empresa de Scribd logo
1 de 45
Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group




AWS Summit 2012 | Berlin
Our Starting Point




        2
Our Starting Point
•   Websites now embed structured data in HTML




                             2
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...




                                 2
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   μFormats, RDFa, Microdata



                                 2
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   μFormats, RDFa, Microdata


Question: How are Vocabularies and Formats used?
                                 2
Web Indices

•   To answer our question, we need to access to raw Web data.




                               3
Web Indices

•   To answer our question, we need to access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)




                                 3
Web Indices

•   To answer our question, we need to access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)

•   Google and Bing have indices, but do not let outsiders in



                                 3
•   Non-Profit Organization




                              4
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps




                              4
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02-12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)



                                  4
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02-12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)

•   Available on AWS Public Data Sets

                                  4
Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)




                               5
Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

•   Preliminary analysis: 1 GB / hour / CPU possible

    •   8-CPU Desktop: 8 months

    •   64-CPU Server: 1 month

    •   100 8-CPU EC2-Instances: ~ 3 days

                                 5
Common Crawl
 Dataset Size
Common Crawl
              Dataset Size
1 CPU, 1 h
Common Crawl
                   Dataset Size
     1 CPU, 1 h

1000 € PC, 1 h
Common Crawl
                         Dataset Size
           1 CPU, 1 h

      1000 € PC, 1 h

5000 € Server, 1 h
Common Crawl
                               Dataset Size
                 1 CPU, 1 h

           1000 € PC, 1 h

     5000 € Server, 1 h




17 € EC2 Instances, 1 h
AWS Setup
•   Data Input: Read Index Splits from S3




                               7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue




                               7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)




                               7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3




                                 7
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3

•   Logging: SDB


                                 7
SQS                         •   Each input file queued in SQS

                            •   EC2 Workers take tasks from SQS

                            •   Workers read and write S3 buckets

                      42



                                            ...

                                      EC2



      42   43   ...                   R42         R43   ...
                       CC                                     WDC
S3
SQS                         •   Each input file queued in SQS

                            •   EC2 Workers take tasks from SQS

                            •   Workers read and write S3 buckets

                      42



                                            ...

                                      EC2



      42   43   ...                   R42         R43   ...
                       CC                                     WDC
S3
SQS                         •   Each input file queued in SQS

                            •   EC2 Workers take tasks from SQS

                            •   Workers read and write S3 buckets

                      42



                                            ...

                                      EC2



      42   43   ...                   R42         R43   ...
                       CC                                     WDC
S3
Results - Types of Data
                                                     Microdata 02/2012
                                                     RDFa 02/2012           Website Structure                23 %
                     5e+06




                                                     RDFa 2009/2010
                                                     Microdata 2009/2010
                                                                            Products, Reviews                19 %
Entity Count (log)

                     5e+05




                                                                             Movies, Music, ...              15 %
                     5e+04




                                                                                 Geodata                     8 %
                     5e+03




                                                                           People, Organizations             7 %
                             0   50     100    150                  200           2012 Microdata Breakdown
                                        Type




                                                                   9
Results - Types of Data
                                                            Microdata 02/2012
                                                            RDFa 02/2012           Website Structure                23 %
                     5e+06




                                                            RDFa 2009/2010
                                                            Microdata 2009/2010
                                                                                   Products, Reviews                19 %
Entity Count (log)

                     5e+05




                                                                                    Movies, Music, ...              15 %
                     5e+04




                                                                                        Geodata                     8 %
                     5e+03




                                                                                  People, Organizations             7 %
                             0       50      100      150                  200           2012 Microdata Breakdown
                                             Type




                                 •   Available data largely determined by major player support


                                                                          9
Results - Types of Data
                                                             Microdata 02/2012
                                                             RDFa 02/2012           Website Structure                23 %
                     5e+06




                                                             RDFa 2009/2010
                                                             Microdata 2009/2010
                                                                                    Products, Reviews                19 %
Entity Count (log)

                     5e+05




                                                                                     Movies, Music, ...              15 %
                     5e+04




                                                                                         Geodata                     8 %
                     5e+03




                                                                                   People, Organizations             7 %
                             0       50       100      150                  200           2012 Microdata Breakdown
                                             Type




                                 •   Available data largely determined by major player support

                                 •   “If Google consumes it, we will publish it”
                                                                           9
Results - Formats

                                                                                                                    2009/2010


•




                                                         4
                                                                                                                    02−2012
    URLs with embedded Data: +6%




                                    Percentage of URLs

                                                         3
                                                         2
                                                         1
                                                         0
                                                             RDFa   Microdata   geo   hcalendar   hcard   hreview     XFN

                                                                                       Format




                                   10
Results - Formats

                                                                                                                    2009/2010


•




                                                         4
                                                                                                                    02−2012
    URLs with embedded Data: +6%




                                    Percentage of URLs

                                                         3
•   Microdata +14% (schema.org?)




                                                         2
                                                         1
                                                         0
                                                             RDFa   Microdata   geo   hcalendar   hcard   hreview     XFN

                                                                                       Format




                                   10
Results - Formats

                                                                                                                    2009/2010


•




                                                         4
                                                                                                                    02−2012
    URLs with embedded Data: +6%




                                    Percentage of URLs

                                                         3
•   Microdata +14% (schema.org?)




                                                         2
•

                                                         1
    RDFa +26% (Facebook?)




                                                         0
                                                             RDFa   Microdata   geo   hcalendar   hcard   hreview     XFN

                                                                                       Format




                                   10
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org




                                11
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)




                                11
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

•   Have a look!



                                11
AWS Costs

•   Ca. 5500 Machine-Hours were required

    •   1100 € billed by AWS for that




                                 12
AWS Costs

•   Ca. 5500 Machine-Hours were required

    •   1100 € billed by AWS for that

•   Cost for other services negligible *




                                 12
AWS Costs

•   Ca. 5500 Machine-Hours were required

    •   1100 € billed by AWS for that

•   Cost for other services negligible *

•   * At first, we underestimated SDB cost



                                 12
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available




                             13
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets




                             13
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction




                             13
Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction

•   Choose your architecture wisely, test by experiment, for us
    EMR was too expensive.

                                13
Thank You!
              Questions?
            Want to hire me?


Web Resources: http://webdatacommons.org
     http://hannes.muehleisen.org

Más contenido relacionado

La actualidad más candente

Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowGary Stafford
 
Intro To MongoDB
Intro To MongoDBIntro To MongoDB
Intro To MongoDBAlex Sharp
 
OWASP Top 10 2021 Presentation (Jul 2022)
OWASP Top 10 2021 Presentation (Jul 2022)OWASP Top 10 2021 Presentation (Jul 2022)
OWASP Top 10 2021 Presentation (Jul 2022)TzahiArabov
 
XXE: How to become a Jedi
XXE: How to become a JediXXE: How to become a Jedi
XXE: How to become a JediYaroslav Babin
 
Unrestricted file upload CWE-434 - Adam Nurudini (ISACA)
Unrestricted file upload CWE-434 -  Adam Nurudini (ISACA)Unrestricted file upload CWE-434 -  Adam Nurudini (ISACA)
Unrestricted file upload CWE-434 - Adam Nurudini (ISACA)Adam Nurudini
 
Buffer overflow attacks
Buffer overflow attacksBuffer overflow attacks
Buffer overflow attacksJoe McCarthy
 
Docker 101 - Nov 2016
Docker 101 - Nov 2016Docker 101 - Nov 2016
Docker 101 - Nov 2016Docker, Inc.
 
DNS exfiltration using sqlmap
DNS exfiltration using sqlmapDNS exfiltration using sqlmap
DNS exfiltration using sqlmapMiroslav Stampar
 
Introducing the New DSpace User Interface
Introducing the New DSpace User InterfaceIntroducing the New DSpace User Interface
Introducing the New DSpace User InterfaceTim Donohue
 
Les technologies du Web appliquées aux données structurées (1ère partie : Enc...
Les technologies du Web appliquées aux données structurées (1ère partie : Enc...Les technologies du Web appliquées aux données structurées (1ère partie : Enc...
Les technologies du Web appliquées aux données structurées (1ère partie : Enc...Gautier Poupeau
 
DSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRISDSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRISAndrea Bollini
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M usersJongyoon Choi
 
Web Hacking With Burp Suite 101
Web Hacking With Burp Suite 101Web Hacking With Burp Suite 101
Web Hacking With Burp Suite 101Zack Meyers
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 

La actualidad más candente (20)

Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache Airflow
 
Intro To MongoDB
Intro To MongoDBIntro To MongoDB
Intro To MongoDB
 
OWASP Top 10 2021 Presentation (Jul 2022)
OWASP Top 10 2021 Presentation (Jul 2022)OWASP Top 10 2021 Presentation (Jul 2022)
OWASP Top 10 2021 Presentation (Jul 2022)
 
XXE: How to become a Jedi
XXE: How to become a JediXXE: How to become a Jedi
XXE: How to become a Jedi
 
Unrestricted file upload CWE-434 - Adam Nurudini (ISACA)
Unrestricted file upload CWE-434 -  Adam Nurudini (ISACA)Unrestricted file upload CWE-434 -  Adam Nurudini (ISACA)
Unrestricted file upload CWE-434 - Adam Nurudini (ISACA)
 
Buffer overflow attacks
Buffer overflow attacksBuffer overflow attacks
Buffer overflow attacks
 
Docker 101 - Nov 2016
Docker 101 - Nov 2016Docker 101 - Nov 2016
Docker 101 - Nov 2016
 
DNS exfiltration using sqlmap
DNS exfiltration using sqlmapDNS exfiltration using sqlmap
DNS exfiltration using sqlmap
 
Introducing the New DSpace User Interface
Introducing the New DSpace User InterfaceIntroducing the New DSpace User Interface
Introducing the New DSpace User Interface
 
CKAN overview
CKAN overviewCKAN overview
CKAN overview
 
Les technologies du Web appliquées aux données structurées (1ère partie : Enc...
Les technologies du Web appliquées aux données structurées (1ère partie : Enc...Les technologies du Web appliquées aux données structurées (1ère partie : Enc...
Les technologies du Web appliquées aux données structurées (1ère partie : Enc...
 
DSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRISDSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRIS
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
 
Web Hacking With Burp Suite 101
Web Hacking With Burp Suite 101Web Hacking With Burp Suite 101
Web Hacking With Burp Suite 101
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Digital Preservation with Archivematica: An Introduction
Digital Preservation with Archivematica: An IntroductionDigital Preservation with Archivematica: An Introduction
Digital Preservation with Archivematica: An Introduction
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 

Similar a AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012

Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AITorsten Steinbach
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMark Kromer
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMinsk MongoDB User Group
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Kristi Lewandowski
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017SingleStore
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Kristi Lewandowski
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Laure Vergeron
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FSMongoDB
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsMongoDB
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformMaris Elsins
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveTorsten Steinbach
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceRod Boothby
 

Similar a AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012 (20)

Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup Presentation
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebService
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FS
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and Results
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage Service
 

Más de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012

  • 1. Large-Scale Analysis of Web Pages − on a Startup Budget? Hannes Mühleisen, Web-Based Systems Group AWS Summit 2012 | Berlin
  • 3. Our Starting Point • Websites now embed structured data in HTML 2
  • 4. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible • schema.org, Open Graph protocol, ... 2
  • 5. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible • schema.org, Open Graph protocol, ... • Various Encoding Formats possible • μFormats, RDFa, Microdata 2
  • 6. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible • schema.org, Open Graph protocol, ... • Various Encoding Formats possible • μFormats, RDFa, Microdata Question: How are Vocabularies and Formats used? 2
  • 7. Web Indices • To answer our question, we need to access to raw Web data. 3
  • 8. Web Indices • To answer our question, we need to access to raw Web data. • However, maintaining Web indices is insanely expensive • Re-Crawling, Storage, currently ~50 B pages (Google) 3
  • 9. Web Indices • To answer our question, we need to access to raw Web data. • However, maintaining Web indices is insanely expensive • Re-Crawling, Storage, currently ~50 B pages (Google) • Google and Bing have indices, but do not let outsiders in 3
  • 10. Non-Profit Organization 4
  • 11. Non-Profit Organization • Runs crawler and provides HTML dumps 4
  • 12. Non-Profit Organization • Runs crawler and provides HTML dumps • Available data: • Index 02-12: 1.7 B URLs (21 TB) • Index 09/12: 2.8 B URLs (29 TB) 4
  • 13. Non-Profit Organization • Runs crawler and provides HTML dumps • Available data: • Index 02-12: 1.7 B URLs (21 TB) • Index 09/12: 2.8 B URLs (29 TB) • Available on AWS Public Data Sets 4
  • 14. Why AWS? • Now that we have a web crawl, how do we run our analysis? • Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!) 5
  • 15. Why AWS? • Now that we have a web crawl, how do we run our analysis? • Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!) • Preliminary analysis: 1 GB / hour / CPU possible • 8-CPU Desktop: 8 months • 64-CPU Server: 1 month • 100 8-CPU EC2-Instances: ~ 3 days 5
  • 17. Common Crawl Dataset Size 1 CPU, 1 h
  • 18. Common Crawl Dataset Size 1 CPU, 1 h 1000 € PC, 1 h
  • 19. Common Crawl Dataset Size 1 CPU, 1 h 1000 € PC, 1 h 5000 € Server, 1 h
  • 20. Common Crawl Dataset Size 1 CPU, 1 h 1000 € PC, 1 h 5000 € Server, 1 h 17 € EC2 Instances, 1 h
  • 21. AWS Setup • Data Input: Read Index Splits from S3 7
  • 22. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue 7
  • 23. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) 7
  • 24. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) • Result Output: Write to S3 7
  • 25. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) • Result Output: Write to S3 • Logging: SDB 7
  • 26. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets 42 ... EC2 42 43 ... R42 R43 ... CC WDC S3
  • 27. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets 42 ... EC2 42 43 ... R42 R43 ... CC WDC S3
  • 28. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets 42 ... EC2 42 43 ... R42 R43 ... CC WDC S3
  • 29. Results - Types of Data Microdata 02/2012 RDFa 02/2012 Website Structure 23 % 5e+06 RDFa 2009/2010 Microdata 2009/2010 Products, Reviews 19 % Entity Count (log) 5e+05 Movies, Music, ... 15 % 5e+04 Geodata 8 % 5e+03 People, Organizations 7 % 0 50 100 150 200 2012 Microdata Breakdown Type 9
  • 30. Results - Types of Data Microdata 02/2012 RDFa 02/2012 Website Structure 23 % 5e+06 RDFa 2009/2010 Microdata 2009/2010 Products, Reviews 19 % Entity Count (log) 5e+05 Movies, Music, ... 15 % 5e+04 Geodata 8 % 5e+03 People, Organizations 7 % 0 50 100 150 200 2012 Microdata Breakdown Type • Available data largely determined by major player support 9
  • 31. Results - Types of Data Microdata 02/2012 RDFa 02/2012 Website Structure 23 % 5e+06 RDFa 2009/2010 Microdata 2009/2010 Products, Reviews 19 % Entity Count (log) 5e+05 Movies, Music, ... 15 % 5e+04 Geodata 8 % 5e+03 People, Organizations 7 % 0 50 100 150 200 2012 Microdata Breakdown Type • Available data largely determined by major player support • “If Google consumes it, we will publish it” 9
  • 32. Results - Formats 2009/2010 • 4 02−2012 URLs with embedded Data: +6% Percentage of URLs 3 2 1 0 RDFa Microdata geo hcalendar hcard hreview XFN Format 10
  • 33. Results - Formats 2009/2010 • 4 02−2012 URLs with embedded Data: +6% Percentage of URLs 3 • Microdata +14% (schema.org?) 2 1 0 RDFa Microdata geo hcalendar hcard hreview XFN Format 10
  • 34. Results - Formats 2009/2010 • 4 02−2012 URLs with embedded Data: +6% Percentage of URLs 3 • Microdata +14% (schema.org?) 2 • 1 RDFa +26% (Facebook?) 0 RDFa Microdata geo hcalendar hcard hreview XFN Format 10
  • 35. Results - Extracted Data • Extracted data available for download at • www.webdatacommons.org 11
  • 36. Results - Extracted Data • Extracted data available for download at • www.webdatacommons.org • Formats: RDF (~90 GB) and CSV Tables for Microformats (!) 11
  • 37. Results - Extracted Data • Extracted data available for download at • www.webdatacommons.org • Formats: RDF (~90 GB) and CSV Tables for Microformats (!) • Have a look! 11
  • 38. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that 12
  • 39. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that • Cost for other services negligible * 12
  • 40. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that • Cost for other services negligible * • * At first, we underestimated SDB cost 12
  • 41. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available 13
  • 42. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets 13
  • 43. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets • AWS great for massive ad-hoc computing power and complexity reduction 13
  • 44. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets • AWS great for massive ad-hoc computing power and complexity reduction • Choose your architecture wisely, test by experiment, for us EMR was too expensive. 13
  • 45. Thank You! Questions? Want to hire me? Web Resources: http://webdatacommons.org http://hannes.muehleisen.org