The WebDataCommons 
Microdata, RDFa, and Microformat 
Dataset Series 
Robert Meusel, Petar Petrovski, and 
Christian Bizer
HTML-embedded Structured Data on the Web 
More and more websites semantically mark up the content of their HTML pages.
RDFa 
Microdata 
Microformats 
The WebDataCommons Microdata, RDFa, and Microformats Dataset Series
Dataset Creation
▪ Common Crawl Foundation Corpora of 2010, 2012 and 2013
• Snapshot of popular pages of the Web
• Continuously new crawls available
▪ Parsing the HTML pages using Apache Any23
• Using a distributed framework on 100 parallel EC2 instances
Example of extracted statements:
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
…
The framework is easy to adapt and is publicly available at:
http://webdatacommons.org/framework/
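
As a rough illustration of how extracted statements like the ones on this slide could be consumed, the following Python sketch parses a single N-Triples/N-Quads style line with a regular expression and decodes `\uXXXX` escapes in literals. It is not part of the WebDataCommons framework; the names `parse_statement` and `unescape` and the graph URL `http://example.com/shoe` are assumptions made for this example, and the pattern covers only the simple statement shapes shown on the slide, not the full N-Quads grammar.

```python
import re

# One statement: subject, predicate, object, optional graph IRI, final dot.
# Simplified pattern (assumption): it does not implement the full N-Quads grammar.
STATEMENT = re.compile(
    r'^(?P<s>_:\S+|<[^>]*>)\s+'
    r'(?P<p><[^>]*>)\s+'
    r'(?P<o><[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s*'
    r'(?P<g><[^>]*>)?\s*\.\s*$'
)


def unescape(text):
    """Decode \\uXXXX escapes only; other escape forms are left untouched."""
    return re.sub(
        r'\\u([0-9A-Fa-f]{4})',
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


def parse_statement(line):
    """Return the parts of one statement as a dict, or None if it does not match."""
    match = STATEMENT.match(line.strip())
    if match is None:
        return None
    obj = match.group("o")
    if obj.startswith('"'):
        obj = unescape(obj)
    return {
        "subject": match.group("s"),
        "predicate": match.group("p"),
        "object": obj,
        "graph": match.group("g"),  # None when the line is a plain triple
    }


# Hypothetical quad: the statement from the slide plus an invented graph URL.
EXAMPLE = (
    r'_:node1 <http://schema.org/Product/name> '
    r'"Predator Instinct FG Fu\u00DFballschuh"@de <http://example.com/shoe> .'
)
```

Calling `parse_statement(EXAMPLE)` yields a dictionary whose `object` holds the decoded product name and whose `graph` holds the page URL. For production use, an established RDF parser would be the safer choice.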
Dataset Series Overview 
▪ The series contains three datasets, from 2010, 2012 and 2013
▪ Altogether over 30 billion RDF quads
▪ Each dataset is further split into subsets, each containing the quads extracted for one particular markup language
Overview of 2013 dataset 
▪ Over 1.7 million domains use at least one markup language
▪ Over 17 billion quads describing over 4 billion records (typed entities)
▪ hCard is still the most dominant format among domains
▪ Microdata contributes the largest number of quads
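
The last two points use different measures: one counts domains, the other counts quads. The Python sketch below shows how both could be computed side by side. The input shape (one `(markup_format, domain)` pair per quad), the function name `format_statistics`, and the toy data are assumptions for illustration and do not reflect the real dataset figures.

```python
from collections import Counter, defaultdict


def format_statistics(quads):
    """Count quads and distinct domains per markup format.

    `quads` is an iterable of (markup_format, domain) pairs, one per quad.
    This input shape is an assumption made for the example.
    """
    quad_counts = Counter()
    domains = defaultdict(set)
    for markup_format, domain in quads:
        quad_counts[markup_format] += 1
        domains[markup_format].add(domain)
    return {
        fmt: {"quads": quad_counts[fmt], "domains": len(domains[fmt])}
        for fmt in quad_counts
    }


# Toy data (hypothetical): hCard appears on more domains,
# Microdata yields more quads from a single domain.
TOY_QUADS = [
    ("hcard", "a.example"),
    ("hcard", "b.example"),
    ("hcard", "c.example"),
    ("microdata", "shop.example"),
    ("microdata", "shop.example"),
    ("microdata", "shop.example"),
    ("microdata", "shop.example"),
    ("microdata", "shop.example"),
]
```

With the toy data, `format_statistics(TOY_QUADS)` reports three domains and three quads for `hcard`, and one domain and five quads for `microdata`, which is how a format can lead on one measure and trail on the other.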
Divergence in Class and Property Usage in 2013 
▪ A small number of classes and properties is used by a large number of domains
▪ RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains
▪ Microdata (MD): 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains
Classes and properties used by only one domain are mostly typos
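
The "used by at least two different domains" filter can be sketched in Python as follows. The input shape (pairs of domain and class URI), the function names, and the sample data, including the misspelled class `http://schema.org/Prodcut`, are assumptions for illustration only.

```python
from collections import defaultdict


def classes_by_domain_count(observations):
    """Map each class URI to the number of distinct domains using it.

    `observations` is an iterable of (domain, class_uri) pairs (assumed shape).
    """
    domains_per_class = defaultdict(set)
    for domain, class_uri in observations:
        domains_per_class[class_uri].add(domain)
    return {cls: len(doms) for cls, doms in domains_per_class.items()}


def shared_classes(observations, min_domains=2):
    """Return the classes used by at least `min_domains` different domains."""
    counts = classes_by_domain_count(observations)
    return {cls for cls, n in counts.items() if n >= min_domains}


# Hypothetical observations: one class is a typo that occurs on a single domain.
SAMPLE = [
    ("shop-a.example", "http://schema.org/Product"),
    ("shop-b.example", "http://schema.org/Product"),
    ("shop-a.example", "http://schema.org/Product"),
    ("shop-a.example", "http://schema.org/Prodcut"),
]
```

Here `shared_classes(SAMPLE)` keeps only `http://schema.org/Product`; the misspelled class is dropped because a single domain uses it.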
RDFa Insights 2013 
▪ Usage of various vocabularies to describe information:
• Strong presence of the Open Graph Protocol (e.g. Facebook)
• FOAF and SIOC (blog software such as Drupal)
▪ Largest topics covered are:
• Articles and Documents (Blogs and News portals) 
• Products, Reviews and Ratings 
• Organizations 
Microdata Insights 2013 and 2012 
▪ Clear increase in deployment compared to 2012
▪ Still two vocabularies deployed: data-vocabulary and schema.org
▪ Largest topical areas:
• Postal Addresses and Locations 
• Products, Offers and Ratings 
• Organizations and Persons 
• Articles and Blogs 
• Breadcrumb 
Focus on Schema.org/Product 
▪ One of the largest publicly available product collections
▪ Almost 100 million records described with name, offer and image
▪ 34 million records contain a further description
▪ 11% of all product records include a brand
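
A figure such as the share of product records that include a brand is a simple property-coverage ratio. The Python sketch below shows one way to compute it. The record representation (one dictionary of property names to values per product), the function name `property_coverage`, and the sample records are assumptions for illustration, not the authors' analysis code.

```python
def property_coverage(records, prop):
    """Return the fraction of records with a non-empty value for `prop`.

    `records` is an iterable of dicts mapping property names to values
    (an assumed representation of product records).
    """
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(prop))
    return filled / len(records)


# Hypothetical product records; only the first one carries a usable brand.
SAMPLE_PRODUCTS = [
    {"name": "Shoe A", "offer": "99.00", "brand": "ExampleBrand"},
    {"name": "Shoe B", "offer": "59.00"},
    {"name": "Shoe C", "brand": ""},
    {"name": "Shoe D", "image": "d.jpg"},
]
```

For the sample records, `property_coverage(SAMPLE_PRODUCTS, "brand")` is 0.25, since an empty brand value is treated as missing.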
Microformats Insights 2013 
▪ Most dominant vocabulary is hCard
▪ Still a very solid deployment
▪ Topics are:
• Persons & Organizations 
• Events 
• Products and reviews 
• Recipes 
Opportunities & Challenges 
Opportunities 
▪ Vast amounts of free data, created by people all over the world
▪ Large topical coverage, from broad areas (such as products) to niches (such as recipes)
▪ Highly up-to-date information, as popular pages potentially update their content frequently
Challenges
▪ Data quality assessment, as the data is created by experts and novices alike
▪ Further information extraction, as a flat schema and a rather low number of properties are used
▪ Identity resolution, as the data hardly contains any identifiers
Possible Application Domains 
▪ Enriching existing knowledge bases
• E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies to add missing information and extend entity knowledge
• As shown by Lehmberg et al., winners of the Semantic Web Challenge (Big Data Track) 2014, this data can be used as an additional source (besides others) to gather and return wider search results
▪ Design and adaptation of algorithms and methods to cope with the characteristics of such web data
• Training of data extraction methods to gather data within the HTML pages that is not marked up
• Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions
▪ Starting point for further data discovery
• The dataset can be used as a starting point for further data crawling, as in most cases not all pages from a domain are included
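
As a hedged sketch of the "starting point for further crawling" idea, the Python snippet below groups page URLs by host so that each host can serve as a crawl seed. It assumes the page URLs have already been collected, for instance from the graph element of the quads; the function name `seeds_by_host` and the sample URLs are invented for this example. Grouping by pay-level domain would additionally need a public suffix list, which is left out here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def seeds_by_host(page_urls):
    """Group known page URLs by host name to build a crawl seed list.

    Entries without a recognisable host are skipped.
    """
    seeds = defaultdict(set)
    for url in page_urls:
        host = urlsplit(url).hostname
        if host:
            seeds[host].add(url)
    return {host: sorted(urls) for host, urls in seeds.items()}


# Hypothetical page URLs, e.g. taken from the graph element of extracted quads.
SAMPLE_URLS = [
    "http://shop.example/p/1",
    "http://shop.example/p/2",
    "http://blog.example/post",
    "not a url",
]
```

A crawler could then start from the known pages of each host and follow links to reach pages that are missing from the dataset.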
Thank you! Questions? Feedback? 
Data and more statistics can be found at: 
http://webdatacommons.org/structureddata/index.html 
More interesting datasets and analysis can be found at the 
website of WebDataCommons: 
http://webdatacommons.org/index.html 
Acknowledgement 
The extraction and analysis of the datasets were supported by an AWS in Education Grant and the EU FP7 project LOD2. Special thanks to SWSA for supporting the travel to ISWC 2014.
The Web Data Commons Microdata, RDFa, and Microformat Dataset Series @ ISWC2014
