Heuristics for Fixing Common Errors in Deployed schema.org Microdata
Robert Meusel and Heiko Paulheim
2
Motivation
Heuristics for Fixing Common Errors in Deployed schema.org Microdata - Meusel and Paulheim - ESWC 2015
3
Microdata in a Nutshell
▪ Adding structured information to web pages
• By marking up contents and entities
▪ Arbitrary vocabularies are possible
• In practice, only schema.org is deployed on a large scale
• Plus its historical predecessor: data-vocabulary.org
▪ Similar to RDFa
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
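To illustrate how a consumer might read markup like the snippet above, here is a minimal Python sketch built on the standard-library `html.parser`. The class name `ItempropCollector` is hypothetical, and the sketch assumes a single flat item without nested `itemscope` elements; a real Microdata extractor also has to handle nesting, `itemref`, and attribute-valued properties such as `content` or `href`.

```python
from html.parser import HTMLParser

EXAMPLE = """
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collects the text of itemprop elements of one flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.itemtype = None      # value of the first itemtype seen
        self.properties = {}      # itemprop name -> text content
        self._current = None      # itemprop whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.itemtype is None:
            self.itemtype = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        # Only text inside an itemprop element is recorded.
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


collector = ItempropCollector()
collector.feed(EXAMPLE)
print(collector.itemtype)
print(collector.properties)
```

The text between the `span` elements (the comma, for instance) is ignored because no `itemprop` is open at that point.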
4
Schema.org in a Nutshell
▪ Vocabulary for marking up entities on web pages
• 675 classes and 965 properties (as of May 2015, release 2.0)
▪ Promoted and consumed by major search engine companies
• Google, Bing, Yahoo!, and Yandex
• Google Rich Snippets
▪ Community-driven evolution and development
▪ Can be used with Microdata and RDFa
• Hardly used together with RDFa (<0.1% of RDFa-using websites [1])
[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
5
Schema.org in a Nutshell – Coverage
▪ Schema.org has incorporated some popular vocabularies, such as:
• GoodRelations (2012)
• W3C BibExtend (2014)
• MusicBrainz vocabulary (2015)
• Automotive Ontology (2015)
6
Microdata with Schema.org in HTML Pages
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580">
<h1> Predator Instinct FG Fußballschuh
</h1>
<div>
<meta content="EUR">
<span data-sale-price="219.95">219,95</span>
…
</body>
</html>
HTML pages directly embed markup languages that annotate items using different vocabularies
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh
</h1>
<div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://schema.org/Product/offers> _:node2 .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…
7
So Far, So Good …
▪ The schema is well documented on the schema.org website
▪ Data providers are supported by validation tools (e.g. the Yandex structured data validator) when deploying
▪ Win-win for both sides
▪ Plus: the data is (mostly) freely accessible on the Web
… but:
▪ Hundreds of thousands of data providers, most of whom are not schema.org experts or evangelists
▪ Validators and the schema documentation might help, but nobody is required to use them
8
So What Could Possibly Go Wrong?
▪ Usage of wrong namespaces
• http./schema.org
▪ Usage of undefined types
• http://schema.org/Breadcrumb
▪ Usage of undefined properties
• http://schema.org/postID
▪ Confusion of datatype properties and object properties
• _:n1 s:address “Jump Street 21”
▪ Property domain and range violations
• _:n1 a s:Product
_:n1 s:price “for free”
9
Compiling a Schema.org Dataset
▪ Starting point: all pages in the CommonCrawl that contain Microdata
▪ What could be (meant to be) schema.org?
• Everything that contains “schema.org” as a substring in a namespace
• Everything that contains URIs whose protocol and authority are similar to “http://schema.org/” (with an edit distance of 1)
• Noise filter: remove all namespaces that occur on only one website
Final corpus consists of:
6.4 billion triples
extracted from over 217 million pages
belonging to 398,542 data providers,
which is 86% of all Microdata in the corpus.
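The two selection criteria on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the substring test is made case-insensitive, "edit distance of 1" is read as "at most 1", and the noise filter (namespaces seen on only one website) is left out because it needs corpus-level counts.

```python
from urllib.parse import urlsplit

TARGET = "http://schema.org/"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def looks_like_schema_org(uri: str) -> bool:
    """True if the URI is a candidate for (intended) schema.org usage."""
    # Criterion 1: "schema.org" occurs as a substring (assumed case-insensitive).
    if "schema.org" in uri.lower():
        return True
    # Criterion 2: protocol plus authority is within edit distance 1 of the target.
    parts = urlsplit(uri)
    prefix = f"{parts.scheme}://{parts.netloc}/"
    return edit_distance(prefix, TARGET) <= 1


for candidate in ("http://schema.org/Product",
                  "http://shema.org/Product",
                  "http://example.org/Product"):
    print(candidate, looks_like_schema_org(candidate))
```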
10
Namespace Violations
▪ More than 98% of the preselected pages use a correct namespace
▪ Frequent namespace variations:
• http://www.schema.org/
• https://schema.org
• http:/schema.org
• http://SChema.org
Debated!
11
Undefined Types
▪ Used by around 6% of all data providers
▪ Typical causes:
• Misspellings: http://schema.org/Stores (defined: …/Store)
• Miscapitalization: http://schema.org/localbusiness (defined: …/LocalBusiness)
▪ Comparison to LOD compliance
• 5.8% of all Microdata documents
• 38.8% of all LOD documents (Hogan et al., 2010)
12
Undefined Properties
▪ Used by around 4% of all data providers
▪ Typical causes:
• Miscapitalization: http://schema.org/contentURL (defined: …/contentUrl)
• Near misses: http://schema.org/currency (defined: …/priceCurrency), http://schema.org/fax
• Made up: http://schema.org/blogId, http://schema.org/postId
▪ Comparison to LOD compliance
• 9.7% of all Microdata documents
• 72.4% of all LOD documents (Hogan et al., 2010)
13
Confusion of Object Properties with Data Properties
▪ i.e. using an object property with a string value
▪ Used by over 56.6% of all data providers
▪ Typical properties:
• http://schema.org/addressCountry
• http://schema.org/manufacturer
• http://schema.org/author
• http://schema.org/brand
▪ Comparison to LOD compliance
• 24.35% of all Microdata documents
• 8% of all LOD documents (Hogan et al., 2010)
14
Confusion of Data Properties with Object Properties
▪ i.e. using a data property with a complex object
▪ Used by less than 0.2% of all data providers
▪ Comparison to LOD compliance
• 0.6% of all Microdata documents
• 2.2% of all LOD documents (Hogan et al., 2010)
15
Property Domain Violations
▪ i.e. using a property with a subject not included in its domain
▪ Used by 4% of all data providers
▪ Typical violations are mainly shortcuts
• s:price used on s:Product (intended path: s:Product → s:Offer → s:price)
• s:streetAddress used on s:LocalBusiness (intended path: s:LocalBusiness → s:PostalAddress → s:streetAddress)
▪ Comparison to LOD compliance:
• Difficult to compare as semantics are different
• List of schema.org domains is exhaustive
• LOD: open world assumption
16
Data Property Range Violations
▪ i.e. using a data property with an incompatible literal
▪ Used by 9.6% of all data providers
▪ 20 most common violations:
• 13 dates
• 3 URLs
• 2 numbers
• 2 times
• Examples: “a month ago”, “2 pieces”, “last week”
▪ Comparison to LOD compliance:
• 12.06% of all Microdata documents
• 4.6% of all LOD documents (Hogan et al., 2010)
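A range check of this kind can be approximated with standard-library parsing. The sketch below is only an illustration: the helper names are hypothetical, dates are assumed to be ISO 8601 calendar dates (the format schema.org expects for `Date`), and numbers are assumed to be anything Python's `float` accepts, which is more lenient than a strict validator (it also accepts `nan` and `inf`).

```python
from datetime import date


def is_valid_date(literal: str) -> bool:
    """True if the literal parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(literal.strip())
    except ValueError:
        return False
    return True


def is_valid_number(literal: str) -> bool:
    """True if the literal parses as a plain number."""
    try:
        float(literal)
    except ValueError:
        return False
    return True


for value in ("2015-05-31", "a month ago", "last week"):
    print(repr(value), "date ok:", is_valid_date(value))
for value in ("219.95", "2 pieces"):
    print(repr(value), "number ok:", is_valid_number(value))
```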
17
Object Property Range Violations
▪ i.e. using an object property with a type outside its range
▪ Used by 8.6% of all data providers
▪ Typical violations:
• s:mainContentOfPage with s:Blog instead of s:WebPageElement (maybe a hint at a missing hierarchy relation?)
▪ Comparison to LOD compliance
• 3.2% of all Microdata documents
• 2.4% of all LOD documents (Hogan et al., 2010)
18
Schema.org Compliance Summary
▪ Surprisingly high level of compliance
▪ Providers are often not technology evangelists (unlike in LOD)
• Anybody can start publishing Microdata-annotated HTML
▪ Compliance is most often higher than for LOD
• Except for the confusion of data and object properties
Still, the number of erroneous pages could prevent data consumers from making use of the annotated data and understanding its semantics.
19
Identifying and Fixing Wrong Namespaces
▪ Main errors: missing slashes, wrong protocol, and wrong capitalization
▪ Simple rules to handle wrong namespaces
• Remove www
• Replace https with http
• Convert to lower case
• Add missing slashes and remove prefixes before schema.org
▪ Impact:
• 147 of 148 wrongly spelled namespaces could be fixed
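The rules above can be folded into a single rewrite step, sketched here in Python. The function name `fix_namespace` and the regular expression are assumptions made for illustration; the slides do not show the authors' implementation. Everything up to and including the first case-insensitive occurrence of `schema.org` (plus any following slashes) is replaced by the canonical `http://schema.org/`, and the local name is left untouched.

```python
import re

# Matches any prefix, the host "schema.org" in any capitalization,
# and zero or more slashes after it.
_SCHEMA_ORG_PREFIX = re.compile(r"^.*?schema\.org/*", re.IGNORECASE)

CANONICAL = "http://schema.org/"


def fix_namespace(uri: str) -> str:
    """Rewrite a schema.org-like URI to the canonical namespace."""
    uri = uri.strip()
    match = _SCHEMA_ORG_PREFIX.match(uri)
    if match is None:
        return uri  # not recognisably schema.org: leave unchanged
    return CANONICAL + uri[match.end():]


for raw in ("http://www.schema.org/Product",
            "https://schema.org",
            "http:/schema.org/Offer",
            "http://SChema.org/Person"):
    print(raw, "->", fix_namespace(raw))
```

A single pattern keeps the sketch short, but it is deliberately greedy about prefixes: a host such as `myschema.org` would also be rewritten, and misspelled hosts are not repaired at all.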
20
Handling Undefined Types and Properties
▪ Main errors are due to wrong capitalization
▪ Heuristic: ignore capitalization when parsing entities from web pages, and replace the schema element with the properly capitalized version
▪ Impact (together with namespace fixes):
• Correct type replacement within 71% of all data providers
• Correct property replacement within 65% of all data providers
• The remaining data providers account for over 70% of all undefined types and properties; these are hard-to-detect typos
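A case-insensitive lookup of this kind might look like the following Python sketch. The term sets are a tiny illustrative excerpt rather than the full vocabulary, and the function names are hypothetical. Types and properties are indexed separately because schema.org contains pairs that differ only in capitalization, such as the type `Event` and the property `event`.

```python
# Illustrative excerpt only; a real fixer would load the full schema.org term list.
TYPES = {"LocalBusiness", "Store", "Product", "Event"}
PROPERTIES = {"contentUrl", "priceCurrency", "event"}


def _index(terms):
    """Map the lower-cased form of each term to its official spelling."""
    return {term.lower(): term for term in terms}


_TYPE_INDEX = _index(TYPES)
_PROPERTY_INDEX = _index(PROPERTIES)


def fix_type(local_name):
    """Return the properly capitalized type, or None if it is unknown."""
    return _TYPE_INDEX.get(local_name.lower())


def fix_property(local_name):
    """Return the properly capitalized property, or None if it is unknown."""
    return _PROPERTY_INDEX.get(local_name.lower())


print(fix_type("localbusiness"))
print(fix_property("contentURL"))
print(fix_type("Stores"))
```

Misspellings such as `Stores` fall through as `None`, which matches the slide's remark that the remaining errors are typos that this heuristic cannot detect.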
21
Handling Object Properties with Literal Values
▪ Main objects modeled as literals are s:Organization, s:Person and s:PostalAddress
▪ Manually inspecting those values for the object properties s:author, s:creator and s:address
▪ Impact
• The heuristic could replace all misused object properties on 92,449 data providers
• Might lead to changes in the type distribution
• E.g. 14 million new entities of type s:PostalAddress
Before:
_:1 s:author “Robert” .
After:
_:1 s:author _:2 .
_:2 a s:Person .
_:2 s:name “Robert” .
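The rewrite shown in the triples above can be sketched as a small Python transformation. Everything named here is an assumption for illustration: the `Literal` wrapper, the `ASSUMED_TYPE` mapping from property to the type of the new entity, and the choice of `s:name` as the property that receives the original string for every type.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Literal:
    """Marks an object value as a plain string rather than a node."""
    value: str


# Assumed mapping from misused object property to the type of the new node.
ASSUMED_TYPE = {
    "s:author": "s:Person",
    "s:creator": "s:Person",
    "s:address": "s:PostalAddress",
}


def lift_literals(triples):
    """Replace literal objects of known object properties with typed nodes."""
    fresh = (f"_:gen{i}" for i in count(1))
    result = []
    for subject, predicate, obj in triples:
        target_type = ASSUMED_TYPE.get(predicate)
        if target_type is not None and isinstance(obj, Literal):
            node = next(fresh)
            result.append((subject, predicate, node))
            result.append((node, "a", target_type))
            result.append((node, "s:name", obj))
        else:
            result.append((subject, predicate, obj))
    return result


for triple in lift_literals([("_:1", "s:author", Literal("Robert"))]):
    print(triple)
```

Each lifted literal adds one new typed entity, which is how a fix of this kind can shift the type distribution as noted on the slide.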
22
Handling Property Domain Violations
▪ Main cause: shortcuts
▪ Heuristic: for a domain violation of property s:r on an entity of type s:t, find a property R and a type T that match one of two patterns
• Pattern 1 (s:t points to T via R):
R s:domainIncludes s:t .
R s:rangeIncludes T .
s:r s:domainIncludes T .
• Pattern 2 (T points to s:t via R):
R s:rangeIncludes s:t .
R s:domainIncludes T .
s:r s:domainIncludes T .
• The fix is applied only if there is one unique solution, for only one of the two patterns
▪ Impact:
• 31% of erroneous data providers could be fixed
• No solution or multiple solutions for the rest
[Diagram: _:1 s:aggregatedRating “5”, where s:aggregatedRating is not defined for the type of _:1. The fix attaches s:aggregatedRating to a second node _:2 instead; the type of _:2 (“Type?”) and the property linking _:1 and _:2 (“Property?”) have to be found.]
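The search for R and T can be sketched in Python over a small schema excerpt. The `DOMAIN` and `RANGE` tables below are a simplified, hypothetical slice of schema.org with inheritance flattened, and the uniqueness rule is implemented as one reading of the slide: accept a repair only when exactly one (R, T) pair exists and it comes from just one of the two patterns.

```python
# Simplified excerpt: property -> types listed in domainIncludes / rangeIncludes.
DOMAIN = {
    "s:offers": {"s:Product"},
    "s:price": {"s:Offer"},
    "s:address": {"s:LocalBusiness"},
    "s:streetAddress": {"s:PostalAddress"},
}
RANGE = {
    "s:offers": {"s:Offer"},
    "s:address": {"s:PostalAddress"},
}


def find_fix(r, t, domain=DOMAIN, range_=RANGE):
    """Return (pattern, R, T) if the repair is unambiguous, else None."""
    targets = sorted(domain.get(r, set()))  # every T with: r domainIncludes T
    # Pattern 1: R domainIncludes t, R rangeIncludes T.
    pattern1 = [(R, T) for R in sorted(domain) for T in targets
                if t in domain[R] and T in range_.get(R, set())]
    # Pattern 2: R rangeIncludes t, R domainIncludes T.
    pattern2 = [(R, T) for R in sorted(domain) for T in targets
                if t in range_.get(R, set()) and T in domain[R]]
    if len(pattern1) == 1 and not pattern2:
        return ("pattern 1", *pattern1[0])
    if len(pattern2) == 1 and not pattern1:
        return ("pattern 2", *pattern2[0])
    return None


print(find_fix("s:price", "s:Product"))
print(find_fix("s:streetAddress", "s:LocalBusiness"))
print(find_fix("s:price", "s:Person"))
```

With the full vocabulary many violations have several candidate pairs, which is consistent with the slide's note that only part of the erroneous providers could be fixed.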
23
Heuristics Summary
▪ Over 410 million wrong triples could be corrected
▪ Over 700 million missing triples could be added
▪ Corrections affected over 115,000 data providers in total
• ~28% of all data providers in the data set
24
LD4IE Challenge @ ISWC 2015
Learn to annotate entities on HTML pages using already annotated pages as a training set.
▪ Deadline: 2015-07-15
▪ Challenge page: goo.gl/laF6yl
▪ Contact: Heiko Paulheim (heiko@dwslab.de)
Good Luck!
25
Thank you! Questions? Feedback?
Data and more insights can be found at:
http://webdatacommons.org/structureddata/2013-11/stats/fixing_common_errors.html
More interesting datasets and analyses can be found on the WebDataCommons website:
http://webdatacommons.org/index.html
Acknowledgement
The extraction and analysis of the datasets were supported by an AWS in Education Grant.

fundamental of entomology all in one topics of entomology
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 

Heuristics for Fixing Common Errors in Deployed schema.org Microdata

  • 1. Heuristics for Fixing Common Errors in Deployed schema.org Microdata
      Robert Meusel and Heiko Paulheim
  • 2. Motivation
      Heuristics for Fixing Common Errors in Deployed schema.org Microdata - Meusel and Paulheim - ESWC 2015
  • 3. Microdata in a Nutshell
      ▪ Adding structured information to web pages
          • By marking up contents and entities
      ▪ Arbitrary vocabularies are possible
          • Practically, only schema.org is deployed on a large scale
          • Plus its historical predecessor: data-vocabulary.org
      ▪ Similar to RDFa
      <div itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="name">Data and Web Science Group</span>
        <span itemprop="addressLocality">Mannheim</span>,
        <span itemprop="postalCode">68131</span>
        <span itemprop="addressCountry">Germany</span>
      </div>
  • 4. Schema.org in a Nutshell
      ▪ Vocabulary for marking up entities on web pages
          • 675 classes and 965 properties (as of May 2015, release 2.0)
      ▪ Promoted and consumed by major search engine companies
          • Google, Bing, Yahoo!, and Yandex
          • Google Rich Snippets
      ▪ Community-driven evolution and development
      ▪ Can be used with Microdata and RDFa
          • Hardly used together with RDFa (<0.1% of RDFa-using websites [1])
      [1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
  • 5. Schema.org in a Nutshell – Coverage
      ▪ Schema.org has incorporated some popular vocabularies, such as:
          • GoodRelations (2012)
          • W3C BibExtend (2014)
          • MusicBrainz vocabulary (2015)
          • Automotive Ontology (2015)
  • 6. Microdata with Schema.org in HTML Pages
      HTML pages directly embed markup languages to annotate items using different vocabularies.
      Plain HTML:
      <html>
      …
      <body>
      …
      <div id="main-section" class="performance left" data-sku="M17242_580">
      <h1> Predator Instinct FG Fußballschuh </h1>
      <div>
      <meta content="EUR">
      <span data-sale-price="219.95">219,95</span>
      …
      </body>
      </html>
      The same HTML annotated with Microdata:
      <html>
      …
      <body>
      …
      <div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
      <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
      <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
      <meta itemprop="priceCurrency" content="EUR">
      <span itemprop="price" data-sale-price="219.95">219,95</span>
      …
      </body>
      </html>
      Extracted triples:
      _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
      _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
      _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
      _:node1 <http://schema.org/Offer/price> "219,95"@de .
      _:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
      …
  • 7. So Far, So Good …
      ▪ The schema is well explained on the schema.org website
      ▪ Data providers are supported by validation tools (e.g. the Yandex structured data validator) when deploying
      ▪ Win-win for both sides
      ▪ Plus: the data is (mostly) freely accessible on the Web
      … but:
      ▪ Hundreds of thousands of data providers, most of whom are not schema.org experts or evangelists
      ▪ Validators and the schema documentation might help, but nobody is required to use them
  • 8. So What Could Possibly Go Wrong?
      ▪ Usage of wrong namespaces
          • http./schema.org
      ▪ Usage of undefined types
          • http://schema.org/Breadcrumb
      ▪ Usage of undefined properties
          • http://schema.org/postID
      ▪ Confusion of datatype properties and object properties
          • _:n1 s:address "Jump Street 21"
      ▪ Property domain and range violations
          • _:n1 a s:Product
            _:n1 s:price "for free"
  • 9. Compiling a Schema.org Dataset
      ▪ Starting point: all pages in the CommonCrawl that contain Microdata
      ▪ What could be (meant to be) schema.org?
          • Everything that contains "schema.org" as a substring in a namespace
          • Everything that contains URIs whose protocol and authority are similar to "http://schema.org/" (edit distance of 1)
          • Filter noise: remove all namespaces that occur on only one website
      Final corpus: 6.4 billion triples extracted from over 217 million pages belonging to 398,542 data providers, which is 86% of all Microdata in the corpus.
  • 10. Namespace Violations
      ▪ More than 98% of the preselected pages use a correct namespace
      ▪ Frequent namespace variations:
          • http://www.schema.org/
          • https://schema.org
          • http:/schema.org
          • http://SChema.org
      Debated!
  • 11. Undefined Types
      ▪ Used by around 6% of all data providers
      ▪ Typical causes:
          • Misspellings: http://schema.org/Stores (correct: …/Store)
          • Miscapitalization: http://schema.org/localbusiness (correct: …/LocalBusiness)
      ▪ Comparison to LOD compliance
          • 5.8% of all Microdata documents
          • 38.8% of all LOD documents (Hogan et al., 2010)
  • 12. Undefined Properties
      ▪ Used by around 4% of all data providers
      ▪ Typical causes:
          • Miscapitalization: http://schema.org/contentURL (correct: …/contentUrl)
          • Near misses: http://schema.org/currency (correct: …/priceCurrency), http://schema.org/fax
          • Made up: http://schema.org/blogId, http://schema.org/postId
      ▪ Comparison to LOD compliance
          • 9.7% of all Microdata documents
          • 72.4% of all LOD documents (Hogan et al., 2010)
  • 13. Confusion of Object Properties with Data Properties
      ▪ i.e. using an object property with a string value
      ▪ Used by over 56.6% of all data providers
      ▪ Typical properties:
          • http://schema.org/addresscountry
          • http://schema.org/manufacturer
          • http://schema.org/author
          • http://schema.org/brand
      ▪ Comparison to LOD compliance
          • 24.35% of all Microdata documents
          • 8% of all LOD documents (Hogan et al., 2010)
  • 14. Confusion of Data Properties with Object Properties
      ▪ i.e. using a data property with a complex object
      ▪ Used by less than 0.2% of all data providers
      ▪ Comparison to LOD compliance
          • 0.6% of all Microdata documents
          • 2.2% of all LOD documents (Hogan et al., 2010)
  • 15. Property Domain Violations
      ▪ i.e. using a property with a subject not included in its domain
      ▪ Used by 4% of all data providers
      ▪ Typical violations are mainly shortcuts
          • s:price used on s:Product (intended path: s:Product → s:Offer → s:price)
          • s:streetAddress used on s:LocalBusiness (intended path: s:LocalBusiness → s:PostalAddress → s:streetAddress)
      ▪ Comparison to LOD compliance:
          • Difficult to compare, as the semantics are different
          • The list of schema.org domains is exhaustive
          • LOD: open world assumption
  • 16. Data Property Range Violations
      ▪ i.e. using a data property with an incompatible literal
      ▪ Used by 9.6% of all data providers
      ▪ 20 most common violations:
          • 13 dates
          • 3 URLs
          • 2 numbers
          • 2 times
      ▪ Example literals: "a month ago", "2 pieces", "last week"
      ▪ Comparison to LOD compliance:
          • 12.06% of all Microdata documents
          • 4.6% of all LOD documents (Hogan et al., 2010)
  • 17. Object Property Range Violations
      ▪ i.e. using an object property with a type outside its range
      ▪ Used by 8.6% of all data providers
      ▪ Typical violations:
          • s:mainContentOfPage with s:Blog instead of s:WebPageElement
            (Maybe a hint at a missing hierarchy relation?)
      ▪ Comparison to LOD compliance
          • 3.2% of all Microdata documents
          • 2.4% of all LOD documents (Hogan et al., 2010)
  • 18. Schema.org Compliance Summary
      ▪ Surprisingly high level of compliance
      ▪ Providers are often not technology evangelists (unlike in LOD)
          • Anybody can start publishing Microdata-annotated HTML
      ▪ Compliance is in most cases higher than for LOD
          • Except for the confusion of data and object properties
      Still, the number of erroneous pages could prevent data consumers from making use of the annotated data and understanding its semantics.
  • 19. Identifying and Fixing Wrong Namespaces
      ▪ Main errors are due to missing slashes, wrong protocol, and capitalization
      ▪ Simple rules to handle wrong namespaces
          • Removal of www
          • Replacement of https by http
          • Conversion to lower case
          • Adding missing slashes and removing prefixes before schema.org
      ▪ Impact:
          • 147 of 148 misspelled namespaces could be fixed
  • 20. Handling Undefined Types and Properties
      ▪ Main errors are due to wrong capitalization
      ▪ Heuristic: ignore capitalization when parsing entities from web pages, and replace the schema element with the properly capitalized version
      ▪ Impact (together with namespace fixes):
          • Correct type replacement within 71% of all data providers
          • Correct property replacement within 65% of all data providers
          • Remaining data providers account for over 70% of all undefined types and properties and are hard-to-detect typos
  • 21. Handling Object Properties with Literal Values
      ▪ Main objects modeled as literals are s:Organization, s:Person and s:PostalAddress
      ▪ Manually inspecting those values for the object properties s:author, s:creator and s:address
      ▪ Impact
          • The heuristic could replace all misused object properties on 92,449 data providers
          • Might lead to changes in the type distribution
          • E.g. 14 million new entities of type s:PostalAddress
      Before:
      _:1 s:author "Robert" .
      After:
      _:1 s:author _:2 .
      _:2 a s:Person .
      _:2 s:name "Robert" .
  • 22. Handling Property Domain Violations
      ▪ Main cause: shortcuts
      ▪ Heuristic to find the property R and type T for a domain violation of property s:r on a subject of type s:t. A fix requires one unique solution for only one of the two patterns:
          • Pattern 1: R s:domainIncludes s:t . R s:rangeIncludes T . s:r s:domainIncludes T .
          • Pattern 2: R s:rangeIncludes s:t . R s:domainIncludes T . s:r s:domainIncludes T .
      ▪ Example: _:1 s:aggregatedRating "5", where s:aggregatedRating is not defined for the type of _:1. The repair moves s:aggregatedRating to an intermediate node _:2; the type of _:2 (T) and the property connecting _:1 and _:2 (R) are the unknowns.
      ▪ Impact:
          • 31% of erroneous data providers could be fixed
          • No solution or multiple solutions for the rest
  • 23. Heuristics Summary
      ▪ Over 410 million wrong triples could be corrected
      ▪ Over 700 million missing triples could be added
      ▪ In total, the corrections affected over 115,000 data providers
          • ~28% of all data providers in the data set
  • 24. LD4IE Challenge @ ISWC 2015
      Learn to annotate entities on HTML pages using already annotated pages as training set.
      ▪ Deadline: 2015-07-15
      ▪ Challenge page: goo.gl/laF6yl
      ▪ Contact: Heiko Paulheim (heiko@dwslab.de)
      Good luck!
  • 25. Thank you! Questions? Feedback?
      Data and more insights can be found at: http://webdatacommons.org/structureddata/2013-11/stats/fixing_common_errors.html
      More interesting datasets and analyses can be found on the WebDataCommons website: http://webdatacommons.org/index.html
      Acknowledgement: The extraction and analysis of the datasets was supported by an AWS in Education Grant.
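
The namespace rules on slide 19 and the capitalization heuristic on slide 20 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `fix_namespace` and `fix_term` are hypothetical, `KNOWN_TERMS` is a tiny hand-picked excerpt standing in for the full schema.org term list, and `http://schema.org/` is assumed as the canonical namespace because the slides replace https by http.

```python
import re

# Canonical namespace as used on the slides (https is rewritten to http).
CANONICAL_NS = "http://schema.org/"

# Hypothetical excerpt of the vocabulary; a real run would load every
# class and property name defined by schema.org.
KNOWN_TERMS = ["Store", "LocalBusiness", "Product", "Offer",
               "contentUrl", "priceCurrency"]
_BY_LOWER = {term.lower(): term for term in KNOWN_TERMS}


def fix_namespace(uri):
    """Rewrite a URI that mentions schema.org to the canonical namespace.

    Everything before "schema.org" (protocol, www, stray characters) is
    dropped, the authority is lower-cased, and a single slash is placed
    between namespace and term. Returns None if the URI does not mention
    schema.org at all.
    """
    match = re.search(r"schema\.org", uri, flags=re.IGNORECASE)
    if match is None:
        return None
    local_name = uri[match.end():].lstrip("/")
    return CANONICAL_NS + local_name


def fix_term(uri):
    """Fix the namespace, then restore the proper capitalization of the term.

    Terms that are unknown even when capitalization is ignored (real
    misspellings such as "Stores") are returned with a fixed namespace only.
    """
    fixed = fix_namespace(uri)
    if fixed is None:
        return None
    local_name = fixed[len(CANONICAL_NS):]
    canonical = _BY_LOWER.get(local_name.lower())
    return CANONICAL_NS + canonical if canonical else fixed


if __name__ == "__main__":
    for raw in ["http://www.schema.org/localbusiness",
                "http:/schema.org/contentURL",
                "http://schema.org/Stores"]:
        print(raw, "->", fix_term(raw))
```

Note that matching on the substring `schema.org` is deliberately permissive, so a production version would probably want to check the authority part of the URI more strictly before rewriting it.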
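
The rewrite shown on slide 21, where a literal used with an object property is replaced by a new typed node, might look roughly like the following sketch. The triple representation, the `Literal` wrapper, the generated node labels, and the `DEFAULT_TYPE` mapping are all assumptions made for illustration; in particular, the slides only show the `s:author` to `s:Person` case, and storing the original string under `s:name` for every type is a simplification.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Literal:
    """Marks an object value as a plain string rather than a node."""
    value: str


# Assumed mapping from object property to the type of the node to create.
# Only s:author -> s:Person is shown on the slides; the rest is a guess.
DEFAULT_TYPE = {
    "s:author": "s:Person",
    "s:creator": "s:Person",
    "s:address": "s:PostalAddress",
}


def promote_literals(triples):
    """Replace literal objects of known object properties by typed nodes.

    Each repaired triple is rewritten to point at a fresh blank node, and
    two triples are added: one for the node's type, one keeping the
    original string as s:name.
    """
    fresh_ids = count(1)
    repaired = []
    for subject, prop, obj in triples:
        if isinstance(obj, Literal) and prop in DEFAULT_TYPE:
            node = f"_:gen{next(fresh_ids)}"
            repaired.append((subject, prop, node))
            repaired.append((node, "a", DEFAULT_TYPE[prop]))
            repaired.append((node, "s:name", obj))
        else:
            repaired.append((subject, prop, obj))
    return repaired


if __name__ == "__main__":
    for triple in promote_literals([("_:1", "s:author", Literal("Robert"))]):
        print(triple)
```

Every repaired triple adds two new ones, which is the kind of change to the type distribution that slide 21 warns about.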
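
The two search patterns on slide 22 can be read as a lookup over the `domainIncludes` and `rangeIncludes` statements of the schema. The sketch below is one possible reading, not the authors' code: `SCHEMA` is a toy excerpt with simplified domains and ranges (real schema.org definitions are broader and involve subclass inheritance), `find_shortcut_fix` is a hypothetical name, and "one unique solution for only one of the two patterns" is interpreted as "exactly one candidate across both patterns".

```python
# Toy excerpt of schema.org domain/range statements (simplified assumption).
SCHEMA = {
    "s:offers": {"domain": {"s:Product"}, "range": {"s:Offer"}},
    "s:price": {"domain": {"s:Offer"}, "range": set()},
    "s:address": {"domain": {"s:LocalBusiness"}, "range": {"s:PostalAddress"}},
    "s:streetAddress": {"domain": {"s:PostalAddress"}, "range": set()},
}


def find_shortcut_fix(subject_type, prop, schema):
    """Find the linking property R and intermediate type T for a shortcut.

    Pattern 1 ("forward"): R is defined on the subject's type and points
    to a type T on which `prop` is defined.
    Pattern 2 ("backward"): R is defined on T and points to the subject's
    type, and `prop` is defined on T.
    Returns (pattern, R, T) if there is exactly one candidate overall,
    otherwise None (no solution or an ambiguous one).
    """
    valid_domains = schema[prop]["domain"]
    candidates = []
    for linking_prop, spec in schema.items():
        if linking_prop == prop:
            continue
        if subject_type in spec["domain"]:
            for target in sorted(spec["range"] & valid_domains):
                candidates.append(("forward", linking_prop, target))
        if subject_type in spec["range"]:
            for target in sorted(spec["domain"] & valid_domains):
                candidates.append(("backward", linking_prop, target))
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    print(find_shortcut_fix("s:Product", "s:price", SCHEMA))
    print(find_shortcut_fix("s:LocalBusiness", "s:streetAddress", SCHEMA))
```

With the toy schema, the two shortcuts from slide 15 resolve to `s:Product → s:Offer → s:price` and `s:LocalBusiness → s:PostalAddress → s:streetAddress`. Returning `None` on ambiguity mirrors the slide's remark that the remaining providers have no solution or multiple solutions.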