Completing Knowledge Graphs using Data from the Open Web
Prof. Dr. Christian Bizer
Data and Web Science Group
JIST2019, Nov. 25-27, 2019, Hangzhou, China
Knowledge Graphs
– Cross-Domain Knowledge Graphs
– Domain-Specific Knowledge Graphs
– Product Knowledge Graphs
– Scholarly Knowledge Graphs
– Life Science Knowledge Graphs
– Enterprise Knowledge Graphs
Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.26
Knowledge Graphs are Incomplete
[Figure: matrix for one class in the KG. Rows are entities (E1 … Em), columns are properties (P1 … PN), cells are facts; "?" marks missing entries. Source: Luna Dong]
Types of Incompleteness
[Figure: entity/property matrix for a class in the KG, annotated with three kinds of gaps]
– Relevant facts are missing
– Relevant properties are missing in schema
– Relevant entities are not described
Knowledge Graph Completion Tasks
[Figure: entity/property matrix for a class in the KG, annotated with the four completion tasks]
– Link Prediction: add object property values for existing instances and properties
– Literal Slot Filling: add datatype property values for existing instances and properties
– Schema Expansion: add and fill new properties
– Entity Expansion: add new entities and their descriptions (values for existing properties)
Outline
1. Internal Knowledge Graph Completion
2. KG Completion using Small Sets of External Sources
3. KG Completion using Web Table Data
– Profile of HTML Tables on the Web
– Potential for KG Completion
4. KG Completion using Schema.org Annotations
– Profile of Schema.org Annotations on Web
– Potential for KG Completion
5. Conclusions
1. Internal Knowledge Graph Completion
– Predict missing facts by exploiting patterns in the graph
– Classic value imputation applied to knowledge graphs 
– Symbolic methods
– Rule learning, e.g. AMIE, AnyBURL
– Heuristics, e.g. SDType
– Sub‐symbolic methods
– Knowledge Graph Embeddings
Paulheim: Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. SWJ 2017.
Nguyen: An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion. ArXiv, 2019.
Knowledge Graph Embeddings:
Hits@10 Results on Link Prediction Task 
Nguyen: An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion. ArXiv, 2019.
Weaknesses of Current Research
on Knowledge Graph Embeddings
1. Only link prediction covered
– What about literal slot filling? Entity expansion? Schema expansion? 
2. Evaluations self-referential
– Hits@10 not relevant for KB completion
– Hits@1 performance too low for practical use
– Accuracy evaluated on simplified dataset: FB13, 6 relations dropped!
Method Hits@1 on FB15K‐237
ComplEx (2018) 0.26
ConvE (2018) 0.24
TuckER (2019) 0.26
AnyBURL (2019) 0.23 (symbolic rule learner)
Meilicke, et al.: Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion.
Wang, Ruffinelli: On Evaluating Embedding Models for Knowledge Base Completion. RepL4NLP-2019.
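The gap between Hits@10 and Hits@1 is easier to see with the metric written out. The following is a minimal Python sketch, not the evaluation code of any cited paper. The function name `hits_at_k` and the rank list are illustrative assumptions; the sketch assumes each test triple already has the 1-based rank of its correct entity.

```python
def hits_at_k(ranks, k):
    """Return the fraction of test triples whose correct entity ranks in the top k.

    ranks: 1-based rank of the correct entity for each test triple.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Made-up ranks for ten test triples (illustration only, not real results).
example_ranks = [1, 3, 12, 7, 1, 25, 2, 60, 9, 4]
hits1 = hits_at_k(example_ranks, 1)    # only a correct top-ranked answer counts
hits10 = hits_at_k(example_ranks, 10)  # any correct answer within the first ten counts
```

With these made-up ranks, Hits@10 is far higher than Hits@1. When completing a KG, typically only the top-ranked prediction would be added as a fact, which is why Hits@1 is the more relevant number.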
2. KG Completion Using a Small Set of External Sources
– Open License Sources
– Commercial Sources
[Figure: logos of example sources, e.g. Baike]
Integration Process
– Classic data integration techniques apply
– Small set of sources
– we can invest effort into the integration of each data source
– write rules, label training data
– quality depends on effort spent
– All tasks are covered
– link prediction, slot filling, entity expansion, schema expansion
Integration steps:
1. Data Collection / Extraction
2. Schema Mapping
3. Data Translation
4. Entity Matching
5. Data Quality Assessment
6. Data Fusion
Doan, Halevy, Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
KG Completion Example
– YAGO2 completed using Geonames
– 7 million additional entities from Geonames
– 10 additional object properties (relations)
– 320 million triples from Geonames
– Accuracy of spatial relations (manual evaluation)
Hoffart, et al.: YAGO2: A spatially and temporally enhanced knowledge base extracted from Wikipedia. JWS 2013.
Challenges
1. Coverage of external sources
– Relevant entities and relevant properties might not be covered
– e.g. Wikipedia
2. Up‐to‐dateness of external sources
– Are sources regularly maintained?
– e.g. LOD cloud, research data 
3. Cost of external sources
– Complete and up‐to‐date data often costs money
– Large companies can pay, other organizations often cannot
Let's use the Open Web
– Potentials 
1. Coverage of long‐tail entities
2. Coverage of long‐tail properties
3. Content often up‐to‐date
– Challenges
1. Heterogeneity
2. Data quality depends on website
Web Tables
Schema.org Annotations
<div itemscope itemtype="http://schema.org/Hotel">
<span itemprop="name">Vienna Marriott Hotel</span>
<span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Parkring 12a</span>
<span itemprop="addressLocality">Vienna</span>
</span>
</div>
<div itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name">Signature Instant Dome 7</h1>
Item # <span itemprop="productID">200734246807</span>
<img itemprop="image" src="../20073424680.jpg">
<span itemprop="review" itemscope itemtype="http://schema.org/Review">
<span itemprop="ratingValue">5</span>
<span itemprop="reviewBody">Great waterproof tent!</span>
</span>
</div>
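To illustrate how such annotations become data, here is a minimal Python sketch that collects `itemprop` values from Microdata markup using only the standard library. It is a simplification and not the Web Data Commons extraction code: the names `ItempropCollector`, `VOID_TAGS`, and `extract_itemprops` are hypothetical, nesting of items is flattened, and `itemtype` is ignored.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; their value is read from an attribute.
VOID_TAGS = {"img", "meta", "link", "br", "hr", "input"}


class ItempropCollector(HTMLParser):
    """Collect flat (itemprop, value) pairs; item nesting is not reconstructed."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = []  # itemprop name (or None) for every currently open element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("itemprop")
        if tag in VOID_TAGS:
            if prop:
                value = attributes.get("src") or attributes.get("content") or ""
                self.pairs.append((prop, value))
            return
        self._open.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text belongs to the innermost open element if that element has an itemprop.
        if text and self._open and self._open[-1]:
            self.pairs.append((self._open[-1], text))


def extract_itemprops(html):
    """Return the list of (itemprop, value) pairs found in an HTML string."""
    collector = ItempropCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs
```

Applied to the hotel markup above, `extract_itemprops` would return the name, street address, and locality as three pairs. A production extractor would also have to group properties by item and handle RDFa and JSON-LD.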
3. Web Tables
Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
Crestan, Pantel: Web‐Scale Table Census and Classification. WSDM 2011.
In a corpus of 14B raw tables, 154M are relational tables (1.1%) (Cafarella et al. 2008). Relational tables are our focus.
Web Data Commons – Web Tables Corpus
Lehmberg, Ritze, Meusel, Bizer: A Large Public Corpus of Web Tables containing Time and Context Metadata. WWW2016 Companion.
– extracted from Common Crawl 2015
– 1.78 billion pages
– 10.2 billion raw HTML tables
– Size of the web table corpus
– 90 million relational tables (0.9%)
– 139 million attribute/value tables (1.3%)
– uses 100 machines on Amazon EC2
– approx. 2000 machine hours → 250 €
– Download
– http://webdatacommons.org/webtables/
Task 1:
Slot Filling and Link Prediction
[Figure: matrix for a DBpedia class. Rows are entities (E1 … Em), columns are attributes (A1 … AN); "?" marks empty slots to be filled]
Two Step Process
1. Matching
2. Data Fusion
[Figure: example using the IATA Code attribute]
Holistic Entity‐ and Schema Matching
– T2K+ matches millions of web tables to the DBpedia knowledge base
– Exploits
– wide range of features
– synergies between matching tasks
– table corpus for indirect matching
– patterns from KB for post filtering
– Results
Task      Precision  Recall  F1
Class     .94        .94     .94
Instance  .90        .76     .82
Property  .77        .65     .70
Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia – A Feature Utility Study. EDBT 2017.
ISWC2019 Challenge on Tabular Data 
to Knowledge Graph Matching
– Multiple tasks
– CPA task = property matching
http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
Large‐Scale Matching Results 
– 1 million out of 30 million tables match DBpedia (~3%)
– 301,450 matching tables have attribute correspondences (~1%) 
– Results in 8 million triples describing 717,000 DBpedia entities 
Ritze, et al.: Profiling the Potential of Web Tables for Augmenting Cross‐domain Knowledge Bases. WWW2016.
Data Fusion Results
– Fusion Results
Fusion Strategy        Precision  Recall  F1
Voting                 .369       .823    .509
PageRank               .365       .814    .504
Knowledge-based Trust  .639       .785    .705
– Group Sizes
– Large groups (Head): many values already exist in the KB
– Small groups (Tail): often new values but difficult to fuse correctly
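The difference between plain voting and a trust-weighted strategy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fusion code behind the reported results: the function `fuse`, the tuple layout, the site names, and the trust scores are all hypothetical, and how trust is estimated (for example from overlap with the KB) is not shown.

```python
from collections import defaultdict


def fuse(candidates, trust=None):
    """Pick one value per (entity, property) slot.

    candidates: iterable of (entity, property, value, source) tuples.
    trust: optional dict mapping source -> weight; without it every source
           counts as one vote (plain voting).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for entity, prop, value, source in candidates:
        weight = 1.0 if trust is None else trust.get(source, 0.0)
        scores[(entity, prop)][value] += weight
    # For each slot keep the value with the highest accumulated weight.
    return {slot: max(values, key=values.get) for slot, values in scores.items()}


# Made-up candidate values for one slot, coming from three hypothetical sites.
candidates = [
    ("Airport_X", "iataCode", "AAA", "site-a.example"),
    ("Airport_X", "iataCode", "AAA", "site-b.example"),
    ("Airport_X", "iataCode", "AAB", "site-c.example"),
]
voted = fuse(candidates)
weighted = fuse(candidates, trust={"site-a.example": 0.2,
                                   "site-b.example": 0.2,
                                   "site-c.example": 0.9})
```

In this toy input, voting follows the two agreeing sites, while the weighted variant follows the single site with the higher assumed trust. Small groups leave little evidence for either strategy, which fits the observation that tail values are hard to fuse correctly.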
What is going wrong? 
Topical Mismatch
Websites contributing many tables # Tables Topic
apple.com 50,910 Music
baseball‐reference.com 25,647 Sports
latestf1news.com 17,726 Sports
nascar.com 17,465 Sports
amazon.com 16,551 Products
wikipedia.org 13,993 Various
inkjetsuperstore.com 12,282 Products
flightmemory.com 8,044 Flights
windshieldguy.com 7,305 Products
citytowninfo.com 6,293 Cities
blogspot.com 4,762 Various
7digital.com 4,462 Music
What is going wrong?
Attribute Semantics
https://www.bls.gov/oes/2017/may/oes_ca.htm
Misinterpretation of Attribute Semantics
– Matching and fusion methods expect binary relations
– But many relations in Web tables are not binary!
Table 1:
Occupation code  Annual wage
11-3021          $128,805
15-0000          $113,689
15-1111          $39,180
Table 2:
Occupation code  Annual wage
11-3021          $93,805
15-0000          $83,472
15-1111          $39,180
Synthesizing N-ary Relations from Web Tables
– Exploiting page context: Title and URL
– Stitching tables from the same site
Oliver Lehmberg, Christian Bizer: Synthesizing N‐ary Relations from Web Tables. WIMS2019.
Oliver Lehmberg, Christian Bizer: Stitching web tables for improving matching quality. VLDB Endowment, 2017.
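A small Python sketch can show why the two wage tables clash when read as a binary relation, and why a context attribute resolves the clash. It is an illustration only, not the synthesis method of the cited papers. The function `conflicting_keys` and the `page` column with its values "page A" and "page B" are assumptions standing in for context taken from the page title or URL; the wages are the figures from the example tables.

```python
def conflicting_keys(rows, key_columns, value_column):
    """Return the keys that map to more than one distinct value."""
    values_per_key = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        values_per_key.setdefault(key, set()).add(row[value_column])
    return {key for key, values in values_per_key.items() if len(values) > 1}


# Rows of both tables, stitched together with a hypothetical context column.
rows = [
    {"page": "page A", "occupation_code": "11-3021", "annual_wage": 128805},
    {"page": "page A", "occupation_code": "15-0000", "annual_wage": 113689},
    {"page": "page A", "occupation_code": "15-1111", "annual_wage": 39180},
    {"page": "page B", "occupation_code": "11-3021", "annual_wage": 93805},
    {"page": "page B", "occupation_code": "15-0000", "annual_wage": 83472},
    {"page": "page B", "occupation_code": "15-1111", "annual_wage": 39180},
]

# Read as a binary relation (code -> wage), two codes have contradictory wages.
binary_conflicts = conflicting_keys(rows, ["occupation_code"], "annual_wage")
# With the context attribute as part of the key, the contradictions disappear.
nary_conflicts = conflicting_keys(rows, ["occupation_code", "page"], "annual_wage")
```

Fusing the binary reading would force a choice between two wages that are each correct in their own context, which is one way attribute semantics get misinterpreted.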
Amount of N‐ary Relations
Oliver Lehmberg, Christian Bizer: Profiling the Semantics of N‐ary Web Table Data. SBD Workshop @ SIGMOD 2019.
Results for the WDC2015 corpus:
– 37.50% of all relations are non‐binary
– 50.47% of these require context attributes
– Time dimension is often part of keys, but so are many other attributes
Task 2: Add Long‐Tail Entities 
to Knowledge Graph
[Figure: matrix for a class in the KB. Rows are entities (E1 … Em), columns are attributes (A1 … AN); "?" cells mark rows and columns to be added]
Entity Expansion:
1. Find new entities
2. Compile descriptions of these new entities
Long‐Tail Entity Expansion Pipeline
[Figure: pipeline taking web tables and the knowledge base through Schema Matching, Row Clustering, Entity Creation, and New Detection; new entities are added to the knowledge base]
– Output of first iteration used to refine the schema mapping in a second iteration
1. Cluster rows that describe the same instance together
2. Create entities by fusing rows in clusters
3. Determine which entities describe new instances by comparing them to instances in KB
Yaser Oulabi, Christian Bizer: Extending cross-domain knowledge bases with long tail entities using web table data. EDBT 2019.
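The three steps of row clustering, entity creation, and new detection can be sketched as follows. This is a deliberately naive Python illustration, not the pipeline of the cited paper: it clusters rows by exact normalized label and detects new entities by label lookup, where a real system would use learned similarity measures. The names `normalize` and `expand_entities`, the row layout, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(label):
    """Lowercase a label and collapse whitespace."""
    return " ".join(label.lower().split())


def expand_entities(rows, kb_labels):
    """Return fused descriptions of entities that are not yet in the KB.

    rows: list of dicts with a "label" key plus attribute/value pairs.
    kb_labels: labels of instances that already exist in the knowledge base.
    """
    # 1. Row clustering: group rows whose normalized labels are identical.
    clusters = defaultdict(list)
    for row in rows:
        clusters[normalize(row["label"])].append(row)

    # 2. Entity creation: fuse each cluster by majority vote per attribute.
    entities = {}
    for key, members in clusters.items():
        attributes = {a for member in members for a in member if a != "label"}
        fused = {}
        for attribute in sorted(attributes):
            votes = Counter(m[attribute] for m in members if attribute in m)
            fused[attribute] = votes.most_common(1)[0][0]
        entities[key] = fused

    # 3. New detection: keep only entities without a matching KB label.
    known = {normalize(label) for label in kb_labels}
    return {key: desc for key, desc in entities.items() if key not in known}


# Made-up web table rows and KB content for illustration.
rows = [
    {"label": "Song A", "artist": "X"},
    {"label": "song  a", "artist": "X"},
    {"label": "song a", "artist": "Y"},
    {"label": "Song B", "artist": "Z"},
]
new_entities = expand_entities(rows, kb_labels=["Song B"])
```

Here the three rows for "Song A" collapse into one entity whose artist is decided by vote, and "Song B" is dropped because the KB already contains it.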
Entity Expansion Results
Class            New Entities     New Facts        New Entities Precision (gold standard)  New Facts Precision (gold standard)
Football Player  13,983 (+67%)    43,800 (+32%)    0.82                                     0.85
Song             186,943 (+356%)  393,711 (+125%)  0.72                                     0.79
[Figure: bar chart of existing vs. new entities for Football Player and Song]
DBpedia has low coverage for songs, thus it is easy to find new ones
Conclusion: Web Tables
– Extremely wide range of very specific attributes
– many n‐ary relations with complex keys / time dimension
– table context is often required to understand the data
– Coverage of long tail entities beyond Wikipedia
– Things work OK, but not good enough yet
– fact precision slot filling: 0.64 (without stitching)
– new entities accuracy: 0.72-0.82
– For researchers: Rewarding but challenging domain
– For practitioners: Human in the loop still required for verification
4. Schema.org Annotations
– Search engines have asked site owners since 2011 to annotate data for enriching search results
– 675 Types: Event, Place, Local Business, Product, Review, Person
– Encoding: Microdata, RDFa, JSON-LD
Usage of Schema.org Data
– Data snippets within search results
– Local businesses on maps
– Data snippets in knowledge panels
Web Data Commons – Structured Data
– extracts all Microformat, Microdata, RDFa, JSON-LD data from the Common Crawl
– analyzes and provides the extracted data for download
– statistics of some extraction runs
– 2018 CC Corpus: 2.5 billion HTML pages → 31.5 billion RDF triples
– 2017 CC Corpus: 3.1 billion HTML pages → 38.2 billion RDF triples
– 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
– 2010 CC Corpus: 2.8 billion HTML pages → 5.1 billion RDF triples
– Download
– http://webdatacommons.org/structureddata/
Overall Adoption 2018
944 million HTML pages out of the 2.5 billion pages provide semantic annotations (37.1%).
9.6 million pay-level-domains (PLDs) out of the 32.8 million pay-level-domains covered by the crawl provide semantic annotations (29.3%).
http://webdatacommons.org/structureddata/2018-12/
Frequently used Schema.org Classes
Top Classes  # Websites (PLDs) Microdata  # Websites (PLDs) JSON-LD
schema:WebPage 1,124,583 121,393
schema:Product 812,205 40,169
schema:Offer 676,899 57,756
schema:BreadcrumbList 621,344 205,971
schema:Article 612,361 57,082
schema:Organization 510,069 1,349,775
schema:PostalAddress 502,615 176,500
schema:ImageObject 360,875 111,946
schema:Blog 337,843 12,174
schema:Person 324,349 335,784
schema:LocalBusiness 294,390 249,017
schema:AggregateRating 258,078 23,105
schema:Review 124,022 6,622
schema:Place 92,127 66,396
schema:Event 88,130 63,605
http://webdatacommons.org/structureddata/2018-12/
Entities in WDC Dataset
for KG Completion
http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html
Class  # Entities Names
Product 368,386,801	
Organization 176,365,220	
Local Business 44,100,625	
Place 31,748,612
Event 26,035,295	
Movie 5,233,520
Hotel 2,537,089
College	or	University 2,392,797
Recipe	 2,036,417
Music	Album 1,594,941
Restaurant 1,480,443
Airport 1,145,323
as Common Crawl is rather shallow
Use Case 1: Completing KG Describing
Books and Movies
– Results of KnowMore experiments
– Entity matching F1: book 0.88, movie 0.83
– New fact precision: book 0.88, movie 0.95
– Density increase for properties of existing entities
Yu, et al.: KnowMore–knowledge base augmentation with structured web markup. SWJ 2019.
[Figure: density increase per property, shown for Movies and Books]
Use Case 2: Completing Product KGs
Top Attributes                  # PLDs (Microdata)  %
schema:Product/name             754,812             92 %
schema:Product/offers           645,994             79 %
schema:Offer/price              639,598             78 %
schema:Offer/priceCurrency      606,990             74 %
schema:Product/image            573,614             70 %
schema:Product/description      520,307             64 %
schema:Offer/availability       477,170             58 %
schema:Product/url              364,889             44 %
schema:Product/sku              160,343             19 %
schema:Product/aggregateRating  141,194             17 %
schema:Product/brand            113,209             13 %
schema:Product/category         62,170              7 %
schema:Product/productID        47,088              5 %
…                               …                   …
http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
Example annotated values:
– "The Samsung Galaxy S4 is the entertaining and helpful companion for your mobile life. It connects you with your loved ones. It lets you experience and capture unforgettable moments together. It simplifies your everyday life."
– UPC 610214632623
– 000214632623
HTML Tables Containing
Product Specifications
[Figure: product page with s:name and s:breadcrumb annotations and the specifications as an HTML table]
Qiu, et al.: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. VLDB 2015.
Petrovski, et al.: The WDC Gold Standards for Product Feature Extraction and Product Matching. ECWeb 2016.
Use Product ID Annotations
as Supervision for Product Matching
– Some e‐shops annotate product IDs
– Most e‐shops do not 
Properties  # PLDs  %
schema:Product/name 754,812 92	%
schema:Product/description 520,307 64	%
schema:Product/sku 160,343	 19	%
schema:Product/productID 47,088 5	%
schema:Product/mpn 12,882 1.6%
schema:Product/gtin13 7,994 1%
Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
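How annotated product IDs turn into supervision can be sketched briefly in Python. This is an illustration under assumptions, not the procedure of the cited paper: offers sharing an ID become positive pairs and offers with different IDs become negative pairs. The function name `labeled_pairs` and the sample offers are hypothetical.

```python
from itertools import combinations


def labeled_pairs(offers):
    """Build training pairs from offers that carry a product ID.

    offers: list of (offer_id, product_id) tuples.
    Returns (offer_a, offer_b, label) triples; label 1 means same product.
    """
    pairs = []
    for (offer_a, id_a), (offer_b, id_b) in combinations(offers, 2):
        pairs.append((offer_a, offer_b, 1 if id_a == id_b else 0))
    return pairs


# Made-up offers from different shops; only the IDs matter for labeling.
offers = [("o1", "p1"), ("o2", "p1"), ("o3", "p2")]
pairs = labeled_pairs(offers)
```

Enumerating all pairs is quadratic, so a real corpus would need sampling or blocking, and noisy IDs would need cleaning before the pairs are used to train a matcher.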
Combine Specification Table Content 
from All Offers for the Same Product
[Figure: product offers grouped into clusters, a learned matcher, and additional offers assigned to the clusters]
1. Clusters of offers having the same product ID
2. Learn matcher
3. Match additional offers without IDs
4. Combine spec table content of matching offers
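The last step, combining specification tables across offers for the same product, can be sketched as follows. This minimal Python example covers only offers that already carry a product ID; the learned matcher for offers without IDs is left out. The function `combine_specs`, the dictionary layout, and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def combine_specs(offers):
    """Merge spec tables of all offers that share a product ID.

    offers: list of dicts with "product_id" (may be None) and "specs"
            (attribute -> value). Offers without an ID are skipped here.
    Returns product_id -> {attribute: sorted list of distinct values}.
    """
    clusters = defaultdict(list)
    for offer in offers:
        if offer.get("product_id"):
            clusters[offer["product_id"]].append(offer)

    combined = {}
    for product_id, members in clusters.items():
        merged = defaultdict(set)
        for offer in members:
            for attribute, value in offer.get("specs", {}).items():
                merged[attribute].add(value)
        combined[product_id] = {a: sorted(v) for a, v in merged.items()}
    return combined


# Made-up offers: two share an ID, one has no ID annotation.
offers = [
    {"product_id": "p1", "specs": {"Display": "5 inch"}},
    {"product_id": "p1", "specs": {"Display": "5 inch", "Weight": "130 g"}},
    {"product_id": None, "specs": {"Display": "4 inch"}},
]
combined = combine_specs(offers)
```

Keeping all distinct values per attribute postpones conflict resolution; a fusion step such as voting could then pick one value per attribute.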
Use Case 3: Distant Supervision for 
Information Extraction
– Goal: Learn how to extract data from websites that do not provide schema.org annotations using annotations as training data.
– Example: Small Events Paper @ SIGIR 2015
– extract data about small venue concerts, theatre performances, garage sales, etc.
– they use 217,000 schema.org events from ClueWeb2012 as distant supervision, e.g.
Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
<div itemscope itemtype="http://schema.org/MusicEvent">
<h1 itemprop="name">School Cool Concert</h1>
<p itemprop="startDate">25.09.2018 15:00</p>
<p itemprop="location" itemscope itemtype="http://schema.org/Place">School Gym Puero High</p>
</div>
Small Events Paper @ SIGIR 2015
– Approach
– train an information extraction method that combines structural and text features
– extracted attributes: What? When? Where?
– apply method to pages without semantic annotations
– Result
– extract 200,000 additional small-scale events from ClueWeb2012
– double recall at 0.85 precision
Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
Use Case 4: Training Data for Taxonomy 
Matching and Taxonomy Clustering
– E‐Shops annotate their product categorization
– The marketing ideas behind these categorizations differ widely
Relevant Properties            # PLDs   %
schema:Offer/category          62,170   7 % of all shops
schema:WebPage/BreadcrumbList  621,344  11 % of all sites
Apparel > Summer > Mens > Jerseys
Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jersey > over $60 s
Home > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables
Shop > Tables > Dining Tables > Aland Wood Tables
Meusel, et al.: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. EC-Web 2015.
Cool Taxonomy Clustering Problem
– How to cluster 62,000 taxonomies in order to discover the most frequent categorization ideas within a domain?
– How do shops categorize action cameras?
– What subcategories are frequently used for action cameras?
Learn How to Match Product Taxonomies
using Schema.org Data as Supervision 
[Figure: Category A (products P1, P2) and Category B (products P1, P2, P3) come from taxonomies that share some products. Such pairs are used to learn a matcher, which then decides whether unseen categories are similar]
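One simple way to turn shared products into a supervision signal is to measure the overlap between the product sets of two categories. The Python sketch below uses Jaccard similarity for this; the choice of measure and the function name `category_overlap` are assumptions, not something the slides specify.

```python
def category_overlap(products_a, products_b):
    """Jaccard similarity between the product sets of two categories."""
    set_a, set_b = set(products_a), set(products_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Product sets as in the figure: Category A holds P1, P2; Category B holds P1, P2, P3.
overlap = category_overlap(["P1", "P2"], ["P1", "P2", "P3"])
```

Category pairs with high overlap could be labeled as matches and pairs with no overlap as non-matches, so that a matcher trained on category names and paths can be applied to unseen categories without shared products.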
Conclusions
1. Internal Knowledge Graph Completion
– Interesting, but not ready for prime time yet
– Focus on link prediction, hardly any results on other tasks 
2. KG Completion using Small Sets of External Sources
– Production quality results for all tasks
– Manual effort per source required
– Bottleneck is the availability and costs of external data sources
3. KG Completion using Web Table Data
– Wide coverage of long‐tail entities
– Wide coverage of long‐tail properties
– Difficult to get property semantics right (page context required)
4. KG Completion using Schema.org Annotations
– Comprehensive coverage of specific domains
– Potential as training data for different KG completion tasks
Thank you.
– Web Table Corpus
http://webdatacommons.org/webtables/ 
– Semantic Annotations
http://webdatacommons.org/structureddata/
– Product Matching Training Data and Goldstandard
http://webdatacommons.org/largescaleproductcorpus/v2/
53Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.26

Más contenido relacionado

Similar a JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web

STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...Michele Pasin
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Sören Auer
 
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...NOVA DATASCIENCE
 
Jabes 2021 - Poster "Bibliothèques de recherche - Compétences en science ouve...
Jabes 2021 - Poster "Bibliothèques de recherche - Compétences en science ouve...Jabes 2021 - Poster "Bibliothèques de recherche - Compétences en science ouve...
Jabes 2021 - Poster "Bibliothèques de recherche - Compétences en science ouve...ABES
 
Data base management system LAB MANUAL KCS 551.pdf
Data base management system LAB MANUAL KCS 551.pdfData base management system LAB MANUAL KCS 551.pdf
Data base management system LAB MANUAL KCS 551.pdfVandanaTripathi32
 
Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021hala Skaf
 
Using Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsUsing Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsNeo4j
 
The Generative AI System Shock, and some thoughts on Collective Intelligence ...
The Generative AI System Shock, and some thoughts on Collective Intelligence ...The Generative AI System Shock, and some thoughts on Collective Intelligence ...
The Generative AI System Shock, and some thoughts on Collective Intelligence ...Simon Buckingham Shum
 
Knowledge Graph for Cybersecurity: An Introduction By Kabul Kurniawan
Knowledge Graph for Cybersecurity: An Introduction By  Kabul KurniawanKnowledge Graph for Cybersecurity: An Introduction By  Kabul Kurniawan
Knowledge Graph for Cybersecurity: An Introduction By Kabul KurniawanKabul Kurniawan
 
Analysis on the Demand of Top Talent Introduction in Big Dat
Analysis on the Demand of Top Talent Introduction in Big DatAnalysis on the Demand of Top Talent Introduction in Big Dat
Analysis on the Demand of Top Talent Introduction in Big DatMadonnaJacobsenfp
 
Analysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxAnalysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxgreg1eden90113
 
Analysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxAnalysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxjack60216
 
Analysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxAnalysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxSHIVA101531
 
Analysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxAnalysis on the Demand of Top Talent Introduction in Big Dat.docx
Analysis on the Demand of Top Talent Introduction in Big Dat.docxdaniahendric
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
 
Building Effective Visualization Shiny WVF
Building Effective Visualization Shiny WVFBuilding Effective Visualization Shiny WVF
Building Effective Visualization Shiny WVFOlga Scrivner
 
cs690l-syl.doc.doc
cs690l-syl.doc.doccs690l-syl.doc.doc
cs690l-syl.doc.docbutest
 
PPOL 650Issue Analysis Paper InstructionsYou will submit a p.docx
PPOL 650Issue Analysis Paper InstructionsYou will submit a p.docxPPOL 650Issue Analysis Paper InstructionsYou will submit a p.docx
PPOL 650Issue Analysis Paper InstructionsYou will submit a p.docxChantellPantoja184
 

Similar a JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web (20)

FDS_dept_ppt.pptx
FDS_dept_ppt.pptxFDS_dept_ppt.pptx
FDS_dept_ppt.pptx
 
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...
 
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web

  • 1. Data and Web Science Group. Completing Knowledge Graphs using Data from the Open Web. Prof. Dr. Christian Bizer. JIST2019, Nov. 25-27, 2019, Hangzhou, China.
  • 2. Knowledge Graphs – Cross-Domain Knowledge Graphs – Domain-Specific Knowledge Graphs: Product Knowledge Graphs, Scholarly Knowledge Graphs, Life Science Knowledge Graphs, Enterprise Knowledge Graphs. Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.26
  • 3. Knowledge Graphs are Incomplete. [Figure: matrix for a class in the KG, with properties P1 … PN as columns, entities E1 … Em as rows, and facts as cells; question marks indicate missing cells, rows, and columns.] Source: Luna Dong
  • 4. Types of Incompleteness. [Figure: the same class matrix, annotated with three kinds of gaps.] – Relevant facts are missing – Relevant properties are missing in the schema – Relevant entities are not described
  • 5. Knowledge Graph Completion Tasks. [Figure: the same class matrix, with each region mapped to a task.] – Link Prediction: add object property values for existing instances and properties – Literal Slot Filling: add datatype property values for existing instances and properties – Schema Expansion: add and fill new properties – Entity Expansion: add new entities and their descriptions (values for existing properties)
  • 6. Outline 1. Internal Knowledge Graph Completion 2. KG Completion using Small Sets of External Sources 3. KG Completion using Web Table Data – Profile of HTML Tables on the Web – Potential for KG Completion 4. KG Completion using Schema.org Annotations – Profile of Schema.org Annotations on the Web – Potential for KG Completion 5. Conclusions
  • 7. 1. Internal Knowledge Graph Completion – Predict missing facts by exploiting patterns in the graph – Classic value imputation applied to knowledge graphs – Symbolic methods – Rule learning, e.g. AMIE, AnyBURL – Heuristics, e.g. SDType – Sub-symbolic methods – Knowledge Graph Embeddings. Paulheim: Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. SWJ 2017. Nguyen: An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion. arXiv, 2019.
  • 8. Knowledge Graph Embeddings: Hits@10 Results on Link Prediction Task. Nguyen: An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion. arXiv, 2019.
  • 9. Weaknesses of Current Research on Knowledge Graph Embeddings 1. Only link prediction is covered – What about literal slot filling? Entity expansion? Schema expansion? 2. Evaluations are self-referential – Hits@10 is not relevant for KB completion – Hits@1 performance is too low for practical use – Accuracy is evaluated on a simplified dataset: FB13, 6 relations dropped! Hits@1 on FB15K-237: ComplEx (2018) 0.26; ConvE (2018) 0.24; TuckER (2019) 0.26; AnyBURL (2019, symbolic rule learner) 0.23. Meilicke, et al.: Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion. Wang, Ruffinelli: On Evaluating Embedding Models for Knowledge Base Completion. RepL4NLP-2019.
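The Hits@k metric criticized on this slide can be made concrete with a minimal sketch. It assumes that each test triple has already been assigned the rank of its correct entity among all candidates (rank 1 is best); the `hits_at_k` function and the sample ranks are illustrative and do not come from the talk.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks of the correct answer for eight test triples.
example_ranks = [1, 3, 12, 1, 7, 45, 2, 10]

hits1 = hits_at_k(example_ranks, 1)    # only rank-1 predictions count
hits10 = hits_at_k(example_ranks, 10)  # any top-10 prediction counts
print(f"Hits@1 = {hits1:.2f}, Hits@10 = {hits10:.2f}")
```

A completion system would typically add only its top-ranked prediction to the graph, which is one way to read the slide's point that Hits@1, not Hits@10, is the figure that matters in practice.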
  • 10. 2. KG Completion Using a Small Set of External Sources – Open License Sources – Commercial Sources [Logos of example sources, e.g. Baike]
  • 11. Integration Process – Classic data integration techniques apply – Small set of sources – we can invest effort into the integration of each data source – write rules, label training data – quality depends on the effort spent – All tasks are covered – link prediction, slot filling, entity expansion, schema expansion. Process steps: Data Collection / Extraction, Schema Mapping, Data Translation, Entity Matching, Data Quality Assessment, Data Fusion. Doan, Halevy, Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
  • 12. KG Completion Example – YAGO2 completed using Geonames – 7 million additional entities from Geonames – 10 additional object properties (relations) – 320 million triples from Geonames – Accuracy of spatial relations (manual evaluation). Hoffart, et al.: YAGO2: A spatially and temporally enhanced knowledge base extracted from Wikipedia. JWS 2013.
  • 13. Challenges 1. Coverage of external sources – Relevant entities and relevant properties might not be covered – e.g. Wikipedia 2. Up-to-dateness of external sources – Are sources regularly maintained? – e.g. LOD cloud, research data 3. Cost of external sources – Complete and up-to-date data often costs money – Large companies can pay, other organizations often cannot
  • 14. Let's use the Open Web
  • 15. Let's use the Open Web – Potentials 1. Coverage of long-tail entities 2. Coverage of long-tail properties 3. Content often up-to-date – Challenges 1. Heterogeneity 2. Data quality depends on website
  • 16. Web Tables
  • 17. Schema.org Annotations <div itemscope itemtype="http://schema.org/Hotel"> <span itemprop="name">Vienna Marriott Hotel</span> <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"> <span itemprop="streetAddress">Parkring 12a</span> <span itemprop="addressLocality">Vienna</span> </span> </div> <div itemscope itemtype="http://schema.org/Product"> <h1 itemprop="name">Signature Instant Dome 7</h1> Item # <span itemprop="productID">200734246807</span> <img itemprop="image" src="../20073424680.jpg"> <span itemprop="review" itemscope itemtype="http://schema.org/Review"> <span itemprop="ratingValue">5</span> <span itemprop="reviewBody">Great waterproof tent!</span> </span> </div>
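To show how such annotations turn into property/value pairs, here is a minimal sketch using only Python's standard `html.parser`. The `ItemPropCollector` class is a hypothetical helper written for this illustration; it ignores item nesting and is not the extraction framework used by Web Data Commons or by search engines.

```python
from html.parser import HTMLParser

# Hotel snippet modeled on the slide; `itemscope` is included because
# Microdata requires it on elements that carry `itemtype`.
HOTEL_HTML = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
  </span>
</div>
"""


class ItemPropCollector(HTMLParser):
    """Collects flat (itemprop, value) pairs; nesting is ignored for brevity."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_prop = None  # itemprop still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        prop = attr.get("itemprop")
        if prop is None:
            return
        if "itemtype" in attr:
            # The value is a nested item: record its type instead of text.
            self.pairs.append((prop, attr["itemtype"]))
            self._open_prop = None
        elif "src" in attr:
            # Elements such as <img> carry their value in the src attribute.
            self.pairs.append((prop, attr["src"]))
        else:
            self._open_prop = prop

    def handle_data(self, data):
        text = data.strip()
        if self._open_prop and text:
            self.pairs.append((self._open_prop, text))
            self._open_prop = None


collector = ItemPropCollector()
collector.feed(HOTEL_HTML)
for prop, value in collector.pairs:
    print(f"{prop}: {value}")
```

A production extractor would also track which item each property belongs to, so that `streetAddress` is attached to the `PostalAddress` item rather than to the hotel.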
  • 18. 3. Web Tables. In a corpus of 14B raw tables, 154M are relational tables (1.1%) (Cafarella et al. 2008); these relational tables are our focus. Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008. Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
  • 19. Web Data Commons – Web Tables Corpus – extracted from Common Crawl 2015 – 1.78 billion pages – 10.2 billion raw HTML tables – Size of the web table corpus – 90 million relational tables (0.9%) – 139 million attribute/value tables (1.3%) – extraction used 100 machines on Amazon EC2 – approx. 2,000 machine hours, about 250 € – Download – http://webdatacommons.org/webtables/ Lehmberg, Ritze, Meusel, Bizer: A Large Public Corpus of Web Tables containing Time and Context Metadata. WWW2016 Companion.
  • 20. Task 1: Slot Filling and Link Prediction. [Figure: matrix for a DBpedia class with attributes A1 … AN as columns and entities E1 … Em as rows; question marks indicate missing values of existing entities and attributes.]
  • 21. Two Step Process 1. Matching 2. Data Fusion [Figure label: IATA Code]
  • 22. Holistic Entity- and Schema Matching – T2K+ matches millions of web tables to the DBpedia knowledge base – Exploits – wide range of features – synergies between matching tasks – table corpus for indirect matching – patterns from KB for post filtering – Results (Task: Precision / Recall / F1): Class .94 / .94 / .94; Instance .90 / .76 / .82; Property .77 / .65 / .70. Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia – A Feature Utility Study. EDBT 2017.
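As a quick consistency check, the F1 column can be recomputed from the reported precision and recall. The sketch below uses the standard harmonic-mean definition of F1; the `f1_score` helper is written for this illustration, and the numbers are the rounded values from the slide.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per matching task, as listed on the slide.
reported = {
    "Class": (0.94, 0.94, 0.94),
    "Instance": (0.90, 0.76, 0.82),
    "Property": (0.77, 0.65, 0.70),
}

for task, (p, r, f1) in reported.items():
    print(f"{task}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```

The gap between instance and property matching is visible in both precision and recall, so the lower property F1 is not the result of a single weak component.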
  • 23. ISWC2019 Challenge on Tabular Data to Knowledge Graph Matching – Multiple tasks – CPA task = property matching. http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
  • 24. Large-Scale Matching Results – 1 million out of 30 million tables match DBpedia (~3%) – 301,450 matching tables have attribute correspondences (~1%) – Results in 8 million triples describing 717,000 DBpedia entities. Ritze, et al.: Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases. WWW2016.
  • 25. Data Fusion Results – Fusion Results (Fusion Strategy: Precision / Recall / F1): Voting .369 / .823 / .509; PageRank .365 / .814 / .504; Knowledge-based Trust .639 / .785 / .705 – Group Sizes – Large groups (Head): many values already exist in the KB – Small groups (Tail): often new values, but difficult to fuse correctly
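The voting strategy from the table can be sketched as follows. This is a simplified illustration, not the implementation evaluated in the talk: candidate values are assumed to be already matched to a KB entity and property, and the airport rows are hypothetical example data.

```python
from collections import Counter, defaultdict


def fuse_by_voting(candidates):
    """Pick the most frequent value per (entity, property) group.

    Returns {(entity, property): (value, votes, group_size)}.
    """
    groups = defaultdict(list)
    for entity, prop, value in candidates:
        groups[(entity, prop)].append(value)

    fused = {}
    for key, values in groups.items():
        value, votes = Counter(values).most_common(1)[0]
        fused[key] = (value, votes, len(values))
    return fused


# Hypothetical values extracted from matched web table cells.
candidates = [
    ("Frankfurt Airport", "iataCode", "FRA"),
    ("Frankfurt Airport", "iataCode", "FRA"),
    ("Frankfurt Airport", "iataCode", "FRF"),      # a typo on one website
    ("Mannheim City Airport", "iataCode", "MHG"),  # tail entity, one source
]

fused = fuse_by_voting(candidates)
for (entity, prop), (value, votes, size) in fused.items():
    print(f"{entity} {prop} = {value} ({votes} of {size} votes)")
```

In the head group the typo is outvoted, while the tail group has a single candidate and voting has nothing to decide, which is one way to see why small groups are hard to fuse correctly.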
  • 26. What is going wrong? Topical Mismatch. Websites contributing many tables (website: # tables, topic): apple.com: 50,910, Music; baseball-reference.com: 25,647, Sports; latestf1news.com: 17,726, Sports; nascar.com: 17,465, Sports; amazon.com: 16,551, Products; wikipedia.org: 13,993, Various; inkjetsuperstore.com: 12,282, Products; flightmemory.com: 8,044, Flights; windshieldguy.com: 7,305, Products; citytowninfo.com: 6,293, Cities; blogspot.com: 4,762, Various; 7digital.com: 4,462, Music
  • 27. What is going wrong? Attribute Semantics. https://www.bls.gov/oes/2017/may/oes_ca.htm
  • 28. Misinterpretation of Attribute Semantics – Matching and fusion methods expect binary relations – But many relations in Web tables are not binary! Example: two tables with the same columns (Occupation code, Annual wage) but different values. Table 1: 11-3021 $128,805; 15-0000 $113,689; 15-1111 $39,180. Table 2: 11-3021 $93,805; 15-0000 $83,472; 15-1111 $39,180.
  • 29. Synthesizing N-ary Relations from Web Tables – Exploiting page context: Title and URL – Stitching tables from the same site. Oliver Lehmberg, Christian Bizer: Synthesizing N-ary Relations from Web Tables. WIMS2019. Oliver Lehmberg, Christian Bizer: Stitching web tables for improving matching quality. VLDB Endowment, 2017.
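The effect of page context can be illustrated with the two occupation/wage tables from the previous slide. The sketch assumes that a context attribute (here a hypothetical `region` value, as might be derived from a page title or URL) is available for each table; the slide does not state which page each table came from, so "region A" and "region B" are placeholders.

```python
from collections import defaultdict

# Wage values as shown on the previous slide; the context values are assumptions.
tables = [
    {"context": {"region": "region A"},
     "rows": [("11-3021", 128805), ("15-0000", 113689), ("15-1111", 39180)]},
    {"context": {"region": "region B"},
     "rows": [("11-3021", 93805), ("15-0000", 83472), ("15-1111", 39180)]},
]


def conflicting_keys(tables, use_context):
    """Return the keys that map to more than one distinct wage."""
    wages_by_key = defaultdict(set)
    for table in tables:
        for code, wage in table["rows"]:
            if use_context:
                key = (code, table["context"]["region"])  # n-ary key
            else:
                key = (code,)                             # binary reading
            wages_by_key[key].add(wage)
    return sorted(key for key, wages in wages_by_key.items() if len(wages) > 1)


binary_conflicts = conflicting_keys(tables, use_context=False)
nary_conflicts = conflicting_keys(tables, use_context=True)
print("conflicts without context:", binary_conflicts)
print("conflicts with context:", nary_conflicts)
```

Read as a binary relation, the two tables contradict each other for two occupation codes; once the context attribute becomes part of the key, both tables are consistent.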
  • 30. Amount of N-ary Relations. Results for the WDC2015 corpus: – 37.50% of all relations are non-binary – 50.47% of these require context attributes. The time dimension is often part of the key, but so are many other attributes. Oliver Lehmberg, Christian Bizer: Profiling the Semantics of N-ary Web Table Data. SBD Workshop @ SIGMOD 2019.
  • 31. Task 2: Add Long-Tail Entities to Knowledge Graph. [Figure: matrix for a class in the KB, with question marks for rows of entities that are not yet described.] Entity Expansion: 1. Find new entities 2. Compile descriptions of these new entities
  • 32. Long-Tail Entity Expansion Pipeline. Pipeline steps: Schema Matching, Row Clustering, Entity Creation, New Detection. Inputs: web tables and the knowledge base; output: new entities added to the knowledge base. 1. Cluster rows that describe the same instance together 2. Create entities by fusing rows in clusters 3. Determine which entities describe new instances by comparing them to instances in the KB. The output of the first iteration is used to refine the schema mapping in a second iteration. Yaser Oulabi, Christian Bizer: Extending cross-domain knowledge bases with long tail entities using web table data. EDBT 2019.
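The three numbered steps can be sketched in a few lines. This is a deliberately simplified illustration and not the method from the cited paper: rows are assumed to be already mapped to a common schema, clustering and new detection use exact matches on a normalized label, and the song rows are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(label):
    """Lowercase and collapse whitespace so that label variants compare equal."""
    return " ".join(label.lower().split())


def cluster_rows(rows):
    """Step 1: group rows that (by label) describe the same instance."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[normalize(row["label"])].append(row)
    return list(clusters.values())


def create_entity(cluster):
    """Step 2: fuse the rows of one cluster, taking the most frequent value."""
    entity = {"label": cluster[0]["label"]}
    properties = {p for row in cluster for p in row if p != "label"}
    for prop in sorted(properties):
        values = [row[prop] for row in cluster if prop in row]
        entity[prop] = Counter(values).most_common(1)[0][0]
    return entity


def detect_new(entities, kb_labels):
    """Step 3: keep entities whose label is not yet in the knowledge base."""
    known = {normalize(label) for label in kb_labels}
    return [e for e in entities if normalize(e["label"]) not in known]


# Hypothetical web table rows and knowledge base content.
rows = [
    {"label": "Song A", "artist": "Artist X"},
    {"label": "song a", "artist": "Artist X", "year": "2001"},
    {"label": "Song B", "artist": "Artist Y"},
]
kb_labels = ["Song B"]

entities = [create_entity(cluster) for cluster in cluster_rows(rows)]
new_entities = detect_new(entities, kb_labels)
print(new_entities)
```

Real long-tail data needs similarity-based clustering and matching instead of exact label comparison, since different entities often share a label and the same entity appears under several spellings.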
  • 33. Entity Expansion Results (Class: New Entities; New Facts; New Entities Precision on gold standard; New Facts Precision on gold standard): Football Player: 13,983 (+67%); 43,800 (+32%); 0.82; 0.85. Song: 186,943 (+356%); 393,711 (+125%); 0.72; 0.79. [Bar chart: existing vs. new entities for Football Player and Song.] DBpedia has low coverage for songs, thus it is easy to find new ones.
  • 34. Conclusion: Web Tables – Extremely wide range of very specific attributes – many n-ary relations with complex keys / time dimension – table context is often required to understand the data – Coverage of long-tail entities beyond Wikipedia – Things work OK, but not good enough yet – fact precision for slot filling: 0.64 (without stitching) – new entities accuracy: 0.72-0.82 – For researchers: rewarding but challenging domain – For practitioners: human in the loop still required for verification
  • 35. 4. Schema.org Annotations
    – site owners have been asked since 2011 to annotate data for enriching search results
    – 675 Types: Event, Place, Local Business, Product, Review, Person
    – Encoding: Microdata, RDFa, JSON-LD
  • 36. Usage of Schema.org Data
    – Data snippets within search results
    – Local businesses on maps
    – Data snippets in knowledge panels
  • 37. Web Data Commons – Structured Data
    – extracts all Microformat, Microdata, RDFa, JSON-LD data from the Common Crawl
    – analyzes and provides the extracted data for download
    – statistics of some extraction runs
      – 2018 CC Corpus: 2.5 billion HTML pages → 31.5 billion RDF triples
      – 2017 CC Corpus: 3.1 billion HTML pages → 38.2 billion RDF triples
      – 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
      – 2010 CC Corpus: 2.8 billion HTML pages → 5.1 billion RDF triples
    – Download: http://webdatacommons.org/structureddata/
  • 38. Overall Adoption 2018
    – 944 million HTML pages out of the 2.5 billion pages provide semantic annotations (37.1%).
    – 9.6 million pay-level-domains (PLDs) out of the 32.8 million pay-level-domains covered by the crawl provide semantic annotations (29.3%).
    Source: http://webdatacommons.org/structureddata/2018-12/
  • 39. Frequently used Schema.org Classes
    Top classes by number of websites (PLDs):
    Class | Microdata | JSON-LD
    schema:WebPage | 1,124,583 | 121,393
    schema:Product | 812,205 | 40,169
    schema:Offer | 676,899 | 57,756
    schema:BreadcrumbList | 621,344 | 205,971
    schema:Article | 612,361 | 57,082
    schema:Organization | 510,069 | 1,349,775
    schema:PostalAddress | 502,615 | 176,500
    schema:ImageObject | 360,875 | 111,946
    schema:Blog | 337,843 | 12,174
    schema:Person | 324,349 | 335,784
    schema:LocalBusiness | 294,390 | 249,017
    schema:AggregateRating | 258,078 | 23,105
    schema:Review | 124,022 | 6,622
    schema:Place | 92,127 | 66,396
    schema:Event | 88,130 | 63,605
    Source: http://webdatacommons.org/structureddata/2018-12/
  • 40. Entities in WDC Dataset for KG Completion
    Class | # Entities Names
    Product | 368,386,801
    Organization | 176,365,220
    Local Business | 44,100,625
    Place | 31,748,612
    Event | 26,035,295
    Movie | 5,233,520
    Hotel | 2,537,089
    College or University | 2,392,797
    Recipe | 2,036,417
    Music Album | 1,594,941
    Restaurant | 1,480,443
    Airport | 1,145,323
    as Common Crawl is rather shallow
    Source: http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html
  • 41. Use Case 1: Completing KG Describing Books and Movies
    – Results of KnowMore experiments
      – Entity matching F1: book 0.88, movie 0.83
      – New fact precision: book 0.88, movie 0.95
    – Density increase for properties of existing entities
    [Figure: density increase per property, shown for movies and books]
    Yu, et al.: KnowMore – knowledge base augmentation with structured web markup. SWJ 2019.
  • 42. Use Case 2: Completing Product KGs
    Top Attributes | # PLDs (Microdata) | %
    schema:Product/name | 754,812 | 92 %
    schema:Product/offers | 645,994 | 79 %
    schema:Offer/price | 639,598 | 78 %
    schema:Offer/priceCurrency | 606,990 | 74 %
    schema:Product/image | 573,614 | 70 %
    schema:Product/description | 520,307 | 64 %
    schema:Offer/availability | 477,170 | 58 %
    schema:Product/url | 364,889 | 44 %
    schema:Product/sku | 160,343 | 19 %
    schema:Product/aggregateRating | 141,194 | 17 %
    schema:Product/brand | 113,209 | 13 %
    schema:Product/category | 62,170 | 7 %
    schema:Product/productID | 47,088 | 5 %
    … | … | …
    Source: http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
    Example product description (translated from German): "The Samsung Galaxy S4 is the entertaining and helpful companion for your mobile life. It connects you with your loved ones. It lets you experience and capture unforgettable moments together. It simplifies your everyday life."
    UPC 610214632623 000214632623
  • 43. HTML Tables Containing Product Specifications
    [Figure: product page with the annotated elements s:name and s:breadcrumb, and the specifications given as an HTML table]
    Qiu, et al.: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. VLDB 2015.
    Petrovski, et al.: The WDC Gold Standards for Product Feature Extraction and Product Matching. EC-Web 2016.
  • 44. Use Product ID Annotations as Supervision for Product Matching
    – Some e-shops annotate product IDs
    – Most e-shops do not
    Properties | # PLDs | %
    schema:Product/name | 754,812 | 92 %
    schema:Product/description | 520,307 | 64 %
    schema:Product/sku | 160,343 | 19 %
    schema:Product/productID | 47,088 | 5 %
    schema:Product/mpn | 12,882 | 1.6 %
    schema:Product/gtin13 | 7,994 | 1 %
    Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
  • 45. Combine Specification Table Content from All Offers for the Same Product
    [Figure: product offers grouped into clusters, a learned matcher, and additional offers assigned to the clusters]
    – Clusters of offers having the same product ID
    – Learn matcher
    – Match additional offers without IDs
    – Combine spec table content of matching offers
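A minimal sketch of the workflow on this slide, under simplifying assumptions: offers are dictionaries with a `title`, an optional `product_id`, and a `specs` dictionary; the learned matcher is replaced by a token-overlap (Jaccard) threshold on titles; and conflicting specification values are resolved by keeping the first value seen. The names `cluster_by_id`, `match_offers_without_ids`, and `combine_specs` and the sample data in the usage are hypothetical.

```python
from collections import defaultdict


def tokens(title):
    """Lowercased word set of an offer title."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_id(offers):
    """Group offers that carry the same annotated product ID."""
    clusters = defaultdict(list)
    for offer in offers:
        if offer.get("product_id"):
            clusters[offer["product_id"]].append(offer)
    return clusters


def match_offers_without_ids(offers, clusters, threshold=0.6):
    """Assign ID-less offers to the most similar cluster (stand-in for a learned matcher)."""
    extended = {pid: list(members) for pid, members in clusters.items()}
    for offer in offers:
        if offer.get("product_id"):
            continue
        best_id, best_score = None, 0.0
        for pid, members in clusters.items():
            score = max(jaccard(tokens(offer["title"]), tokens(m["title"])) for m in members)
            if score > best_score:
                best_id, best_score = pid, score
        if best_id is not None and best_score >= threshold:
            extended[best_id].append(offer)
    return extended


def combine_specs(members):
    """Union the spec tables of all offers in a cluster; the first value seen wins."""
    combined = {}
    for offer in members:
        for key, value in offer.get("specs", {}).items():
            combined.setdefault(key, value)
    return combined
```

Usage with made-up offers: two offers annotated with the same ID form a cluster, a third offer without an ID but with a near-identical title is attached to it, and `combine_specs` then yields the union of all three specification tables.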
  • 46. Use Case 3: Distant Supervision for Information Extraction
    – Goal: Learn how to extract data from websites that do not provide schema.org annotations, using existing annotations as training data.
    – Example: Small Events Paper @ SIGIR 2015
      – extract data about small venue concerts, theatre performances, garage sales, etc.
      – they use 217,000 schema.org events from ClueWeb2012 as distant supervision, e.g.
        <div itemtype="http://schema.org/MusicEvent">
          <h1 itemprop="name">School Cool Concert</h1>
          <p itemprop="startDate">25.09.2018 15:00</p>
          <p itemprop="location" itemtype="http://schema.org/Place">School Gym Puero High</p>
        </div>
    Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR 2015.
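To illustrate how such annotations can be turned into labeled training examples, the following sketch reads `itemprop` values from a Microdata snippet with Python's standard `html.parser` module. It is a simplified reader, not the extraction method of the SIGIR paper: it assumes well-formed markup, ignores void elements and nested items, and the class and function names (`ItemPropExtractor`, `extract_itemprops`) are hypothetical.

```python
from html.parser import HTMLParser


class ItemPropExtractor(HTMLParser):
    """Collects itemtype URLs and the text content of elements carrying itemprop."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._open = []  # stack of (tag, itemprop name or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        self._open.append((tag, attrs.get("itemprop")))

    def handle_endtag(self, tag):
        # Pop until the matching start tag is found (tolerates unclosed inner tags).
        while self._open:
            open_tag, _ = self._open.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._open and self._open[-1][1]:
            prop = self._open[-1][1]
            self.properties[prop] = (self.properties.get(prop, "") + " " + text).strip()


def extract_itemprops(html):
    """Return (list of itemtype URLs, dict of itemprop name -> text)."""
    parser = ItemPropExtractor()
    parser.feed(html)
    parser.close()
    return parser.item_types, parser.properties
```

Applied to a well-formed version of the event snippet on this slide, the function yields the two type URLs and the properties `name`, `startDate`, and `location`, which can serve as (text span, label) pairs for training an extractor.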
  • 47. Small Events Paper @ SIGIR 2015
    – Approach
      – train an information extraction method that combines structural and text features
      – extracted attributes: What? When? Where?
      – apply method to pages without semantic annotations
    – Result
      – extract 200,000 additional small-scale events from ClueWeb2012
      – double recall at 0.85 precision
    Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR 2015.
  • 48. Use Case 4: Training Data for Taxonomy Matching and Taxonomy Clustering
    – E-shops annotate their product categorization
    – The marketing ideas behind these categorizations differ widely
    Relevant Properties | # PLDs | %
    schema:Offer/category | 62,170 | 7 % of all shops
    schema:WebPage/BreadcrumbList | 621,344 | 11 % of all sites
    Example category paths:
    – Apparel > Summer > Mens > Jerseys
    – Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jersey > over $60 s
    – Home > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables
    – Shop > Tables > Dining Tables > Aland Wood Tables
    Meusel, et al.: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. EC-Web 2015.
  • 49. Cool Taxonomy Clustering Problem
    – How to cluster 62,000 taxonomies in order to discover the most frequent categorization ideas within a domain?
      – How do shops categorize action cameras?
      – What subcategories are frequently used for action cameras?
  • 50. Learn How to Match Product Taxonomies using Schema.org Data as Supervision
    [Figure: Category A and Category B from taxonomies that share some products (P1, P2, P3) are used to learn a matcher; the matcher then decides whether unseen categories are similar]
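One way to read this slide is that product overlap between categories of different shops provides the labels for training a category matcher. The sketch below only derives such labeled pairs and does not train a model. It assumes each shop is given as a mapping from category path to the set of product IDs listed under it, and it labels a pair as matching when the Jaccard overlap of the product sets reaches a threshold. The function name `labeled_category_pairs`, the threshold value, and the sample data in the usage are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def labeled_category_pairs(shop_a, shop_b, threshold=0.5):
    """Label every cross-shop category pair by the overlap of its product ID sets.

    Returns a list of (category_a, category_b, is_match) tuples that could be
    used as training data for a matcher working on category names or paths.
    """
    pairs = []
    for (cat_a, ids_a), (cat_b, ids_b) in product(shop_a.items(), shop_b.items()):
        pairs.append((cat_a, cat_b, jaccard(ids_a, ids_b) >= threshold))
    return pairs
```

For example, if "Shop > Tables > Dining Tables" in one shop and "Home > Garden Furniture > Tables" in another share two of three products, the pair is labeled as a match, while pairs without common products become negative examples.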
  • 51. Conclusions
    1. Internal Knowledge Graph Completion
      – Interesting, but not ready for prime time yet
      – Focus on link prediction, hardly any results on other tasks
    2. KG Completion using Small Sets of External Sources
      – Production quality results for all tasks
      – Manual effort per source required
      – Bottleneck is the availability and cost of external data sources
  • 52. Conclusions
    3. KG Completion using Web Table Data
      – Wide coverage of long-tail entities
      – Wide coverage of long-tail properties
      – Difficult to get property semantics right (page context required)
    4. KG Completion using Schema.org Annotations
      – Comprehensive coverage of specific domains
      – Potential as training data for different KG completion tasks
  • 53. Thank you.
    – Web Table Corpus: http://webdatacommons.org/webtables/
    – Semantic Annotations: http://webdatacommons.org/structureddata/
    – Product Matching Training Data and Gold Standard: http://webdatacommons.org/largescaleproductcorpus/v2/