Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web

Stefan Dietze
L3S Research Center, Hannover, Germany
- Linked Data on the Web (LDOW2017), WWW2017 -
05/04/17

Research @ L3S
Research areas
• Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
• Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility, ...
Some projects
• See also: http://www.l3s.de

Acknowledgements: team
• Pavlos Fafalios (L3S)
• Besnik Fetahu (L3S)
• Elena Demidova (L3S)
• Ujwal Gadiraju (L3S)
• Eelco Herder (L3S)
• Ivana Marenzi (L3S)
• Nicolas Tempelmeier (L3S)
• Ran Yu (L3S)
• Nilamadhaba Mohapatra (L3S, IIT India)
• Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
• Mathieu d'Aquin (The Open University, UK)
• Mohamed Ben Ellefi (LIRMM, France)
• Davide Taibi (CNR, Italy)
• Konstantin Todorov (LIRMM, France)
• ...

Back in September 2016
A new look at the semantic web. Abraham Bernstein, James Hendler, Natalya Noy, Communications of the ACM, Vol. 59 No. 9, Pages 35-37, September 2016.
Retrieval, Crawling and Fusion of Entity-centric Data on the Web, Dietze, S., In: Calì A., Gorgan D., Ugarte M. (eds) Semantic Keyword-Based Search on Structured Data Sources. KEYSTONE 2016. LNCS, Vol 10151. Springer, 2017.

Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs: dealing with heterogeneity & shortcomings ("Present")
• Dataset recommendation
• Dataset profiling
• Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics: other emerging forms of semantics/structured data on the Web ("Future")
• Web markup as emerging data source
• Case studies
• Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up

Data accessibility & quality?
Accessibility of (linked) datasets?
• Fewer than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
• "THE" SPARQL protocol? No: variants, subsets and local restrictions
Semantics, links, quality?
• ...data accuracy (e.g. DBpedia)? [Paulheim2013]
• ...schema compliance & evolution [HoganJWS2012]
• ...vocabulary reuse? [D'AquinWebSci13]
Figure: SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.
An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.
SPARQL Web-Querying Infrastructure: Ready for Action?, Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., International Semantic Web Conference 2013 (ISWC2013).

Vocabulary reuse/linking?
Figure: co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates); example types: po:Programme, yov:Video, bibo:Book
Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, May 2013.

Vocabulary reuse/linking?
Figure: co-occurrence of types before mapping (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates) and after mapping (201 frequently occurring types, mapped into 79 types); example types: bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video
Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, May 2013.
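A minimal sketch of how type co-occurrence can be counted before and after a type mapping. The mapping table and the three toy datasets are hypothetical and only illustrate the idea; they are not taken from the cited study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from dataset-specific types to shared types.
TYPE_MAPPING = {
    "po:Programme": "bibo:Film",
    "yov:Video": "bibo:Film",
    "bibo:Book": "bibo:Document",
    "foaf:Document": "bibo:Document",
}


def cooccurrence(datasets, mapping=None):
    """Count how often two types are used within the same dataset."""
    mapping = mapping or {}
    counts = Counter()
    for types in datasets.values():
        # Apply the mapping (if any), then count each unordered pair once.
        mapped = sorted({mapping.get(t, t) for t in types})
        for pair in combinations(mapped, 2):
            counts[pair] += 1
    return counts


# Toy input (made up): dataset name -> set of types used in it.
datasets = {
    "d1": {"po:Programme", "bibo:Book"},
    "d2": {"yov:Video", "foaf:Document"},
    "d3": {"bibo:Book", "yov:Video"},
}

before = cooccurrence(datasets)
after = cooccurrence(datasets, TYPE_MAPPING)
```

Without the mapping no type pair is shared by two datasets; with it, all three datasets end up using the same pair of types.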

"Completeness"?
• Example: varying completeness of "book" ("movie") entity descriptions
• Missing facts: 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in Freebase and 60.9% (40%) in Wikidata (varies heavily across attributes)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
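One plausible way to read a "missing facts" percentage is as the share of empty (entity, attribute) slots. The sketch below assumes that definition, which may differ from the one used in the cited papers; the two book records are toy data.

```python
def missing_fact_ratio(entities, attributes):
    """Share of (entity, attribute) slots that carry no value."""
    slots = len(entities) * len(attributes)
    if slots == 0:
        return 0.0
    missing = sum(1 for e in entities for a in attributes if not e.get(a))
    return missing / slots


# Toy records (illustrative only, not taken from any knowledge base dump).
books = [
    {"name": "Brideshead Revisited", "author": "Evelyn Waugh", "isbn": None},
    {"name": "Decline and Fall", "author": None, "isbn": None},
]

ratio = missing_fact_ratio(books, ["name", "author", "isbn"])
```

Here three of the six slots are empty, so the ratio is 0.5.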

Consistency?
Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., ISWC2014

Challenge for search/retrieval – heterogeneity of datasets & entities
Discovery of suitable (1) datasets & (2) entities:
• Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
• Topics/scope? Datasets/entities useful & trustworthy for topic XY?
• Types? Datasets/entities about statistics, organisations, videos, slides, publications etc.?

Dataset recommendation I
Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment)
Approach
• Given dataset s, rank the datasets from D according to the probability score (di, t) that they contain linking candidates (entities)
• Features:
  o Approach 1: vocabulary overlap
  o Approach 2: existing links (SNA)
• Linking candidates are likely if datasets share common (a) schema elements, or (b) links (friend of a friend)
Conclusions
• Roughly 50% MAP for both approaches
• Simplistic approach (!)
Figure: source dataset S with Linkset1 and Linkset2; example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa
Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.
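The vocabulary-overlap idea can be sketched as a ranking by set similarity. Jaccard similarity is an assumption made here for illustration (the cited paper may score overlap differently), and the vocabulary terms per dataset are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_vocab, candidates):
    """Rank candidate datasets by vocabulary overlap with the source dataset."""
    scored = [(name, jaccard(source_vocab, vocab)) for name, vocab in candidates.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical vocabulary terms used by each dataset.
source = {"foaf:Person", "dc:title", "bibo:Article"}
candidates = {
    "DBLP": {"foaf:Person", "dc:title", "bibo:Article", "dc:creator"},
    "ACM": {"foaf:Person", "dc:title"},
    "Roma": {"geo:lat", "geo:long"},
}

ranking = rank_candidates(source, candidates)
```

A dataset that shares no schema elements with the source ends up at the bottom of the ranking with score 0.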

Dataset recommendation II
Figure labels: Preprocessing, Datasets ranking, Datasets filtering
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.

Dataset recommendation II: results
Data & ground truth
• Experiments on (responsive) datasets from LOD Cloud (http://datahub.io)
• Concept profiles from http://lov.okfn.org
• Ground truth: existing links from VoID profiles of datasets (issue: not always representative of actual linksets)
Results
• MAP across the similarity thresholds from step 2 peaks at 54% (UMBC@0.7)
• Recall 100% below indicated similarity (clustering) thresholds
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
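For readers less familiar with the metric, this is the standard definition of mean average precision (MAP) over ranked recommendation lists. The two runs are toy data, not results from the paper.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over all (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy runs: (ranked dataset recommendations, datasets with known links).
runs = [
    (["A", "B", "C", "D"], {"A", "C"}),
    (["B", "A", "D"], {"A"}),
]

map_score = mean_average_precision(runs)
```

The first run has an average precision of 5/6 and the second of 1/2, which gives a MAP of 2/3.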

Dataset search through dataset cataloging & profiling
Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/
• LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
• LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 datasets)
• Original datasets published with key content providers, automatically extracted metadata

LinkedUp Catalog: dataset index & registry, federated search
• "Federated queries" through schema mappings [WebSci13]
• Dataset accessibility
• Linking & topic profiling [ESWC14]
Figure labels: Schema/Types, Dataset topic profiles

Scalable profiling of datasets
• Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets?
• Technically trivial through established NER/NED approaches, but scalability issues (recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
• Efficient approach: sampling & ranking for balance between scalability and precision/recall
Example (figure): a Yovisto video resource (yov:Video) from the dataset catalog/registry, linked to the topics db:Biology, db:Cell biology and db:Cell (Biology)
<yo:Video …>
<dc:title>Lecture 29 – Stem Cells</dc:title>
…
</yo:Video…>
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

Efficient dataset profiling
1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight, category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
• Result: weighted dataset-topic profile graph
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
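Steps 1 and 3 can be sketched as follows, assuming plain random sampling and a power-iteration form of PageRank with Priors. Step 2 (entity and topic extraction) is skipped, and the toy graph, the node names and the handling of dangling nodes are illustrative assumptions rather than the published pipeline.

```python
import random


def sample_resources(resources, fraction, seed=0):
    """Step 1 (random variant): draw a fixed fraction of a dataset's resources."""
    rng = random.Random(seed)
    k = max(1, round(len(resources) * fraction))
    return rng.sample(resources, k)


def pagerank_with_priors(edges, priors, beta=0.15, iterations=50):
    """Step 3: rank nodes by a random walk that restarts at the prior nodes.

    edges maps a node to its out-neighbours; priors maps nodes to restart
    probabilities that sum to 1; beta is the restart probability.
    """
    nodes = set(edges) | {t for ts in edges.values() for t in ts} | set(priors)
    rank = {n: priors.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: beta * priors.get(n, 0.0) for n in nodes}
        for n in nodes:
            targets = edges.get(n, [])
            if targets:
                share = (1 - beta) * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its mass back to the priors (a design choice).
                for p, w in priors.items():
                    nxt[p] += (1 - beta) * rank[n] * w
        rank = nxt
    return rank


sample = sample_resources(list(range(20)), fraction=0.1)

# Toy dataset-topic graph (hypothetical): the dataset node points to extracted topics.
edges = {
    "dataset": ["db:Biology", "db:Cell_biology", "db:History"],
    "db:Cell_biology": ["db:Biology"],
}
rank = pagerank_with_priors(edges, priors={"dataset": 1.0})
```

In this toy graph db:Biology outranks db:History because it also receives rank mass through db:Cell_biology; the resulting scores can serve as edge weights of a dataset-topic profile.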

Search & exploration of datasets through topic profiles
• Applied to entire LOD cloud/graph
• Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)
• Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/

Search: entity retrieval on large LD crawls?
• How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query?
• State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
• Challenges/observations:
  o Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state-of-the-art methods
  o Query type affinity?
Example (figure): entities related to <Tim Berners Lee> in a large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO)

Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
(II) Online processing (retrieval)
1. Retrieval & expansion:
a) BM25F results
b) expansion from clusters (related entities)
2. Re-ranking (context terms & query type affinity)
Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
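The online stage can be pictured roughly as below. This is a simplified sketch: the baseline ranking is assumed to come from a BM25F-style retriever, the clusters from the offline stage, and plain query-term overlap stands in for the re-ranking by context terms and query type affinity. All entity ids and descriptions are made up.

```python
def expand_and_rerank(query_terms, baseline, clusters, descriptions, top_k=5):
    """Expand baseline results with cluster members, then re-rank by term overlap."""
    candidates = list(baseline)
    seen = set(baseline)
    for entity in baseline:
        for related in clusters.get(entity, []):
            if related not in seen:
                seen.add(related)
                candidates.append(related)

    def score(entity):
        terms = set(descriptions.get(entity, "").lower().split())
        return len(terms & query_terms) / len(query_terms) if query_terms else 0.0

    # sorted() is stable, so ties keep the baseline-then-expansion order.
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Hypothetical entity descriptions and cluster assignments.
descriptions = {
    "e1": "Tim Berners-Lee inventor",
    "e2": "Berners family",
    "e3": "Tim Berners Lee W3C director",
}
clusters = {"e1": ["e3"]}

result = expand_and_rerank({"tim", "berners", "lee"}, ["e2", "e1"], clusters, descriptions)
```

The entity e3 is not in the baseline results but is reached through the cluster of e1 and ends up first, which is the effect the clustering is meant to have when explicit links are sparse.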

Entity retrieval: evaluation
Dataset
• BTC2014 (4 billion entities)
• 92 SemSearch queries
Methods
• Our approaches: XM: X-means, SP: Spectral
• Baselines: B: BM25F, S1: Tonon et al [SIGIR12]
Conclusions
• XM & SP outperform baselines
• Clustering to remedy link sparsity (yet extensive offline processing required)
• Relevance to query more important than relevance to BM25F results
Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).

PROFILES2017 - Profiling & search of Linked Data
https://profiles2017.wordpress.com/
• Probably co-located with ISWC2017 (Vienna)
• Submissions due 21 June

Web semantics & entity-centric Web data
• Linked Data: approx. 1000+ datasets & 100 billion statements
• Open Data: XXX datasets
• Web (of documents): approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google
• Other forms of Web semantics and entity-centric knowledge?
  o Dynamics?
  o Quality?
  o Accessibility?
  o Scale?

Embedded Web page markup & schema.org
• Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
• Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
• Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
• "Web Data Commons" (Meusel & Paulheim [ISWC2014])
  o Markup from Common Crawl (3.2 billion pages): 44 billion RDF quads (2016)
  o Markup in 38% of pages in 2016
• Same order of magnitude as "the Web" (!)
Example markup:
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements (extracted from several pages):
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
http://webdatacommons.org
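To make the step from page markup to statements concrete, here is a minimal Microdata reader built on Python's standard `html.parser`. It is a sketch under strong assumptions: a single flat item per page, text-valued properties only, no nested items and no `content` or `href` values. It is not the extraction pipeline used by Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects (item type, property, text value) triples from flat Microdata."""

    def __init__(self):
        super().__init__()
        self.item_type = None      # itemtype of the most recent itemscope
        self.current_prop = None   # itemprop waiting for its text value
        self.facts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            self.facts.append((self.item_type, self.current_prop, text))
            self.current_prop = None


html = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""

parser = MicrodataParser()
parser.feed(html)
facts = parser.facts
```

Each extracted triple corresponds to one statement about the movie; text outside an `itemprop` element, such as the "Actor:" label, is ignored.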

Example: embedded Web markup about "products"
• schema:Product instances in WDC2015
• Facts: 1.414.937.431 (= 302.246.120 instances, i.e. products)
• Providers (distinct Pay Level Domains, PLDs): 93.705
• Power law distribution of terms across PLDs
• Top 10 PLDs
• Top provider? (company)
PLD                          # Resources
www.crateandbarrel.com       33.517.936
www.bentgate.com             17.215.499
www.aliexpress.com            9.621.943
www.ebay.com.au               8.861.308
us.fotolia.com                7.939.982
www.ebay.co.uk                6.556.820
www.competitivecyclist.com    6.214.500
www.maxstudio.com             6.075.626
Figure annotation: approx. 35 million resources

Example: markup of bibliographic resources
Study on sample Web crawl (WDC2015)
• Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6.793.764 quads, 1.184.623 entities, 429 distinct predicates (in WDC and for 1 type alone)
• Top 5 domains: Springer, MDPI, BMJ, mendeley.com, Biodiversitylibrary.org
Domains, topics, disciplines?
• Life Sciences and Computer Science predominant
• Top-10 article titles
• Noise
Figure: count (log scale) per PLD (ranked); series: # entities, # statements
Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, SAVE-SD2016, co-located with WWW2016
Example: markup of learning resources on the Web
• "Learning Resources Metadata Initiative (LRMI)": schema.org vocabulary for annotation of learning resources
• Developed through DCMI Task Force on LRMI
• Approx. 5000 PLDs (incl. subdomains) in CC
• LRMI adoption (WDC) [WWW17]:
  o 2015: 44,108,511 quads
  o 2014: 30,599,024 quads
  o 2013: 10,636,873 quads
• Frequent errors and unintended use (e.g. porn)
Example PLDs (figure): 7xxxtube.com, 1amateurporntube.com, virtualpornstars.com, sunriseseniorliving.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de
Dietze, S., Taibi, D., Yu, R., Barker, P., d'Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.

Entity retrieval on Web markup: state of the art
• Glimmer (http://glimmer.research.yahoo.com)
• Entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]
• BM25F retrieval model on WDC index

Web markup: challenges
Characteristics: Example
• Coreferences: 18.000 results for <"Iphone 6", type, s:Product> (8,6 quads on average) in CommonCrawl
• Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in CC
• Lack of links: largely unlinked entity descriptions
• Errors (typos & schema violations, see Meusel et al [ESWC2015]):
  o Wrong namespaces, such as http://schma.org
  o Undefined types & predicates: 9,7%, less common than in LOD
  o Confusion of datatype and object properties: <s1, s:publisher, "Springer">, 24,35% object property issues vs 8% in LOD
  o Data property range violations: e.g. literals vs numbers (12,6% vs 4,6% in LOD)
• Using markup as knowledge graph, similar to Linked Data?
"Strings, not things"
• Bias towards datatype properties / using any property as such (!)
• Numbers from LRMI2015 markup corpus:
  o 46 million "transversal" quads (i.e. excluding hierarchical statements such as rdf:type)
  o 64% are actual datatype properties, yet 97% refer to literals (up from 70% in 2013)
• Challenges
  o Markup data = flat entity descriptions (=> fairly unconnected graph)
  o Data reuse requires identity resolution
A Survey on Challenges for Entity Retrieval in Markup Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th International Semantic Web Conference (ISWC2016), Kobe, Japan (2016).
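Namespace typos such as http://schma.org can often be caught with simple string similarity against a list of expected hosts. The sketch below uses `difflib` from the standard library; the host list and the 0.8 cutoff are assumptions chosen for illustration, not values from the cited survey.

```python
import difflib
from urllib.parse import urlsplit

# Hypothetical whitelist of vocabulary hosts; extend as needed.
KNOWN_HOSTS = ["schema.org", "purl.org", "xmlns.com", "www.w3.org"]


def check_namespace(uri, known_hosts=KNOWN_HOSTS, cutoff=0.8):
    """Classify a term URI as ok, a probable typo of a known host, or unknown."""
    host = urlsplit(uri).netloc.lower()
    if host in known_hosts:
        return ("ok", host)
    close = difflib.get_close_matches(host, known_hosts, n=1, cutoff=cutoff)
    if close:
        return ("probable typo", close[0])
    return ("unknown", None)


typo = check_namespace("http://schma.org/Product")
valid = check_namespace("http://schema.org/Movie")
other = check_namespace("http://data.linkededucation.org/x")
```

A "probable typo" result could be used to rewrite the namespace before indexing, while "unknown" hosts are left untouched because they may be legitimate vocabularies.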

Entity retrieval & reconciliation on markup
• Obtaining consolidated & verified entity description/facts (or graph) for a given resource/entity from Web markup?
• Aiding tasks such as document annotation, augmentation or enrichment of existing data- or knowledge bases/graphs
Query: iPhone 6, type:(Product)
Web (crawl), e.g. Common Crawl/WDC, focused crawl:
<e1, s:name, "Iphone 6">
<e2, s:brand, "Apple Inc.">
<e3, s:brand, "Apple">
<e4, s:weight, 127>
<e5, s:releaseDate, "1.12.1972">
Entity Description
brand: Apple Inc.
weight: 129
date: 30.09.2015
manufacturer: Foxconn
Storage: 16 GB
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
FuseM: query-centric data fusion on Web markup
1. Matching
• Entity matching: BM25 entity retrieval model on markup index (Common Crawl) & similarity-based matching
2. Fact selection (supervised SVM classifier)
• Data fusion: ML classifier (SVM, kNN, RandomForest), 3 feature categories (relevance, authority, clustering)
Query: iPhone 6, type:(Product)
Web (crawl), Web page markup: approx. 125.000 facts for "iPhone6"
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
Entity Description
brand: Apple Inc.
weight: 129
date: 30.09.2015
manufacturer: Foxconn
Storage: 16 GB
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
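To illustrate what fact selection has to achieve, the sketch below fuses candidate facts by simple majority voting. This is deliberately NOT the FuseM method, which uses a supervised classifier over relevance, authority and clustering features; voting is swapped in as the simplest possible stand-in. The node4 facts and the `min_support` threshold are hypothetical additions.

```python
from collections import Counter, defaultdict


def fuse_by_vote(candidate_facts, min_support=2):
    """Keep, per predicate, the most frequent value if enough sources agree."""
    votes = defaultdict(Counter)
    for _source, predicate, value in candidate_facts:
        votes[predicate][value.strip().lower()] += 1
    fused = {}
    for predicate, counter in votes.items():
        value, support = counter.most_common(1)[0]
        if support >= min_support:
            fused[predicate] = value
    return fused


candidate_facts = [
    ("node1", "brand", "Apple Inc."),
    ("node1", "weight", "129"),
    ("node2", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "releasedate", "01.12.1972"),
    ("node3", "manufacturer", "Foxconn"),
    # Hypothetical extra source, added so that some values reach min_support.
    ("node4", "brand", "apple inc."),
    ("node4", "weight", "129"),
]

fused = fuse_by_vote(candidate_facts)
```

Voting drops the implausible release date only because a single source states it; with many copies of the same error it would fail, which is one motivation for learned features such as source authority.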

FuseM classifier: features

Evaluation & results: data fusion performance
Setup
• Dataset: Products, Movies, Books (approx. 3 billion facts) from Common Crawl / WDC
• Baselines:
  o BM25: top-k diverse facts via BM25 (Glimmer)
  o CBFS: clustering-based approach [ESWC2015]
  o PreRecCorr: "Fusing data with correlations" [Pochampally et al., ACM SIGMOD 2014]
• 10-fold cross validation
Results
• FuseM beats baselines in both tasks (strong variance of baselines across tasks)
• All feature categories contribute
Figures: query-centric data fusion (precision); query-independent data fusion (P/R/F1)
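The P/R/F1 figures follow the usual set-based definitions, shown here for a set of selected facts against a set of correct facts. The example facts are toy data, not evaluation data from the paper.

```python
def precision_recall_f1(selected, correct):
    """Set-based precision, recall and F1 of selected facts against a gold set."""
    selected, correct = set(selected), set(correct)
    tp = len(selected & correct)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: facts chosen by a fusion method vs. facts judged correct.
selected = {("brand", "Apple Inc."), ("weight", "129"), ("releaseDate", "1.12.1972")}
correct = {
    ("brand", "Apple Inc."),
    ("weight", "129"),
    ("manufacturer", "Foxconn"),
    ("storage", "16 GB"),
}

p, r, f1 = precision_recall_f1(selected, correct)
```

Two of the three selected facts are correct (precision 2/3) and two of the four correct facts are found (recall 1/2), giving an F1 of 4/7.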

Results: example of fused entity description
• Data fusion result for book "Brideshead Revisited" (20 distinct facts)
New facts (compared to DBpedia):
• 60% - 70% of all facts for books & movies new (across all KBs)
• 100% new for products ("long tail entities" not existing in KBs yet)
Figure label: new facts and attributes

Results: KB augmentation
• Augmentation of 15 properties of books (& movies) in three KBs
  o DB: DBpedia
  o FB: Freebase
  o WD: Wikidata
• Augmentation performance: % of filled slots (or "knowledge gaps") in KB
• Performance varies heavily (yet some attributes completed to 100%)
Figure: KBA result for entities of type "Book"
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.

Conclusions & outlook
• Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc.
• Dealing with diversity & heterogeneity
  o Profiling & recommendation: dataset search & recommendation
  o Entity retrieval & clustering: entity search
• New forms of (structured) Web data: Web markup (schema.org et al.) & tables
  o Convergence of structured and unstructured Web (e.g. Voldemort KG, Tonon et al., ISWC2016)
  o Scale and dynamics (!)
  o Potential to augment existing knowledge graphs (e.g. Google KG or Microsoft Satori)
  o Potential training data for NED, entity interlinking and other entity-centric tasks (e.g. OKE Challenge)
Figure: Linked Data & knowledge graphs, embedded data/markup/tables, unstructured (Web) data/docs; example entities:
node1 name Molecular structure of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node2 name Francis Crick
node2 name Cricks
node2 born 1916

Contact & resources
@stefandietze
http://stefandietze.net
More on Web markup: talk on Wednesday, 11:00, WWW2017/Digital Learning track
