Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
1. Beyond Linked Data –
Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
L3S Research Center, Hannover, Germany
- Linked Data on the Web (LDOW2017), WWW2017 -
05/04/17 1Stefan Dietze
2. Research areas
Web science, Information Retrieval, Semantic Web, Social Web
Analytics, Knowledge Discovery, Human Computation
Interdisciplinary application areas: digital humanities,
TEL/education, Web archiving, mobility, ...
Some projects
Research @ L3S
See also: http://www.l3s.de
3. Acknowledgements: team
Pavlos Fafalios (L3S)
Besnik Fetahu (L3S)
Elena Demidova (L3S)
Ujwal Gadiraju (L3S)
Eelco Herder (L3S)
Ivana Marenzi (L3S)
Nicolas Tempelmeier (L3S)
Ran Yu (L3S)
Nilamadhaba Mohapatra (L3S, IIT India)
Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
Mathieu d'Aquin (The Open University, UK)
Mohamed Ben Ellefi (LIRMM, France)
Davide Taibi (CNR, Italy)
Konstantin Todorov (LIRMM, France)
...
4. Back in September 2016
A new look at the semantic web. Abraham
Bernstein, James Hendler, Natalya Noy,
Communications of the ACM, Vol. 59 No. 9, Pages 35-
37, September 2016
Retrieval, Crawling and Fusion of Entity-centric Data
on the Web, Dietze, S., in: Calì A., Gorgan D.,
Ugarte M. (eds), Semantic Keyword-Based Search on
Structured Data Sources, KEYSTONE 2016, LNCS, Vol.
10151, Springer, 2017.
5. Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
Dataset recommendation
Dataset profiling
Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
Web markup as emerging data source
Case studies
Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of semantics/structured data on the Web ("Future")
Dealing with heterogeneity & shortcomings ("Present")
6. Data accessibility & quality?
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of (linked) datasets?
Fewer than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
“THE” SPARQL protocol? No: there are variants, subsets and local restrictions
Semantics, links, quality?
…data accuracy (e.g. DBpedia)? [Paulheim2013]
…schema compliance & evolution [HoganJWS2012]
…vocabulary reuse? [D’AquinWebSci13]
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A.,
Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC
2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth,
A., Cyganiak, R., Polleres, A., Decker., S., Journal of Web Semantics 14, 2012
SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda,
Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International
Semantic Web Conference 2013 (ISWC2013).
7. Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web
Science 2013 (WebSci2013), Paris, May 2013.
po:Programme
yov:Video
?
bibo:Book
Vocabulary reuse/linking?
8. Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types)
Types shown: bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video, typeX
Vocabulary reuse/linking?
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web
Science 2013 (WebSci2013), Paris, May 2013.
9. “Completeness” ?
Example: varying completeness of “book” (“movie”) entity
descriptions
Missing facts: 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in
Freebase and 60.9 % (40%) in Wikidata
(varies heavily across attributes)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze,
D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
10. Consistency?
Analyzing Relative Incompleteness of Movie
Descriptions in the Web of Data: A Case Study,
Yuan, W., Demidova, E., Dietze, S., Zhu, X.,
ISWC2014
11. Challenge for search/retrieval – heterogeneity of datasets & entities
Discovery of suitable (1) datasets & (2) entities:
Quality? Currency, dynamics, accessibility/reliability,
data quantity & quality?
Topics/scope? Datasets/entities useful & trustworthy for
topic XY?
Types? Datasets/entities about statistics, organisations,
videos, slides, publications etc?
12. Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
Dataset recommendation
Dataset profiling
Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
Web markup as emerging data source
Case studies
Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of semantics/structured data on the Web ("Future")
Dealing with heterogeneity & shortcomings ("Now")
13. Dataset recommendation I
S
Linkset1
Linkset2
Approach
Given a dataset s, rank the datasets di from D
by a probability score (di, t) of containing
linking candidates (entities)
Features:
Approach 1: vocabulary overlap
Approach 2: existing links (SNA)
Linking candidates likely if datasets share
common (a) schema elements, or (b) links
(friend of a friend)
Conclusions
Roughly 50% MAP for both approaches
Simplistic approach (!)
Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova,
M.A., Dietze, S., Two approaches to the dataset
interlinking recommendation problem, 15th
International Conference on Web Information System
Engineering (WISE 2014), Thessaloniki, Greece.
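Approach 1 (vocabulary overlap) can be illustrated with a minimal sketch. This is not the probabilistic scoring from the cited paper: it swaps in plain Jaccard overlap of schema terms as a stand-in, and the dataset names and term sets below are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets of vocabulary terms (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_terms: set, candidates: dict) -> list:
    """Rank candidate datasets by schema overlap with the source dataset."""
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary usage per dataset (illustration only).
source = {"foaf:Person", "bibo:Article", "dc:title", "dc:creator"}
candidates = {
    "DBLP": {"foaf:Person", "bibo:Article", "dc:title", "dc:creator", "swrc:series"},
    "ACM": {"foaf:Person", "bibo:Article", "dc:title"},
    "Geo": {"geo:lat", "geo:long", "dc:title"},
}
ranking = rank_candidates(source, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Datasets sharing more schema elements with the source end up higher in the list, which is the intuition behind "linking candidates likely if datasets share common schema elements".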
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
?
?
Goal: finding candidate datasets, e.g. for entity retrieval
or interlinking tasks (e.g. enrichment)
14. Dataset recommendation II
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May 2016.
L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.
Pipeline steps: (1) Preprocessing, (2) Datasets filtering, (3) Datasets ranking
15. Dataset recommendation II: results
Data & ground truth
Experiments on (responsive) datasets
from LOD Cloud (http://datahub.io)
Concept profiles from
http://lov.okfn.org
Ground truth: existing links from VOID
profiles of datasets
(issue: not always representative for
actual linksets)
Results
MAP across the similarity thresholds of step 2
(filtering) peaks at 54% (UMBC@0.7)
Recall is 100% below the indicated similarity
(clustering) thresholds
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May 2016.
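The filtering and ranking steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: profiles are modelled as hypothetical concept-to-weight dictionaries, and the filtering similarity (LSA- or WordNet-based in the talk) is replaced by a precomputed, made-up score per candidate.

```python
import math


def cosine(p: dict, q: dict) -> float:
    """Cosine similarity between two sparse concept profiles."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def recommend(source: dict, candidates: dict, filter_sim: dict, theta: float) -> list:
    """Keep candidates whose filtering similarity reaches theta, then rank by cosine."""
    kept = [name for name in candidates if filter_sim[name] >= theta]
    return sorted(kept, key=lambda name: cosine(source, candidates[name]), reverse=True)


# Hypothetical profiles and filtering similarities (assumed values).
source = {"Person": 1.0, "Article": 2.0}
candidates = {
    "A": {"Person": 1.0, "Article": 2.0},
    "B": {"Person": 1.0, "Place": 1.0},
    "C": {"Article": 1.0},
}
filter_sim = {"A": 0.9, "B": 0.8, "C": 0.5}
recommended = recommend(source, candidates, filter_sim, theta=0.7)
print(recommended)
```

Raising theta shrinks the candidate cluster before ranking, so the threshold directly trades recall for precision.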
16. Dataset search through dataset cataloging & profiling
Dataset
Catalog/Registry
http://data.linkededucation.org/linkedup/catalog/
LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)
LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 Datasets)
Original datasets published with key content providers, automatically extracted metadata
19. Scalable profiling of datasets
Example: Yovisto Video (yov:Video) from the dataset catalog/registry, with topics db:Biology and db:Cell biology
<yo:Video …>
<dc:title>Lecture 29 –
Stem Cells</dc:title>
…
</yo:Video…>
Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets?
Technically trivial through established NER/NED approaches, but scalability issues
(recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
Efficient approach: sampling & ranking for balance between scalability and precision/recall
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl,
W., 11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
db:Cell
(Biology)
20. Efficient dataset profiling
1. Sampling of resources
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight,
category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as
PageRank with Priors, HITS with Priors & K-Step Markov)
Result: weighted dataset-topic profile graph
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl,
W., 11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
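Step 1 (sampling) can be illustrated with a small sketch of the weighted sampling strategy, following the rule given in the editor's notes: a resource is weighted by its number of datatype properties relative to the maximum in the dataset, and it is included when the weight exceeds 1 - p for a uniformly drawn p. The function name and the input format (resource id mapped to a property count) are assumptions made for illustration.

```python
import random


def weighted_sample(resources: dict, rng: random.Random) -> list:
    """Sample resources; `resources` maps a resource id to its number of datatype properties."""
    if not resources:
        return []
    max_props = max(resources.values())
    if max_props == 0:
        return []
    sample = []
    for rid, n_props in resources.items():
        weight = n_props / max_props      # w_k = |f(r_k)| / max |f(r_j)|
        p = rng.random()                  # uniform draw in [0, 1)
        if weight > 1 - p:                # richer resources pass this test more often
            sample.append(rid)
    return sample


# Hypothetical dataset: three resources with 10, 5 and 0 datatype properties.
rng = random.Random(42)
picked = weighted_sample({"r1": 10, "r2": 5, "r3": 0}, rng)
print(picked)
```

A resource without any datatype properties has weight 0 and is never selected, while the richest resource is selected for practically every draw.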
21. Search & exploration of datasets through topic profiles
Applied to entire LOD cloud/graph
Visual exploration of extracted RDF dataset profiles
(datasets, topics, relationships)
Evaluation results: K-Step Markov (10% sampling size)
outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/
22. Search: entity retrieval on large LD crawls?
How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query?
State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
Challenges/observations:
Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state of the art methods
Query type affinity?
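For readers unfamiliar with BM25F, here is a minimal scoring sketch over a fielded entity description. It follows the commonly used formulation (field-weighted, length-normalised term frequencies that are then saturated); the field names, weights, and the parameters k1 and b are assumed values, not those of the cited system.

```python
import math


def bm25f_score(query_terms, entity, field_weights, avg_len, doc_freq, n_entities,
                k1=1.2, b=0.75):
    """Score one entity (dict: field -> token list) against a keyword query."""
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = entity.get(field, [])
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Per-field length normalisation, then field weighting.
            norm = 1.0 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_entities - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)   # saturating combination
    return score


# Hypothetical two-entity index.
e1 = {"label": ["tim", "berners", "lee"], "description": ["inventor", "of", "the", "web"]}
e2 = {"label": ["world", "wide", "web"], "description": ["invented", "by", "tim", "berners", "lee"]}
weights = {"label": 3.0, "description": 1.0}
avg_len = {"label": 3.0, "description": 4.5}
doc_freq = {"tim": 2, "berners": 2, "lee": 2}
query = ["tim", "berners", "lee"]
s1 = bm25f_score(query, e1, weights, avg_len, doc_freq, n_entities=2)
s2 = bm25f_score(query, e2, weights, avg_len, doc_freq, n_entities=2)
print(s1 > s2)
```

Because the label field carries a higher weight, a match in the label contributes more than the same match in the description.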
Large dataset/crawl
e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory
entities related to <Tim Berners Lee>
BTC2014
DyLDO
23. Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th
International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
(II) Online processing (retrieval)
1. Retrieval & expansion:
a) BM25F results
b) expansion from clusters (related entities)
2. Re-Ranking
(context terms & query type affinity)
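The online phase can be sketched as below: BM25F hits are expanded with entities from the same offline cluster, then re-ranked. This is a deliberately simplified illustration; the re-ranking score (label term overlap plus a fixed boost for preferred types) is an assumption standing in for the context-term and query-type-affinity features, and all identifiers and data are hypothetical.

```python
def expand_with_clusters(bm25f_hits, cluster_of, members_of):
    """Add entities that share an offline cluster with any BM25F hit."""
    expanded = list(bm25f_hits)
    seen = set(bm25f_hits)
    for entity in bm25f_hits:
        for mate in members_of.get(cluster_of.get(entity), []):
            if mate not in seen:
                seen.add(mate)
                expanded.append(mate)
    return expanded


def rerank(candidates, query_terms, labels, types, preferred_types, type_boost=0.5):
    """Order candidates by label overlap with the query plus a type-affinity boost."""
    def score(entity):
        overlap = len(set(query_terms) & set(labels[entity].lower().split()))
        affinity = type_boost if types[entity] in preferred_types else 0.0
        return overlap + affinity
    return sorted(candidates, key=score, reverse=True)


# Hypothetical index: e1 and e2 were clustered together offline, e3 was not.
cluster_of = {"e1": "c1", "e2": "c1", "e3": "c2"}
members_of = {"c1": ["e1", "e2"], "c2": ["e3"]}
labels = {"e1": "Tim Berners-Lee", "e2": "Sir Tim Berners Lee", "e3": "Lee Valley"}
types = {"e1": "Person", "e2": "Person", "e3": "Place"}

expanded = expand_with_clusters(["e1"], cluster_of, members_of)
ranked = rerank(expanded, ["tim", "berners", "lee"], labels, types, {"Person"})
print(expanded, ranked)
```

The expansion step is what compensates for missing owl:sameAs links: e2 is never returned by the keyword hit list here, yet it is reachable through the cluster.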
24. Entity retrieval: evaluation
Dataset
BTC2014 (4 billion entities)
92 SemSearch queries
Methods
Our approaches: XM: X-means, SP: Spectral
Baselines: B: BM25F, S1: Tonon et al. [SIGIR12]
Conclusions
XM & SP outperform baselines
Clustering to remedy link sparsity
(yet extensive offline processing required)
Relevance to query more important than
relevance to BM25F results
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th
International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
25. PROFILES2017 - Profiling & search of Linked Data
https://profiles2017.wordpress.com/
• Probably co-located with ISWC2017 (Vienna)
• Submissions due 21 June
26. Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
Dataset recommendation
Dataset profiling
Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
Web markup as emerging data source
Case studies
Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of structured data on the Web ("Future")?
Dealing with heterogeneity & shortcomings ("Present")
27. Web semantics & entity-centric Web data
Linked Data: approx. 1000+ datasets & 100 billion statements
Open Data: XXX datasets
Web (of documents):
approx. 46.000.000.000.000 (46 trillion)
Web pages indexed by Google
Other forms of Web semantics
and entity-centric knowledge?
Dynamics?
Quality?
Accessibility?
Scale?
28. Embedded Web page markup & schema.org
Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
“Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (3.2 billion pages): 44 billion RDF quads (2016)
• Markup in 38% of pages in 2016
Same order of magnitude as “the Web” (!)
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
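How Microdata turns into flat statements like the ones above can be sketched with Python's standard `html.parser`. This is a minimal illustration that only handles flat, non-nested `itemscope` blocks with text-valued `itemprop` elements; it is not a conformant Microdata processor, and the `nodeN` identifiers are made up for the example.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (node, property, value) statements from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._count = 0
        self._node = None      # id of the item currently being read
        self._pending = None   # itemprop waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._count += 1
            self._node = f"node{self._count}"
            if attrs.get("itemtype"):
                self.triples.append((self._node, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._node:
            self._pending = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.triples.append((self._node, self._pending, text))
            self._pending = None


html_doc = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""
extractor = MicrodataExtractor()
extractor.feed(html_doc)
for triple in extractor.triples:
    print(triple)
```

Note that every value comes out as a plain string, which previews the "strings, not things" issue discussed later in the talk.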
http://webdatacommons.org
29. Example: embedded Web markup about "products"
schema:Product instances in WDC2015
Facts: 1.414.937.431
(= 302.246.120 instances, i.e. products)
Providers (distinct Pay Level Domains, PLDs): 93.705
Power law distribution of terms across PLDs
Top provider? (company)
Top 10 PLDs
PLD  # Resources
www.crateandbarrel.com  33.517.936
www.bentgate.com  17.215.499
www.aliexpress.com  9.621.943
www.ebay.com.au  8.861.308
us.fotolia.com  7.939.982
www.ebay.co.uk  6.556.820
www.competitivecyclist.com  6.214.500
www.maxstudio.com  6.075.626
approx. 35 million resources
30. Example: markup of bibliographic resources
[Figure: # entities and # statements per PLD (ranked); count on log scale]
Study on sample Web crawl (WDC2015)
Metadata about scholarly articles (e.g.
s:ScholarlyArticle): 6.793.764 quads, 1.184.623
entities, 429 distinct predicates
(in WDC and for 1 type alone)
Top 5 domains: Springer, MDPI, BMJ,
mendeley.com, Biodiversitylibrary.org
Domains, topics, disciplines?
Life Sciences and Computer Science predominant
Top-10 article titles
Noise
Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S.,
Analysing Structured Scholarly Data embedded in Web
Pages, SAVE-SD2016, co-located with WWW2016
31. Example: markup of learning resources on the Web
“Learning Resources Metadata Initiative (LRMI)”:
schema.org vocabulary for annotation of learning
resources
Developed through DCMI Task Force on LRMI
Approx. 5000 PLDs (incl. subdomains) in CC
LRMI adoption (WDC) [WWW17]:
2015: 44,108,511 quads
2014: 30,599,024 quads
2013: 10,636,873 quads
Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and
Improving embedded Markup of Learning Resources on the
Web, 26th International World Wide Web Conference
(WWW2017), Digital Learning track, Perth, April 2017.
32. Example: markup of learning resources on the Web
Frequent errors and unintended use (e.g. porn)
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
sunriseseniorliving.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
33. Entity retrieval on Web markup: state of the art
Glimmer
(http://glimmer.research.yahoo.com)
Entity retrieval on WDC dataset
[Blanco, Mika & Vigna, ISWC2011]
BM25F retrieval model on WDC index
34. Web markup: challenges
Using markup as knowledge graph, similar to Linked Data?
Characteristics: Example
Coreferences: 18.000 results for <"Iphone 6", type, s:Product> (8,6 quads on average) in Common Crawl
Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in CC
Lack of links: Largely unlinked entity descriptions
Errors (typos & schema violations, see Meusel et al. [ESWC2015]):
o Wrong namespaces, such as http://schma.org
o Undefined types & predicates: 9,7%, less common than in LOD
o Confusion of datatype and object properties: <s1, s:publisher, "Springer">, 24,35% object property issues vs 8% in LOD
o Data property range violations: e.g. literals vs numbers (12,6% vs 4,6% in LOD)
A Survey on Challenges for Entity Retrieval in Markup
Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th
International Semantic Web Conference (ISWC2016),
Kobe, Japan (2016).
“Strings, not things”
Bias towards datatype properties / using any
property as such (!)
Numbers from LRMI2015 markup corpus:
o 46 million “transversal” quads (i.e. excluding
hierarchical statements such as rdf:type)
o 64% are actual datatype properties, yet 97%
refer to literals (up from 70% in 2013)
Challenges
o Markup data = flat entity descriptions
(=> fairly unconnected graph)
o Data reuse requires identity resolution
35. Entity retrieval & reconciliation on markup
Obtaining a consolidated & verified entity description/facts (or
graph) for a given resource/entity from Web markup?
Aiding tasks such as document annotation, augmentation
or enrichment of existing data- or knowledge bases/graphs
Query
iPhone 6, type:(Product)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
<e1, s:name, „Iphone 6“>
<e2, s:brand, „Apple Inc.“>
<e3, s:brand, „Apple“> <e4, s:weight, 127>
<e5, s:releaseDate, „1.12.1972“>
Web (crawl)
(e.g. Common Crawl/WDC, focused crawl)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O.,
Ritze, D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
36. FuseM: query-centric data fusion on Web markup
Entity matching: BM25 entity retrieval model on markup index (Common Crawl) & similarity-based matching
Data fusion: ML classifier (SVM, kNN, RandomForest), 3 feature categories (relevance, authority, clustering)
1. Matching
2. Fact selection
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
(supervised SVM classifier)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
Query
iPhone 6, type:(Product)
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
Web page
markup
Web (crawl)
approx. 125.000 facts for „iPhone6“
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
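To make the fact selection step concrete, here is a naive baseline sketch. It is explicitly not FuseM: the supervised classifier and its relevance, authority and clustering features are replaced by simple majority voting per predicate, and the candidate facts are invented for illustration.

```python
from collections import Counter, defaultdict


def fuse_by_majority(candidate_facts):
    """Pick the most frequent value per predicate from (node, predicate, value) facts."""
    votes = defaultdict(Counter)
    for _node, predicate, value in candidate_facts:
        votes[predicate][value] += 1
    return {predicate: counts.most_common(1)[0][0] for predicate, counts in votes.items()}


# Hypothetical candidate facts retrieved for the query "iPhone 6, type:(Product)".
candidates = [
    ("node1", "brand", "Apple Inc."),
    ("node1", "weight", "129"),
    ("node2", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "manufacturer", "Foxconn"),
    ("node4", "brand", "Apple Inc."),
    ("node4", "weight", "129"),
    ("node5", "brand", "Apple"),
]
fused = fuse_by_majority(candidates)
print(fused)
```

Majority voting ignores source quality and near-duplicate values such as "Apple" versus "Apple Inc.", which is the kind of weakness a learned fusion model is meant to address.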
38. Evaluation & results: data fusion performance
Setup
Dataset: Products, Movies, Books
(approx. 3 billion facts) from Common
Crawl / WDC
Baselines:
BM25: top-k diverse facts via BM25
(Glimmer)
CBFS: clustering-based approach
[ESWC2015]
PreRecCorr: “Fusing data with
correlations” [Pochampally et al.,
ACM SIGMOD 2014]
10-fold cross validation
Results
FuseM beats baselines in both tasks
(strong variance of baselines across
tasks)
All feature categories contribute
Query-centric data fusion (precision)
Query-independent data fusion (P/R/F1)
39. Results: example of fused entity description
Data fusion result for book „Brideshead Revisited“ (20 distinct facts)
New facts (compared to DBpedia):
• 60% - 70% of all facts for books & movies
new (across all KBs)
• 100% new for products
(„long tail entities“ not existing in KBs yet)
New facts and attributes
40. Results: KB augmentation
Augmentation of 15 properties of
books (& movies) in three KBs
DB: DBpedia
FB: Freebase
WD: Wikidata
Augmentation performance: % of filled
slots (or „knowledge gaps“) in KB
Performance varies heavily (yet some
attributes completed to 100%)
KBA result for entities of type „Book“
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O.,
Ritze, D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
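The augmentation measure ("% of filled slots") can be written down as a short sketch. The data layout (entity id mapped to a property dictionary, with `None` marking a gap) and all values are assumptions made for illustration.

```python
def augmentation_rate(kb_entities, fused_facts, properties):
    """Share of empty KB slots ("knowledge gaps") that the fused markup facts can fill."""
    gaps = 0
    filled = 0
    for entity_id, description in kb_entities.items():
        for prop in properties:
            if description.get(prop) is None:
                gaps += 1
                if fused_facts.get(entity_id, {}).get(prop) is not None:
                    filled += 1
    return filled / gaps if gaps else 0.0


# Hypothetical KB with placeholder values; None marks a missing fact.
kb = {
    "book1": {"author": "author-a", "isbn": None},
    "book2": {"author": None, "isbn": None},
}
fused = {
    "book1": {"isbn": "isbn-placeholder"},
    "book2": {"author": "author-b"},
}
rate = augmentation_rate(kb, fused, ["author", "isbn"])
print(f"{rate:.0%} of gaps filled")
```

Computing the rate per property instead of over all properties would expose the strong per-attribute variation mentioned on the slide.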
41. Conclusions & outlook
Linked Data & knowledge graphs
Retrieval/search of Linked Data hindered by
heterogeneity, quality, dynamics etc
Dealing with diversity & heterogeneity
o Profiling & recommendation: dataset search &
recommendation
o Entity retrieval & clustering: entity search
42. Conclusions & outlook
Entity
node1 name Molecular structure of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
Embedded data/markup/tables
Unstructured (Web) data/docs
Linked Data & knowledge graphs
New forms of (structured) Web data:
Web markup (schema.org et al.) & tables
o Convergence of structured and unstructured Web
(e.g. Voldemort KG, Tonon et al., ISWC2016)
o Scale and dynamics (!)
o Potential to augment existing knowledge graphs
(e.g. Google KG or Microsoft Satori)
o Potential training data for NED, entity interlinking
and other entity-centric tasks (e.g. OKE Challenge)
43. Contact & resources
@stefandietze
http://stefandietze.net
More on Web markup: talk on
Wednesday, 11:00, WWW2017/Digital
Learning track
Editor's notes
Definition 2.1. ith-Order Value Inconsistency (Dx,
Dy, P) between the pair of datasets Dx, Dy with respect to
the ith-order single-value property P is the proportion of the
equivalent entities in Dx and Dy that have contradicting values
in P.
Definition 2.2. ith-Order Value Incompleteness (Dx,
Dy, P) between the pair of datasets Dx, Dy with respect to an
ith-order multi-value property P is the proportion of entities
in Dx and Dy that have different values in P.
ISSUES: different can mean incorrectness as well as incompleteness
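Definition 2.1 can be turned into a small worked sketch. One assumption is made loudly here: only equivalent pairs where both datasets provide a value are compared, which the definition itself does not specify. The datasets, the equivalence links and the property name are hypothetical.

```python
def value_inconsistency(dx, dy, same_as, prop):
    """Proportion of equivalent entity pairs with contradicting values for `prop`.

    Assumption: pairs where either side lacks a value are skipped, since a
    missing value is incompleteness rather than a contradiction.
    """
    compared = 0
    contradicting = 0
    for ex, ey in same_as:
        vx = dx.get(ex, {}).get(prop)
        vy = dy.get(ey, {}).get(prop)
        if vx is None or vy is None:
            continue
        compared += 1
        if vx != vy:
            contradicting += 1
    return contradicting / compared if compared else 0.0


# Hypothetical movie descriptions in two datasets with known equivalences.
dx = {"m1": {"year": 1994}, "m2": {"year": 1999}, "m3": {}}
dy = {"f1": {"year": 1994}, "f2": {"year": 1998}, "f3": {"year": 2001}}
links = [("m1", "f1"), ("m2", "f2"), ("m3", "f3")]
inconsistency = value_inconsistency(dx, dy, links, "year")
print(inconsistency)
```

The pair (m3, f3) illustrates the issue noted above: a difference caused by a missing value says nothing about which dataset is wrong.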
Filtering: identify the cluster of datasets that are similar to Ds (two metrics: LSA-based and WordNet-based), with threshold theta
Ranking: cosine similarity between profiles
Experimentally better results than using the ranks from the filtering step
Evaluation: MAP for different similarity thresholds (theta) from the filtering step
When explaining:
- Why does MAP decrease with higher similarity thresholds?
"For the given intervals [0, 0.7], [0, 0.8] and [0, 0.9], with respect to the used measures, we have 100% recall": all datasets considered as true are present in the recommendation list.
Random Sampling: randomly selects resource instances from $R_i \in D_i$ for further analysis in the profiling pipeline.
Weighted Sampling: weighs each resource as the ratio of the number of datatype properties used to define a resource over the maximum number of datatype properties over all resources for a specific dataset. The weight for $r_k$ is computed by $w_k = |f(r_k)| / \max\{|f(r_j)|\}$ $(r_j \in R_i \mid j = 1, \ldots, n)$, where $f(r_k)$ represents the datatype properties of resource $r_k$. An instance is included in a sample if, for a randomly generated number $p$ from a uniform distribution, the weight $w_k$ satisfies $w_k > (1 - p)$. Such a strategy ensures that resources that carry more information (having more literal values) have higher chances of being included earlier at low cut-offs of analysed samples.
Fig. 1. Processing pipeline for generating structured profiles of Linked Data graphs.
Resource Centrality Sampling: weighs each resource as the ratio of the number of resource types used to describe a particular resource ($V'_k \subseteq V_k$) divided by the total number of resource types in a dataset. The weight is defined by $c_k = |C'_k| / |C|$ with $C'_k = C \cap V'_k$. Similarly to 'weighted sampling', for a randomly generated number $p$, $r_k$ is included in the sample if $c_k > (1 - p)$. The main motivation behind computing the centrality of a resource is that important concepts in a dataset tend to be more structured and linked to other concepts.
The underlying assumption is that very specific and targeted seed lists will require different crawling and relevance computation methods than very broad and unspecific seed lists.
http://www.visualdataweb.org/relfinder/demo.swf?obj1=TWluaW9ucyAoZmlsbSl8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL01pbmlvbnNfKGZpbG0p&obj2=U2FuZHJhIEJ1bGxvY2t8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL1NhbmRyYV9CdWxsb2Nr&obj3=Sm9uIEhhbW18aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0pvbl9IYW1t&obj4=TWljaGFlbCBLZWF0b258aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL01pY2hhZWxfS2VhdG9u&obj5=QWxsaXNvbiBKYW5uZXl8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0FsbGlzb25fSmFubmV5&obj6=RGVzcGljYWJsZSBNZSAyfGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9EZXNwaWNhYmxlX01lXzI=&obj7=U3RldmUgQ29vZ2FufGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9TdGV2ZV9Db29nYW4=&obj8=R2VvZmZyZXkgUnVzaHxodHRwOi8vZGJwZWRpYS5vcmcvcmVzb3VyY2UvR2VvZmZyZXlfUnVzaA==&name=REJwZWRpYSAobWlycm9yKQ==&abbreviation=ZGJw&description=TGlua2VkIERhdGEgdmVyc2lvbiBvZiBXaWtpcGVkaWEu&endpointURI=aHR0cDovL2RicGVkaWEuaW50ZXJhY3RpdmVzeXN0ZW1zLmluZm8=&dontAppendSPARQL=ZmFsc2U=&defaultGraphURI=aHR0cDovL2RicGVkaWEub3Jn&isVirtuoso=dHJ1ZQ==&useProxy=ZmFsc2U=&method=UE9TVA==&autocompleteLanguage=ZW4=&autocompleteURIs=aHR0cDovL3d3dy53My5vcmcvMjAwMC8wMS9yZGYtc2NoZW1hI2xhYmVs&ignoredProperties=aHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zI3R5cGUsaHR0cDovL3d3dy53My5vcmcvMjAwNC8wMi9za29zL2NvcmUjc3ViamVjdCxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd2lraVBhZ2VVc2VzVGVtcGxhdGUsaHR0cDovL2RicGVkaWEub3JnL3Byb3BlcnR5L3dvcmRuZXRfdHlwZSxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd2lraWxpbmssaHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3dpa2lQYWdlV2lraUxpbmssaHR0cDovL3d3dy53My5vcmcvMjAwMi8wNy9vd2wjc2FtZUFzLGh0dHA6Ly9wdXJsLm9yZy9kYy90ZXJtcy9zdWJqZWN0&abstractURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L2Fic3RyYWN0&imageURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3RodW1ibmFpbCxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2RlcGljdGlvbg==&linkURIs=aHR0cDovL3B1cmwub3JnL29udG9sb2d5L21vL3dpa2lwZWRpYSxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2hvbWVwYWdlLGh0dHA6Ly94bWxucy5jb20vZm9hZi8wLjEvcGFnZQ==&maxRelationLegth=Mg==
As shown in our experimental evaluation, specific entities within a seed list strongly reflect the crawl intent.
In the seed list {Pulp Fiction, Film, Entertainment}, the most specific entity, 'Pulp Fiction', reflects the most specific crawl intent, whereas the entities 'Film' and 'Entertainment' provide contextual information, namely that 'Pulp Fiction' is a movie.
Motivated by this, we assume that the relevance of specific candidate entities is dependent on the seed entity they are related to. For example, candidate entities similar to entity 'Pulp Fiction' will be ranked higher than entities that are similar to other seed entities.
The average improvement across different NDCG levels is 1.6% on depth 2 and 4.3%
on depth 3, suggesting a positive effect of the attrition factor for the cases of our seed
lists. On the other hand, the coherence of the seed list appears to have no significant
impact on the suitability of particular configuration.
Given the significantly increased runtime
when crawling beyond hop 2, a crawl depth of 2 seems to provide optimal efficiency,
and it is not advisable to crawl to a higher distance.
This is due to the fact that high coherence seed lists have a more specific crawl intent, leading to narrow and often small result sets, and hence also a limited ground truth, while the low coherence lists have a much broader crawl intent as well as relevant entity set. This is reflected in our ground truth: the average number of entities labeled as related (score ≥ 3) is 208 for low coherence seed lists and 145 for high coherence seed lists. Meanwhile, the narrow search intent also causes more disagreement among crowdsourcing workers when generating the ground truth, which makes the results for high coherence seed lists less consensual.
Another difficulty faced when evaluating the crawling task is the highly heterogeneous and varied nature of the possible result sets, originating from a highly heterogeneous Linked Data graph.