Towards Flexible Indices for Distributed Graph Data: The Formal Schema-level Index Model FLuID

www.moving-project.eu
TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation
Till Blume and Ansgar Scherp
ZBW – Leibniz Information Centre for Economics
Christian-Albrechts-Universität zu Kiel
May 23rd, 2018, 30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
22.05.2018 - 25.05.2018, Wuppertal, Germany.

2 of 17
Why use a Schema-level Index?
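[Figure: (1) A user looking for more metadata ("I want more metadata! Where to get it from?") sends a structural query using bibo:Book, foaf:Agent, dct:creator, and dct:subject to the index. (2) The index points to a database with matching data: "Towards a clean air policy" and "Great Britain. Central Electricity", with the nodes URI-1, URI-2, and URI-3 described by bibo:Book, foaf:Agent, and dct:subject.]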
Problem:
• We are looking for a specific kind of metadata, e.g., about books.
• We do not know in which databases we can find such metadata.
• We need an index that can be queried to find matching databases.
Solution:
• A schema-level index (SLI) summarizes data by storing information about how the data is
modelled in a specific database.
• We formulate a structural query to find matching databases.
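The lookup idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that an SLI maps a schema structure (here simply a set of types and properties) to the databases that use it. The source names and the `find_sources` helper are hypothetical.

```python
# Toy schema-level index: maps a schema structure to the data sources using it.
# The source names and the index layout are illustrative assumptions.
from typing import Dict, FrozenSet, Set

SchemaKey = FrozenSet[str]

index: Dict[SchemaKey, Set[str]] = {
    frozenset({"bibo:Book", "dct:creator", "dct:subject"}): {"source-a.example"},
    frozenset({"bibo:Proceedings", "dct:subject"}): {"source-b.example"},
}


def find_sources(query: Set[str]) -> Set[str]:
    """Return all sources whose schema structure contains every queried term."""
    hits: Set[str] = set()
    for schema, sources in index.items():
        if query <= schema:  # the structural query is a subset of the schema
            hits |= sources
    return hits


# Structural query: "books that have a creator"
book_sources = find_sources({"bibo:Book", "dct:creator"})
```

The query never touches the data itself; it only compares schema structures, which is what makes the index small compared to the indexed databases.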

3 of 17
Real World Application Scenario
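[Figure: numbered steps 1 to 3 connecting the MOVING platform, the index, and the databases. The structural query uses foaf:Agent, bibo:Book, dct:subject, and dct:creator; the matching database contains the nodes URI-1, URI-2, and URI-3 described by foaf:Agent, bibo:Book, and dct:subject.]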
MOVING search scenario:
• The MOVING platform¹ provides a search for bibliographic resources.
• We harvest bibliographic metadata using different SLIs.
• Such metadata is of great value since
  • We can obtain good search results solely relying on the title [3].
  • We can complement existing metadata.
  • We can train machine learning models to further improve the search [4].
¹ http://platform.moving-project.eu

4 of 17
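MOVING search scenario:
• Which SLIs are best suited to find bibliographic metadata in the Web of Data?
• Can we find semantically similar databases as well?
[Figure: numbered steps 1 to 3 connecting the MOVING platform, the index, and the databases. The structural query uses foaf:Agent, bibo:Book, dct:subject, and dct:creator. Besides the database with the nodes URI-1, URI-2, and URI-3 (foaf:Agent, bibo:Book, dct:subject), a second database contains "Proceedings of the …" and "Benjamin Elizalde", with the nodes URI-6, URI-8, and URI-9 described by bibo:Proceedings, foaf:Agent, and dct:subject.]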

5 of 17
Motivation for FLuID
• All schema-level indices (SLIs) summarize data differently, for different purposes, and lack a common formalization [1,2,5,7-11], for example:
  • Consider incoming and outgoing properties (edges)
  • Consider properties (edge label) and objects (target node)
  • Consider types
  • Consider types and properties
  • …
• Without a common ground, it is difficult to develop new indices and compare them to existing ones.
• Even for a single application scenario, a single SLI may not be sufficient, since how the data is modelled can vary a lot [6].
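To make the differences concrete, the following Python sketch computes three of the summaries listed above for the same instance. The example triples and the function names are assumptions made for illustration; in particular, treating rdf:type edges as "types" and all other edges as "properties" is one possible reading, not a definition taken from the cited SLIs.

```python
# Three ways to summarize the same instance (all names are illustrative).
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"

# Hypothetical instance: all edges share the subject URI-1.
instance: List[Triple] = [
    ("URI-1", "rdf:type", "bibo:Book"),
    ("URI-1", "dct:creator", "URI-2"),
    ("URI-1", "dct:subject", "URI-3"),
]


def by_types(edges: List[Triple]) -> FrozenSet[str]:
    """Summary that considers types only."""
    return frozenset(o for _, p, o in edges if p == RDF_TYPE)


def by_types_and_properties(edges: List[Triple]) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Summary that considers types and (non-type) properties."""
    props = frozenset(p for _, p, _ in edges if p != RDF_TYPE)
    return by_types(edges), props


def by_property_object(edges: List[Triple]) -> FrozenSet[Tuple[str, str]]:
    """Summary that considers each edge label together with its target node."""
    return frozenset((p, o) for _, p, o in edges)


type_key = by_types(instance)
type_property_key = by_types_and_properties(instance)
property_object_key = by_property_object(instance)
```

Two instances that fall into the same group under `by_types` can end up in different groups under `by_property_object`, which is why indices built on different summaries are hard to compare without a common model.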

6 of 17
The FLuID Model
Approach
• Abstract from the related work (bottom-up): Find generic, simple patterns in existing SLIs and use them as basic building blocks to define all (complex) schema structures that exist in previous SLIs.
• MOVING search scenario (top-down): Flexibly define indices that can reflect semantic information and can be efficiently computed.
Solution
1. We formalized our building blocks using equivalence relations over directed, edge-labeled multigraphs (RDF graphs).
2. We demonstrated how to model existing works and beyond.
3. We showed the scalability by conducting a complexity analysis.

7 of 17
The FLuID Model
• FLuID provides 7 schema elements:
  • 3 simple elements: Object Cluster (OC), Property Cluster (PC), and Property-Object Cluster (POC)
  • 3 undirected elements: u-OC, u-PC, and u-POC
  • 1 Complex Schema Element (CSE)
• FLuID provides 4 parameterizations:
  • Label parameterization
  • Chaining parameterization
  • Ontology parameterization
  • Instance parameterization
• In total, FLuID provides 11 building blocks, sufficient to model all existing approaches and beyond.

8 of 17
FLuID: Equivalence Relation Approach
• Instances: sets of edges <s,p,o> with the same subject node s, i.e.,
((i1, p1, o1), (i2, p2, o2)) ∈ I ⇔ i1 = i2.
• Each edge belongs to exactly one instance; nodes do not necessarily.
• Since instances partition the data graph, every grouping of instances into sets also partitions the data graph.
[Figure: example data graph with the nodes i1 to i10, connected by edges labeled p1, p2, and p3.]
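The instance relation can be sketched as a simple grouping of triples by subject. The example edges below are hypothetical (the exact edges of the figure are not recoverable from the slide text), and `group_instances` is an illustrative name.

```python
# Group triples into instances: all edges with the same subject node.
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example graph using the node and edge labels of the figure.
triples: List[Triple] = [
    ("i1", "p1", "i4"),
    ("i1", "p2", "i5"),
    ("i2", "p1", "i4"),
    ("i3", "p3", "i1"),
]


def group_instances(edges: List[Triple]) -> Dict[str, List[Triple]]:
    """Map each subject node to the list of its outgoing edges."""
    grouped: Dict[str, List[Triple]] = defaultdict(list)
    for s, p, o in edges:
        grouped[s].append((s, p, o))
    return dict(grouped)


instances = group_instances(triples)
```

Every edge lands in exactly one group, while a node such as i1 or i4 can occur in several groups (as a subject in one, as an object in others).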

9 of 17
The FLuID Model
• Object Cluster: summarize instances that share a set of connected objects, i.e.,
([i1]_I, [i2]_I) ∈ OC ⇔ ∀(i1, p1, o1) ∃(i2, p2, o2): o1 = o2 ∧
∀(i2, p2, o2) ∃(i1, p1, o1): o1 = o2
[Figure: example data graph with the nodes i1 to i10, connected by edges labeled p1, p2, and p3.]
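The two-sided condition in the definition amounts to saying that both instances have the same set of objects. A minimal Python sketch under that reading, with hypothetical example edges and an illustrative function name:

```python
# Object Cluster: instances are equivalent iff their sets of objects are equal.
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example graph; the properties are ignored by the Object Cluster.
triples: List[Triple] = [
    ("i1", "p1", "i4"),
    ("i1", "p2", "i5"),
    ("i2", "p2", "i4"),
    ("i2", "p3", "i5"),
    ("i3", "p1", "i4"),
]


def object_clusters(edges: List[Triple]) -> Dict[FrozenSet[str], Set[str]]:
    """Group subjects by the set of objects they are connected to."""
    objects: Dict[str, Set[str]] = defaultdict(set)
    for s, _, o in edges:
        objects[s].add(o)
    clusters: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for s, objs in objects.items():
        clusters[frozenset(objs)].add(s)
    return dict(clusters)


clusters = object_clusters(triples)
```

Here i1 and i2 end up in the same cluster although they use different properties, because only the target nodes are compared.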

10 of 17
The FLuID Model
• Label Parameterized Object Cluster: summarize instances that share the same set of connected objects, considering only edges whose property is p1
[Figure: example data graph with the nodes i1 to i10, connected by edges labeled p1, p2, and p3.]

11 of 17
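The FLuID Model
• Label Parameterized Object Cluster: summarize instances that share the same set of connected objects, considering only edges whose property is rdf:type
[Figure: example data graph with the nodes i1 to i10, connected by edges labeled p2, p3, and rdf:type, with the types bibo:Book, foaf:Agent, and bibo:Proceedings.]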
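A label-parameterized Object Cluster can be sketched by filtering edges before comparing object sets; with the label rdf:type this groups instances by their set of types. The example edges and the function name are assumptions for illustration. The sketch also assumes that instances without any matching edge share the empty object set and therefore fall into one common cluster.

```python
# Label-parameterized Object Cluster: compare objects of selected properties only.
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example graph in the spirit of the figure.
triples: List[Triple] = [
    ("i1", "rdf:type", "bibo:Book"),
    ("i1", "p2", "i4"),
    ("i2", "rdf:type", "bibo:Book"),
    ("i2", "p2", "i5"),
    ("i3", "rdf:type", "foaf:Agent"),
    ("i4", "p3", "i6"),
]


def label_parameterized_oc(
    edges: List[Triple], labels: Set[str]
) -> Dict[FrozenSet[str], Set[str]]:
    """Group subjects by the objects reached via properties in `labels`."""
    objects: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in edges:
        selected = objects[s]  # registers the subject even if no edge matches
        if p in labels:
            selected.add(o)
    clusters: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for s, objs in objects.items():
        clusters[frozenset(objs)].add(s)
    return dict(clusters)


type_clusters = label_parameterized_oc(triples, {"rdf:type"})
```

The instances i1 and i2 are equivalent because both are typed bibo:Book; their differing p2 targets are ignored by the parameterization.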

12 of 17
The FLuID Model
• Ontology parameterization: RDFS Schema Graph
[Figure: example data graph with the nodes i1 to i10, connected by edges labeled p2, p3, and rdf:type, with the types bibo:Book, foaf:Agent, and bibo:Proceedings.]

13 of 17
The FLuID Model
• Ontology parameterization: RDFS Schema Graph
• Instance parameterization: owl:sameAs
[Figure: example data graph with the nodes i1 to i10, connected by edges labeled dct:creator, rdf:type, and owl:sameAs, with the types bibo:Book, foaf:Agent, and bibo:Proceedings.]
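The slides only name the two parameterizations, so the following Python sketch is one plausible reading rather than the FLuID definition: the ontology parameterization is approximated by adding the rdf:type edges implied by rdfs:subClassOf, and the instance parameterization by merging subjects that are linked via owl:sameAs. The subclass relation, the example triples, and both function names are assumptions.

```python
# Sketch of RDFS type inference and owl:sameAs merging (assumed semantics).
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def infer_types(edges: List[Triple], subclass_of: Dict[str, Set[str]]) -> List[Triple]:
    """Add rdf:type edges implied (transitively) by the subclass relation."""
    result = list(edges)
    seen = set(edges)
    for s, p, o in edges:
        if p != "rdf:type":
            continue
        stack, visited = [o], {o}
        while stack:
            for sup in subclass_of.get(stack.pop(), ()):
                if sup not in visited:
                    visited.add(sup)
                    stack.append(sup)
                    inferred = (s, "rdf:type", sup)
                    if inferred not in seen:
                        seen.add(inferred)
                        result.append(inferred)
    return result


def merge_same_as(edges: List[Triple]) -> List[Triple]:
    """Rewrite subjects linked by owl:sameAs to one representative (union-find)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in edges:
        if p == "owl:sameAs":
            a, b = find(s), find(o)
            if a != b:
                parent[max(a, b)] = min(a, b)  # smaller name becomes representative
    # Only subjects are rewritten; the owl:sameAs edges themselves are dropped.
    return [(find(s), p, o) for s, p, o in edges if p != "owl:sameAs"]


# Assumed for illustration: bibo:Proceedings is a subclass of bibo:Book.
schema = {"bibo:Proceedings": {"bibo:Book"}}
data: List[Triple] = [
    ("URI-9", "rdf:type", "bibo:Proceedings"),
    ("URI-4", "owl:sameAs", "URI-5"),
    ("URI-5", "rdf:type", "foaf:Agent"),
]
with_types = infer_types(data, schema)
merged = merge_same_as(data)
```

After inference, URI-9 also matches queries for bibo:Book, and after merging, the type of URI-5 is attributed to the combined instance represented by URI-4.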

14 of 17
A Semantic Schema-level Index
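[Figure: a structural query using foaf:Agent, dct:subject, bibo:Book, and dct:creator is sent to the index (steps 1 and 2). The matching data comprises: "Proceedings of the …" and "Benjamin Elizalde" with the nodes URI-6, URI-8, and URI-9 (bibo:Proceedings, foaf:Agent, dct:subject); the nodes URI-1, URI-2, and URI-3 (bibo:Book, foaf:Agent, dct:subject); and "Family planning programmes in Africa" and "Pierre Prader" with the nodes URI-0, URI-3, URI-4, and URI-5 (bibo:Book, dct:creator, dct:subject, foaf:Agent), including an owl:sameAs edge.]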

15 of 17
Evaluation
• Complexity Analysis
  • We show that every SLI modeled with FLuID can be computed in O(n), where n is the number of triples.
  • Threat: on-the-fly inferencing. If the number of RDFS triples grew linearly with the dataset size, the complexity would be quadratic.
• Empirical evaluation to estimate the impact of inferencing
  • We analyzed two real-world datasets from the Web of Data.
    • TimBL-11M: 11 million triples (edges) crawled from one seed URI.
    • DyLDO-127M: 127 million triples (edges) crawled from 95,000 seed URIs.
  • Practical impact of the on-the-fly inferencing: g < 1.001.
  • Thus, we did not find a linear dependency but rather a constant factor.
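The threat can be written down as a small cost model. This is a sketch under stated assumptions, not the analysis from the paper: n is the number of triples, r(n) is a hypothetical bound on the number of RDFS triples consulted while processing one triple, and c is a constant.

```latex
% Assumed cost model: each of the n triples consults at most r(n) RDFS triples.
T(n) = O\bigl(n \cdot r(n)\bigr)
\qquad\text{so}\qquad
r(n) = \Theta(n) \;\Rightarrow\; T(n) = O(n^{2}),
\qquad
r(n) \le c \;\Rightarrow\; T(n) = O(n).
```

Under this model, the empirical finding of a constant factor corresponds to the second case, which keeps the overall computation linear.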

16 of 17
Conclusion & Outlook
• Conclusion
  • We have presented the novel, parameterized schema-level index model FLuID, which is sufficient to express the functionalities of existing SLIs and beyond.
  • We showed that the build-time and space complexity of any SLI developed with FLuID scales linearly with respect to the number of triples indexed.
• Outlook
  • Implementing FLuID in a single computation and query framework
    • https://github.com/t-blume/fluid-framework
    • http://lodatio.informatik.uni-kiel.de/
  • Qualitatively comparing existing and new approaches.

17 of 17
Thank you for your attention!
Any questions?
Project consortium and funding agency
MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092

18 of 17
References
1. F. Benedetti, S. Bergamaschi, and L. Po. Exposing the underlying schema of LOD sources. In Joint IEEE/WIC/ACM WI and IAT, 2015.
2. M. Ciglan, K. Nørvåg, and L. Hluchý. The SemSets model for ad-hoc semantic list search. In WWW, 2012.
3. L. Galke, F. Mai, A. Schelten, D. Brunsch, and A. Scherp. Using titles vs. full-text as source for automated semantic document annotation. In K-CAP, 2017.
4. L. Galke, A. Saleh, and A. Scherp. Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical Information Retrieval. In INFORMATIK, 2017.
5. R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB, 1997.
6. J. Jett, T. Nurmikko-Fuller, T. W. Cole, K. R. Page, and J. S. Downie. Enhancing scholarly use of digital libraries: A comparative survey and review of bibliographic metadata ontologies. In JCDL, 2016.
7. M. Konrath, T. Gottron, S. Staab, and A. Scherp. SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked Data. J. Web Sem., 16:52–58, 2012.
8. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: a database management system for semistructured data. SIGMOD Record, 26(3):54–66, 1997.
9. T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, 2011.
10. J. Schaible, T. Gottron, and A. Scherp. TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data cloud. In ESWC, 2016.
11. B. Spahiu, R. Porrini, M. Palmonari, A. Rula, and A. Maurino. ABSTAT: ontology-driven Linked Data summaries with pattern minimalization. In ESWC Satellite Events, Revised Selected Papers, 2016.

19 of 17
Search Engine Prototype: LODatio+
http://lodatio.informatik.uni-kiel.de

20 of 17
http://platform.moving-project.eu
