Big Linked Data Federation - ExtremeEarth Open Workshop

December 9, 2021
Three Years of the ExtremeEarth Project
Online Workshop
Antonis Troumpoukis (NCSR-D)
Federating Big Linked Geospatial Data

Outline
• Semagrow federated query processor
• KOBE benchmarking engine
• ExtremeEarth Use-cases

Federated query processors
• Systems that seamlessly integrate data from multiple
remote dataset servers.
• Receive a query, issue the necessary subqueries to the
remote servers, combine the results accordingly, and
present the result to the user.
• Used extensively in Linked Data, where many data
providers publish their datasets through SPARQL
endpoints.
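The receive / subquery / combine loop can be sketched in a few lines of Python. This is only a toy illustration: the "endpoints" are in-memory lists of triples, and the names ENDPOINTS, match and federated_join are hypothetical, not part of any real federation engine.

```python
# Toy federation: each "endpoint" is a list of (subject, predicate, object) triples.
# All data and names here are made up for illustration.
ENDPOINTS = {
    "crops": [("field1", "hasCrop", "potato"), ("field2", "hasCrop", "wheat")],
    "admin": [("field1", "locatedIn", "Tyrol"), ("field2", "locatedIn", "Styria")],
}


def match(endpoint, predicate):
    """Subquery: return (subject, object) pairs for one predicate at one endpoint."""
    return [(s, o) for s, p, o in ENDPOINTS[endpoint] if p == predicate]


def federated_join(pred_a, pred_b):
    """Send each triple pattern to every endpoint, then join the results on subject."""
    left = [row for ep in ENDPOINTS for row in match(ep, pred_a)]
    right = [row for ep in ENDPOINTS for row in match(ep, pred_b)]
    # Index the right-hand results by subject so the join is a dictionary lookup.
    right_by_subject = {}
    for s, o in right:
        right_by_subject.setdefault(s, []).append(o)
    return [(s, a, b) for s, a in left for b in right_by_subject.get(s, [])]


result = federated_join("hasCrop", "locatedIn")
```

Note that this naive version sends every pattern to every endpoint; avoiding that is the job of source selection.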

Semagrow federated query processor
• Semagrow is an open source dynamic data integration system:
o makes the most of all public data, regardless of size, update rate, and schema.
o presents to client applications a single, unified SPARQL endpoint that federates multiple data sources.
o manages both syntactic and semantic heterogeneity.
• The federated data sources may serve data that use different vocabularies and codelists
o Semagrow dynamically transforms responses from the different data sources to match the vocabularies used in the
query.
• The federated data sources may offer SPARQL, GeoSPARQL, SQL, or CQL (CassandraQL) APIs.
o Semagrow processes SPARQL queries and appropriately re-writes the sub-queries for each data source.
o Semagrow fills in the missing expressivity, e.g. arbitrary joins for CQL sources

History
• Originally developed in FP7 SemaGrow:
o SPARQL endpoint federation engine [1]
o Multi-threaded application, deployed directly on server or VM
o Dynamic vocabulary mapping
• Extended in H2020 BigDataEurope:
o Containerization and ability to deploy on Cloud infrastructures
o Re-engineered architecture to allow executor plugins that manage syntactic heterogeneity [2]
o Limited support for big linked geospatial data [3]
▪ The federation could include only one geospatial data source
• During ExtremeEarth, we have developed a new version of Semagrow. Now, Semagrow is the first
federation engine to be able to federate multiple geospatial data sources.
[1] A. Charalambidis, A. Troumpoukis, S. Konstantopoulos: SemaGrow: optimizing federated SPARQL queries. In SEMANTICS 2015, Vienna, Austria, September 15-17, 2015
[2] S. Konstantopoulos, A. Charalambidis, G. Mouchakis, A. Troumpoukis, J. Jakobitsch, V. Karkaletsis: Semantic Web Technologies and Big Data Infrastructures: SPARQL Federated
Querying of Heterogeneous Big Data Stores. In ISWC 2016 (Posters & Demos), Kobe, Japan, October 17-21, 2016
[3] A. Davvetas, I. Klampanos, S. Andronopoulos, G. Mouchakis, S. Konstantopoulos, A. Ikonomopoulos, V. Karkaletsis: Big Data Processing and Semantic Web Technologies for Decision
Making in Hazardous Substance Dispersion Emergencies. In ISWC 2017 (Posters, Demos & Industry Tracks), Vienna, Austria, October 21-25, 2017

The Semagrow architecture
• Source selector: identifies which of the federated sources refer to which parts of the query.
• Query planner: constructs an efficient query execution plan.
• Query executor: evaluates the query execution plan and returns the results to the client.

Source Selector
● Identifies which of the federated sources refer to which parts of the query
● Should exclude as many redundant sources as possible, but without removing any necessary sources
● Combines two mechanisms:
○ Thematic data:
■ Extended with all the state-of-the-art source selection methods
● Predicate and Class metadata
● ASK queries (and cache)
● URI-Prefix-based source selection
○ Geospatial data:
■ A novel approach that targets geospatial data sources [4]
● Annotate all federated data sources with a bounding polygon
● Use such a summary to filter out sources that refer to irrelevant areas
[4] A. Troumpoukis, S. Konstantopoulos, N. Prokopaki-Kostopoulou: A Geospatial Source Selector for Federated GeoSPARQL Querying, To be submitted in SWJ.
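As a rough illustration of the thematic mechanisms listed above, the sketch below combines predicate metadata with cached ASK probes. It is not Semagrow code: the function select_sources, the metadata layout and the ask callable (a stand-in for a SPARQL ASK request) are all assumptions made for this example.

```python
def select_sources(predicate, metadata, ask, cache):
    """Return the sources that may answer a triple pattern with this predicate.

    metadata maps each source to the set of predicates it is known to serve,
    or to None when no metadata is available. For sources without metadata,
    `ask` is called once per (source, predicate) and the answer is cached.
    """
    selected = []
    for source, predicates in metadata.items():
        if predicates is not None:
            relevant = predicate in predicates        # answered from metadata
        else:
            key = (source, predicate)
            if key not in cache:                      # probe the source only once
                cache[key] = ask(source, predicate)
            relevant = cache[key]
        if relevant:
            selected.append(source)
    return selected


# Hypothetical federation: two described sources and one without metadata.
metadata = {
    "crops": {"ex:hasCropType", "geo:asWKT"},
    "snow": {"ex:snowCover", "geo:asWKT"},
    "unknown": None,
}
probes = []


def fake_ask(source, predicate):
    """Stand-in for an ASK query; records each probe so caching can be observed."""
    probes.append((source, predicate))
    return predicate == "ex:snowCover"


cache = {}
first = select_sources("ex:snowCover", metadata, fake_ask, cache)
second = select_sources("ex:snowCover", metadata, fake_ask, cache)
```

The second call returns the same sources without issuing a new probe, which is the point of keeping a cache next to the ASK mechanism.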

Geospatial source selection
Method
• Each data source is tagged with a bounding polygon that contains all geometries of the source.
• For each triple pattern of the form ?x geo:asWKT ?y, prune the set of sources obtained by the wrapped
source selectors w.r.t. its relevant geospatial filters and the bounding polygons of those sources.
o Geospatial selections: geospatial filters with one free variable
▪ ?s geo:asWKT ?x.
FILTER (geof:sfIntersects(?x, KNOWN_WKT))
▪ If the bounding polygon of a candidate source for the pattern is disjoint
from KNOWN_WKT, then the source is irrelevant.
o Geospatial joins: geospatial filters with two free variables
▪ ?s1 geo:asWKT ?x.
?s2 geo:asWKT ?y.
FILTER (geof:sfWithin(?x, ?y))
▪ If the bounding polygon of a candidate source for the first pattern is
disjoint from the bounding polygons of all candidate sources for the
second pattern, then the former source is irrelevant.
• We consider standard spatial relations (all apart from disjoint) and within-distance comparisons.
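The two pruning rules can be sketched as follows. To stay self-contained, the sketch replaces bounding polygons with axis-aligned boxes (min_x, min_y, max_x, max_y), which is a simplification of the method described above; the function names and the sample summaries are hypothetical.

```python
def boxes_intersect(a, b):
    """True if two boxes (min_x, min_y, max_x, max_y) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def prune_selection(candidates, summaries, known_box):
    """Geospatial selection: drop sources whose summary is disjoint from the constant."""
    return [s for s in candidates if boxes_intersect(summaries[s], known_box)]


def prune_join(left_candidates, right_candidates, summaries):
    """Geospatial join: keep a left source only if it meets some right candidate."""
    return [
        left
        for left in left_candidates
        if any(boxes_intersect(summaries[left], summaries[r]) for r in right_candidates)
    ]


# Hypothetical summaries for three sources.
summaries = {
    "crops_west": (0, 0, 10, 10),
    "crops_east": (12, 0, 20, 10),
    "snow_north": (0, 8, 5, 15),
}
area_of_interest = (1, 1, 4, 4)

selected = prune_selection(["crops_west", "crops_east"], summaries, area_of_interest)
joined = prune_join(["crops_west", "crops_east"], ["snow_north"], summaries)
```

In both cases only crops_west survives: crops_east lies entirely outside the area of interest and does not meet the snow source either.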

Query Planner
• Constructs an efficient query execution plan
• Uses endpoint statistics and dynamic programming to find an optimal query plan.
• Federated geospatial joins: Bind join strategy with filter-pushdown optimization
o Reduction of the communication cost
o Geospatial functions are evaluated faster in the sources (spatial index)
• Evaluation of complex thematic queries:
o Examples:
▪ Subqueries (inner SELECT queries)
▪ ORDER BY, LIMIT 1
▪ FILTER NOT EXISTS
o Such queries appear in the Use-cases (Data-Validation of Land Usage Data)
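A bind join with filter pushdown can be pictured as generating one remote query per binding of the left-hand side, with the geospatial filter already inside it. The sketch below only builds query strings; bind_join_queries is a hypothetical helper, prefix declarations are omitted, and the real engine may batch bindings differently.

```python
def bind_join_queries(left_bindings, distance_m):
    """Yield (left subject, remote query) pairs for a within-distance bind join.

    left_bindings is a list of (subject, wkt) pairs already retrieved from the
    left-hand source. The bound WKT is substituted for ?x, so the filter can be
    evaluated by the right-hand source instead of by the federator.
    """
    for s1, wkt in left_bindings:
        remote_query = (
            "SELECT ?s2 WHERE { ?s2 geo:asWKT ?y . "
            f'FILTER (geof:distance("{wkt}"^^geo:wktLiteral, ?y, uom:metre) < {distance_m}) }}'
        )
        yield s1, remote_query


queries = list(bind_join_queries([("s1a", "POINT(1 2)"), ("s1b", "POINT(3 4)")], 10))
```

Only the rows that pass the filter travel back over the network, which is where the reduction in communication cost comes from.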

Query Executor
• Evaluates the query execution plan and returns the results to the client.
o provides a mechanism for issuing queries to the remote endpoints
o provides an implementation of all geospatial operators that may appear in the plan
▪ GeoSPARQL, stSPARQL functions
• PostGIS connector:
o Semagrow allows executor plugins for non-SPARQL endpoints
o a plugin that communicates directly with PostGIS databases holding shapefile data that consist
exclusively of geometric shapes.
• Optimization of federated geospatial within-distance joins [5]
o Insert additional geospatial filters in the source queries
o Filter out shapes that are too far away using the spatial index of the source
[5] A. Troumpoukis, S. Konstantopoulos, N. Prokopaki-Kostopoulou: A Geospatial Join Optimization for Federated GeoSPARQL Querying, To be submitted in ESWC2022.

Geospatial join optimization
Method
• Situation: bind join with filter pushdown optimization. Example:
?s1 geo:asWKT ?x .
?s2 geo:asWKT ?y .
FILTER (geof:distance(?x, ?y, uom:metre) < 10).
• Such queries appear frequently in the use cases
• The within-distance operation is computationally expensive: It cannot be
answered from the spatial index.
• Intuition:
o ?x is bound from the left part of the federated join, thus the filter during the query execution phase looks like this:
FILTER (geof:distance(KNOWN_WKT, ?y, uom:metre) < 10).
o To help the federated endpoint evaluate the remote query, we can add the filter
FILTER (geof:sfIntersects(?y, CONSTRUCTED_BOX))
where CONSTRUCTED_BOX is the bounding box of the buffer of size D around KNOWN_WKT, and D is the distance
threshold (10 metres in the example). The source can evaluate this extra filter with its spatial index.
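For a Euclidean buffer, the bounding box of the buffer of size D is simply the envelope of the geometry grown by D on every side. A minimal sketch, assuming plain 2D WKT without a CRS prefix and planar coordinates in the same unit as the distance (real GeoSPARQL data in degrees would need a unit conversion); buffered_bbox_wkt is a hypothetical helper, not a Semagrow function.

```python
import re


def buffered_bbox_wkt(wkt, distance):
    """Return, as a WKT POLYGON, the bounding box of the buffer around `wkt`.

    Assumes 2D coordinates written as plain decimals, so the numbers in the
    WKT string alternate x, y, x, y, ...
    """
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", wkt)]
    xs, ys = numbers[0::2], numbers[1::2]
    # Envelope of the geometry, expanded by the distance threshold.
    min_x, max_x = min(xs) - distance, max(xs) + distance
    min_y, max_y = min(ys) - distance, max(ys) + distance
    return (
        f"POLYGON(({min_x} {min_y}, {max_x} {min_y}, {max_x} {max_y}, "
        f"{min_x} {max_y}, {min_x} {min_y}))"
    )


box = buffered_bbox_wkt("POINT(100 200)", 10)
```

Any shape within distance D of the geometry necessarily intersects this box, so adding the sfIntersects filter never removes a correct answer; it only lets the source discard distant shapes cheaply.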

Semagrow in ExtremeEarth
• Integration within Hopsworks
• Provides an extra layer over the big linked
geospatial data store Strabo2
• Can be used to combine the data stored
in Strabo2 with additional external
geospatial endpoints.

History
• During the benchmarking activities of the FP7 SemaGrow project, we needed a
framework that would help us conduct experiments.
• Originally developed in H2020 BigDataEurope:
o Docker Containerization to abstract from the installation intricacies of each system
• During ExtremeEarth we explored this idea even further…

The KOBE Open Benchmarking Engine
• KOBE is a framework for benchmarking federated query engines.
• Features:
o Automation of the various tasks:
deployment, initialization of dataset servers and federation engines,
experiment execution
o Reproducibility in different environments:
each component in its own Docker container
o Declarative specifications:
formalism that hides from the user the details of provisioning and
orchestrating
o Simulating real-life scenarios:
network delays (dataset server latency limitations)
o Results presentation:
collection of logs and visualization in a WebUI
o Extensibility:
supports the integration of new benchmarks, new federators and
new remote dataset servers

The KOBE Open Benchmarking Engine (cont.)
• Re-engineered KOBE into 3 subsystems (Deployment, Networking, Logging)
• Technologies used: Docker, Kubernetes for orchestration, Istio for simulating delays, EFK stack for logs
• Command line interface for control, Kibana dashboards for viewing the results
[6] C. Kostopoulos, G. Mouchakis, A. Troumpoukis, N. Prokopaki-Kostopoulou, A. Charalambidis, S. Konstantopoulos: KOBE: Cloud-Native Open Benchmarking Engine for
Federated Query Processors. In ESWC 2021: 664-679
[7] C. Kostopoulos, G. Mouchakis, N. Prokopaki-Kostopoulou, A. Troumpoukis, A. Charalambidis, S. Konstantopoulos: KOBE: Cloud-native Open Benchmarking Engine for
Federated Query Processors. In ISWC (Demos/Industry) 2020: 325-330

The KOBE Open Benchmarking Engine (cont.)
• Dataset servers: Virtuoso, Strabo2
• Federation engines: Semagrow, FedX
• Benchmarks: FedBench, LargeRDFBench, OPFBench, Geographica2, GeoFedBench.
• Detailed Documentation (step by step instructions for getting started, using and extending KOBE).
https://semagrow.github.io/kobe/ (publicly available)

Combining Snow-cover data with Crop-type data for Food Security
Datasets
• 3 data layers that cover Austria:
o Administrative, Snow cover, Crop type data
o each layer is partitioned geospatially
• Each dataset contains a single thematic layer and
refers to a specific polygonal area.
• 4.5 million triples, ~4GB of data in N-triples
• 34 GeoSPARQL endpoints.
We envisage that Austrian state governments publish crop datasets
for their own area of responsibility; and a further (different) entity
publishes a snow cover dataset that ignores state boundaries and
publishes its datasets according to a geographical grid.
Example: All snow-covered crops within a specific
area of interest (shown in red) appear only in 2 of the
total 12 datasets.

Combining Snow-cover data with Crop-type data for Food Security
Queries
Q1 municipalities intersecting a given polygon
Q2 snow-covered potato fields intersecting a given polygon
Q3 potato fields within 5 km of snow and intersecting a given polygon
Q4 snow area within 5 km of a given municipality
Q5 potato fields within a given municipality
Q6 snow-covered potato fields within a given municipality
Q7 potato fields within 5 km of snow and within a given municipality

Combining Snow-cover data with Crop-type data for Food Security
Experimental results
Query processing time
      #layers   geo-poly   geo-appr   them
Q1    1          0.200      0.205     0.180
Q2    2          0.985      0.525     0.755
Q3    2          5.245      1.215     1.810
Q4    2          8.785      7.940     9.025
Q5    2          0.605      0.445     0.520
Q6    3         15.535      n/a       n/a
Q7    3         39.670      n/a       n/a
• them uses no geospatial metadata; geo-poly and geo-appr use geospatial metadata, and geo-poly
has more precise boundaries than geo-appr.
• Q1 is the easiest (1 data layer); them is the fastest.
• Q6-Q7 are the most difficult (3 data layers); only geo-poly can evaluate them.
• Q2-Q5 are of intermediate difficulty (2 data layers); geo-appr is preferred, because geo-poly
spends too much time in source selection and them spends more time in planning and execution.

Validating Land-Usage Data
Datasets
• Austrian Land Parcel Identification System (INVEKOS)
o crop parcels in Austria and the owners' self-declaration about the
crops grown in each parcel
• Land Use and Cover Area Survey (LUCAS)
o agro-environmental and soil data by field observation of
geographically referenced points
• Task: Validate crop-type data of INVEKOS using LUCAS
• Crop-type map provided by UNITN
• 14.1 million triples, ~4GB of data in N-triples format

Queries
Q1 (positive validation): given a LUCAS instance, return the closest INVEKOS instance if it is within 10 meters and their crop types match
Q2 (negative validation): given a LUCAS instance, return the closest INVEKOS instance if it is within 10 meters and their crop types do not match
Q3 (irrelevant): given a LUCAS instance, return it if there is no INVEKOS instance within 10 meters
Example: 3 ground observations located on the roads adjacent to field parcels, used for
crop-type validation of the field dataset. 2 of them (the green ones) provide a positive
validation and the other one provides a negative validation.
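The three query types partition the LUCAS observations into three outcomes. The sketch below shows that decision logic on toy data; it assumes each INVEKOS parcel is reduced to a single reference point in planar coordinates (the real parcels are polygons and the real queries are GeoSPARQL), and the function classify is hypothetical.

```python
import math


def classify(lucas, invekos, max_distance=10.0):
    """Return 'positive', 'negative' or 'irrelevant' for one LUCAS observation.

    lucas is (x, y, crop_type); invekos is a list of (x, y, crop_type) points.
    An observation is irrelevant when no parcel lies closer than max_distance.
    """
    nearest, nearest_dist = None, math.inf
    for parcel in invekos:
        d = math.hypot(parcel[0] - lucas[0], parcel[1] - lucas[1])
        if d < nearest_dist:
            nearest, nearest_dist = parcel, d
    if nearest is None or nearest_dist >= max_distance:
        return "irrelevant"                      # Q3
    if nearest[2] == lucas[2]:
        return "positive"                        # Q1: crop types match
    return "negative"                            # Q2: crop types differ


# Made-up parcels and observations, coordinates in metres.
parcels = [(0.0, 0.0, "potato"), (100.0, 0.0, "wheat")]
outcomes = [
    classify((3.0, 4.0, "potato"), parcels),
    classify((3.0, 4.0, "wheat"), parcels),
    classify((50.0, 50.0, "potato"), parcels),
]
```

The strict comparison mirrors the `< 10` filter used in the within-distance queries; the nearest-parcel search is what the ORDER BY and LIMIT 1 constructs express in the actual queries.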

Experimental results
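• PostGIS: all data in a single PostGIS database. semagrow-std: Semagrow without the within-distance
optimization. semagrow-opt: Semagrow with the optimization.
• Semagrow without optimization is slower than, but comparable to, standalone PostGIS.
• Optimized Semagrow is faster by more than an order of magnitude (on average about 30 times
faster than PostGIS and about 46 times faster than semagrow-std).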
Query execution time
      #queries   PostGIS               semagrow-std          semagrow-opt
                 total      average    total      average    total      average
Q1    2488       54 hours   78.6 sec   83 hours   120 sec    106 mins   2.6 sec
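The totals and averages in the table are tied together by total = #queries × average, which gives a quick consistency check and the average speedup of each configuration relative to semagrow-opt. The snippet uses only the numbers reported on the slide; the variable names are arbitrary.

```python
N_QUERIES = 2488
AVERAGE_SEC = {"PostGIS": 78.6, "semagrow-std": 120.0, "semagrow-opt": 2.6}

# Total running time in hours implied by the per-query averages.
totals_hours = {name: N_QUERIES * avg / 3600 for name, avg in AVERAGE_SEC.items()}

# How many times slower each configuration is than semagrow-opt, on average.
slowdown = {name: avg / AVERAGE_SEC["semagrow-opt"] for name, avg in AVERAGE_SEC.items()}

for name in AVERAGE_SEC:
    print(f"{name}: {totals_hours[name]:.1f} h in total, {slowdown[name]:.1f}x semagrow-opt")
```

The implied totals round to the reported 54 and 83 hours; for semagrow-opt the rounded average gives roughly 108 minutes, close to the reported 106 mins.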

Combining EE data with public endpoints
Datasets and Queries
• Semagrow endpoint in Hopsworks-TEP infrastructure.
o Endpoint 1:
▪ Strabo2 endpoint already deployed in Hopsworks-TEP infrastructure
▪ Contains ExtremeEarth data
o Endpoint 2:
▪ Public Strabon endpoint
▪ Contains GADM data for Germany
• Federated query for demo (“Query1” for a specific administrative region)
o Regions where precipitation in Quarter 2 of 2021 was more than 15% below the
normal rainfall, that are equipped with irrigation, and that intersect the state of Brandenburg
o Semagrow operates as follows:
▪ Retrieve the WKT of the state of Brandenburg from Endpoint 2
▪ Retrieve all relevant EE data that are found within that WKT from Endpoint 1
o Returns 447 results.

27
Conclusions
• We have developed a new version of Semagrow. Semagrow is now the first federation engine
able to federate multiple big linked geospatial data sources.
• We have developed a new version of the KOBE benchmarking engine, which is a useful tool for
benchmarking federated query processors.
• We have applied Semagrow to several exercises and use cases from the ExtremeEarth project
(land-usage data validation, combination of snow-cover and crop-type data for food security, etc.)
