4. 4
Federated query processors
• Systems that seamlessly integrate data from multiple
remote dataset servers.
• Receive a query, issue the necessary subqueries to the
remote servers, combine the results accordingly, and
present the result to the user.
• Used extensively in Linked Data, where many data
providers publish their datasets through SPARQL
endpoints.
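A minimal sketch of the receive, subquery, combine loop described above. The in-memory `SOURCES` dictionary stands in for remote dataset servers, and the names `subquery` and `federated_join` are hypothetical; a real federated query processor would send SPARQL subqueries over HTTP instead.

```python
# Toy stand-ins for two remote dataset servers; each holds (s, p, o) triples.
# All names and data here are illustrative assumptions.
SOURCES = {
    "crops": [("field1", "cropType", "potato"), ("field2", "cropType", "wheat")],
    "admin": [("field1", "locatedIn", "Tyrol"), ("field2", "locatedIn", "Styria")],
}

def subquery(source, predicate):
    """Simulate a subquery to one server: all (subject, object) pairs for a predicate."""
    return [(s, o) for (s, p, o) in SOURCES[source] if p == predicate]

def federated_join(pred_a, pred_b):
    """Answer '?s pred_a ?a . ?s pred_b ?b' over all sources."""
    # 1. Issue the necessary subqueries to every source.
    left = [row for src in SOURCES for row in subquery(src, pred_a)]
    right = [row for src in SOURCES for row in subquery(src, pred_b)]
    # 2. Combine the partial results with a hash join on the shared subject.
    index = {}
    for s, b in right:
        index.setdefault(s, []).append(b)
    # 3. Present the combined result to the caller.
    return sorted((s, a, b) for s, a in left for b in index.get(s, []))

results = federated_join("cropType", "locatedIn")
print(results)
```

Note that this naive version sends every subquery to every source; avoiding that is the job of source selection.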
5. 5
Semagrow federated query processor
• Semagrow is an open source dynamic data integration system:
o makes the most of all public data, regardless of their size, update rate, and schema.
o presents to client applications a single, unified SPARQL endpoint that federates multiple data sources.
o manages both syntactic and semantic heterogeneity.
• The federated data sources may serve data that use different vocabularies and codelists
o Semagrow dynamically transforms responses from the different data sources to match the vocabularies used in the
query.
• The federated data sources may offer SPARQL, GeoSPARQL, SQL, or CQL (CassandraQL) APIs.
o Semagrow processes SPARQL queries and appropriately re-writes the sub-queries for each data source.
o Semagrow fills in the missing expressivity, e.g. arbitrary joins for CQL sources
6. 6
History
• Originally developed in FP7 SemaGrow:
o SPARQL endpoint federation engine [1]
o Multi-threaded application, deployed directly on server or VM
o Dynamic vocabulary mapping
• Extended in H2020 BigDataEurope:
o Containerization and ability to deploy on Cloud infrastructures
o Re-engineered architecture to allow executor plugins that manage syntactic heterogeneity [2]
o Limited support for big linked geospatial data [3]
▪ The federation could include only one geospatial data source
• During ExtremeEarth, we developed a new version of Semagrow. Semagrow is now the first
federation engine able to federate multiple geospatial data sources.
[1] A. Charalambidis, A. Troumpoukis, S. Konstantopoulos: SemaGrow: optimizing federated SPARQL queries. In SEMANTICS 2015, Vienna, Austria, September 15-17, 2015
[2] S. Konstantopoulos, A. Charalambidis, G. Mouchakis, A. Troumpoukis, J. Jakobitsch, V. Karkaletsis: Semantic Web Technologies and Big Data Infrastructures: SPARQL Federated
Querying of Heterogeneous Big Data Stores. In ISWC 2016 (Posters & Demos), Kobe, Japan, October 17-21, 2016
[3] A. Davvetas, I. Klampanos, S. Andronopoulos, G. Mouchakis, S. Konstantopoulos, A. Ikonomopoulos, V. Karkaletsis: Big Data Processing and Semantic Web Technologies for Decision
Making in Hazardous Substance Dispersion Emergencies. In ISWC 2017 (Posters, Demos & Industry Tracks), Vienna, Austria, October 21-25, 2017
7. 7
The Semagrow architecture
• Source selector: Identifies which of the federated sources refer to which parts of the query.
• Query planner: Constructs an efficient query execution plan.
• Query executor: Evaluates the query execution plan and returns the results to the client.
8. 8
Source Selector
● Identifies which of the federated sources refer to which parts of the query
● Should exclude as many redundant sources as possible, but without removing any necessary sources
● Combines two mechanisms:
○ Thematic data:
■ Extended with all the state-of-the-art source selection methods
● Predicate and Class metadata
● ASK queries (and cache)
● URI-Prefix-based source selection
○ Geospatial data:
■ A novel approach that targets geospatial data sources [4]
● Annotate all federated data sources with a bounding polygon
● Use such a summary to filter out sources that refer to irrelevant areas
[4] A. Troumpoukis, S. Konstantopoulos, N. Prokopaki-Kostopoulou: A Geospatial Source Selector for Federated GeoSPARQL Querying. To be submitted to SWJ.
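As an illustration of the predicate-metadata mechanism listed above, here is a minimal sketch of thematic source selection. The endpoint URLs, the `ex:` vocabulary, and the function name `select_sources` are hypothetical and are not taken from Semagrow; the sketch only shows the general idea of excluding sources whose metadata rules them out.

```python
# Hypothetical per-source metadata: the predicates each endpoint is known to serve.
PREDICATE_INDEX = {
    "http://example.org/crops/sparql": {"ex:cropType", "geo:asWKT"},
    "http://example.org/snow/sparql": {"ex:snowCover", "geo:asWKT"},
    "http://example.org/admin/sparql": {"ex:name", "geo:asWKT"},
}

def select_sources(triple_patterns):
    """Map each triple pattern (s, p, o) to the set of sources that may answer it."""
    selection = {}
    for pattern in triple_patterns:
        predicate = pattern[1]
        if predicate.startswith("?"):
            # Unbound predicate: the metadata cannot exclude any source.
            candidates = set(PREDICATE_INDEX)
        else:
            # Keep only the sources whose metadata lists this predicate.
            candidates = {
                src for src, preds in PREDICATE_INDEX.items() if predicate in preds
            }
        selection[pattern] = candidates
    return selection

patterns = [
    ("?f", "ex:cropType", "?c"),
    ("?f", "geo:asWKT", "?w"),
    ("?s", "?p", "?o"),
]
selection = select_sources(patterns)
```

The `geo:asWKT` pattern matches every source here, which is why predicate metadata alone cannot prune geospatial sources and a dedicated geospatial summary is useful.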
9. 9
Geospatial source selection
Method
• Each data source is tagged with a bounding polygon that contains all geometries of the source.
• For each triple pattern of the form ?x geo:asWKT ?y, prune the set of sources obtained by the wrapped
source selectors w.r.t. its relevant geospatial filters and the bounding polygons of the set of sources.
o Geospatial selections: geospatial filters with one free variable
▪ ?s geo:asWKT ?x.
FILTER (geof:sfIntersects(?x, KNOWN_WKT))
▪ If the bounding polygon of a candidate source for the pattern is disjoint
from KNOWN_WKT, then the source is irrelevant.
o Geospatial joins: geospatial filters with two free variables
▪ ?s1 geo:asWKT ?x.
?s2 geo:asWKT ?y.
FILTER (geof:sfWithin(?x, ?y))
▪ If the bounding polygon of a candidate source for the first pattern is
disjoint from the bounding polygons of all candidate sources for the
second pattern, then the former source is irrelevant.
• We consider standard spatial relations (all apart from disjoint) and within-distance comparisons.
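The two pruning rules above can be sketched as follows. As a simplifying assumption, each source summary is an axis-aligned bounding box rather than a bounding polygon, and the known geometry is also represented by its box; the names `disjoint`, `prune_selection`, and `prune_join` are hypothetical. The test is conservative: disjoint boxes imply disjoint geometries, so no relevant source is removed.

```python
from typing import Dict, List, Tuple

# Simplified summary: (min_x, min_y, max_x, max_y) instead of a bounding polygon.
Box = Tuple[float, float, float, float]

def disjoint(a: Box, b: Box) -> bool:
    """True if the two boxes share no point."""
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]

def prune_selection(candidates: List[str], summaries: Dict[str, Box], known: Box) -> List[str]:
    """Geospatial selection: drop sources whose summary is disjoint from the known geometry."""
    return [s for s in candidates if not disjoint(summaries[s], known)]

def prune_join(left: List[str], right: List[str], summaries: Dict[str, Box]) -> List[str]:
    """Geospatial join: drop a left source if it is disjoint from every right candidate."""
    return [
        l for l in left
        if any(not disjoint(summaries[l], summaries[r]) for r in right)
    ]

summaries = {
    "A": (0.0, 0.0, 10.0, 10.0),
    "B": (20.0, 20.0, 30.0, 30.0),
    "C": (8.0, 8.0, 25.0, 25.0),
}
kept_for_selection = prune_selection(["A", "B", "C"], summaries, (1.0, 1.0, 2.0, 2.0))
kept_for_join = prune_join(["A", "B"], ["B"], summaries)
```

This pruning is only sound for relations that imply the two geometries are not disjoint, which matches the restriction to all standard spatial relations apart from disjoint.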
10. 10
Query Planner
• Constructs an efficient query execution plan
• Uses endpoint statistics and dynamic programming to find an optimal query plan.
• Federated geospatial joins: Bind join strategy with filter-pushdown optimization
o Reduction of the communication cost
o Geospatial functions are evaluated faster in the sources (spatial index)
• Evaluation of complex thematic queries:
o Examples:
▪ Subqueries (inner SELECT queries)
▪ ORDER BY, LIMIT 1
▪ FILTER NOT EXISTS
o Such queries appear in the use cases (validation of land-usage data)
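To make the dynamic-programming step concrete, here is a minimal sketch of join ordering over subsets of triple patterns. The cost model (sum of estimated intermediate result sizes), the restriction to left-deep plans, and the input statistics are all assumptions made for illustration; they are not Semagrow's actual cost model or statistics format.

```python
from itertools import combinations

def best_plan(cardinalities, selectivity):
    """Find the cheapest left-deep join order by dynamic programming over subsets.

    cardinalities: {pattern: estimated number of results}
    selectivity:   {frozenset({a, b}): join selectivity}, defaulting to 1.0
    Returns (cost, estimated_rows, plan), where plan is a nested tuple.
    """
    patterns = sorted(cardinalities)
    # best[subset] = (cost, estimated_rows, plan); single patterns cost nothing.
    best = {frozenset([p]): (0.0, float(cardinalities[p]), p) for p in patterns}
    for size in range(2, len(patterns) + 1):
        for combo in combinations(patterns, size):
            subset = frozenset(combo)
            for p in combo:  # p is the pattern joined last
                rest = subset - {p}
                cost, rows, plan = best[rest]
                sel = 1.0
                for q in rest:
                    sel *= selectivity.get(frozenset([p, q]), 1.0)
                out_rows = rows * cardinalities[p] * sel
                new_cost = cost + out_rows  # pay for each intermediate result
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, out_rows, (plan, p))
    return best[frozenset(patterns)]

cost, rows, plan = best_plan(
    {"a": 1000, "b": 10, "c": 100},
    {frozenset(["a", "b"]): 0.01, frozenset(["b", "c"]): 0.05},
)
print(cost, rows, plan)
```

With these made-up statistics the planner joins the two small, selective patterns first and the large pattern last, which keeps the intermediate results small.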
11. 11
Query Executor
• Evaluates the query execution plan and returns the results to the client.
o provides a mechanism for issuing queries to the remote endpoints
o provides an implementation of all geospatial operators that may appear in the plan
▪ GeoSPARQL, stSPARQL functions
• PostGIS connector:
o Semagrow allows executor plugins for non-SPARQL endpoints
o a plugin for communicating directly with PostGIS databases with shapefile data that contain
geometric shapes exclusively.
• Optimization of federated geospatial within-distance joins [5]
o Insert additional geospatial filters in the source queries
o Filter out shapes that are “too far away” using the spatial index of the source
[5] A. Troumpoukis, S. Konstantopoulos, N. Prokopaki-Kostopoulou: A Geospatial Join Optimization for Federated GeoSPARQL Querying. To be submitted to ESWC 2022.
12. 12
Geospatial join optimization
Method
• Situation: bind join with filter pushdown optimization. Example:
?s1 geo:asWKT ?x .
?s2 geo:asWKT ?y .
FILTER (geof:distance(?x, ?y, uom:metre) < 10).
• Such queries appear frequently in the use cases
• The within-distance operation is computationally expensive: It cannot be
answered from the spatial index.
• Intuition:
o ?x is bound from the left part of the federated join, thus the filter during the query execution phase looks like this:
FILTER (geof:distance(KNOWN_WKT, ?y, uom:metre) < 10).
o To help with the evaluation of the remote query by the federated endpoint, we can add the filter
FILTER (geof:sfIntersects(?y, CONSTRUCTED_BOX))
where CONSTRUCTED_BOX is equal to the bounding box of the buffer of size D around KNOWN_WKT, and D is the distance threshold (10 metres in this example).
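The construction of CONSTRUCTED_BOX can be sketched as follows. The sketch assumes planar coordinates expressed in the same unit as the distance threshold (with longitude/latitude coordinates the threshold would first have to be converted), and it takes the geometry as a list of vertices rather than parsing WKT. The function names are hypothetical. The bounding box of a buffer of size D is simply the geometry's own bounding box grown by D on every side.

```python
def buffer_bounding_box(points, distance):
    """Bounding box of the buffer of size `distance` around a geometry.

    points: list of (x, y) vertices of the known geometry (planar coordinates).
    Returns (min_x, min_y, max_x, max_y).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) - distance, min(ys) - distance,
            max(xs) + distance, max(ys) + distance)

def box_to_wkt(box):
    """Serialize a box as a closed WKT polygon ring."""
    min_x, min_y, max_x, max_y = box
    return (f"POLYGON(({min_x} {min_y}, {max_x} {min_y}, {max_x} {max_y}, "
            f"{min_x} {max_y}, {min_x} {min_y}))")

def extra_filter(points, distance, var="?y"):
    """Build the additional sfIntersects filter that the source can answer from its spatial index."""
    wkt = box_to_wkt(buffer_bounding_box(points, distance))
    return f'FILTER (geof:sfIntersects({var}, "{wkt}"^^geo:wktLiteral))'

known_geometry = [(0, 0), (4, 2)]
print(extra_filter(known_geometry, 10))
```

The added filter does not replace the distance filter: shapes inside the box can still be farther than D from the geometry, so the original geof:distance test remains in the query and the box only discards candidates early.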
13. 13
Semagrow in ExtremeEarth
• Integration within Hopsworks
• Provides an extra layer over big linked
geospatial data store Strabo2
• Can be used to combine the data stored
in Strabo2 with additional external
geospatial endpoints.
15. 15
History
• During the benchmarking activities of the FP7 SemaGrow project, we needed a
framework to help us conduct experiments.
• Originally developed in H2020 BigDataEurope:
o Docker Containerization to abstract from the installation intricacies of each system
• During ExtremeEarth we explored this idea even further…
16. 16
The KOBE Open Benchmarking Engine
• KOBE is a framework for benchmarking federated query engines.
• Features:
o Automation of the various tasks:
deployment, initialization of dataset servers and federation engines,
experiment execution
o Reproducibility in different environments:
each component in its own Docker container
o Declarative specifications:
formalism that hides from the user the details of provisioning and
orchestrating
o Simulating real-life scenarios:
network delays (dataset server latency limitations)
o Results presentation:
collection of logs and visualization in a WebUI
o Extensibility:
supports the integration of new benchmarks, new federators and
new remote dataset servers
17. 17
The KOBE Open Benchmarking Engine (cont.)
• Re-engineered KOBE into 3 subsystems (Deployment, Networking, Logging)
• Technologies used: Docker, Kubernetes for orchestration, Istio for simulating delays, EFK stack for logs
• Command line interface for control, Kibana dashboards for viewing the results
[6] C. Kostopoulos, G. Mouchakis, A. Troumpoukis, N. Prokopaki-Kostopoulou, A. Charalambidis, S. Konstantopoulos: KOBE: Cloud-Native Open Benchmarking Engine for
Federated Query Processors. In ESWC 2021: 664-679
[7] C. Kostopoulos, G. Mouchakis, N. Prokopaki-Kostopoulou, A. Troumpoukis, A. Charalambidis, S. Konstantopoulos: KOBE: Cloud-native Open Benchmarking Engine for
Federated Query Processors. In ISWC (Demos/Industry) 2020: 325-330
18. 18
The KOBE Open Benchmarking Engine (cont.)
• Dataset servers: Virtuoso, Strabo2
• Federation engines: Semagrow, FedX
• Benchmarks: Fedbench, LargeRDFBench, OPFbench, Geographica2, Geofedbench.
• Detailed Documentation (step by step instructions for getting started, using and extending KOBE).
https://semagrow.github.io/kobe/ (publicly available)
20. 20
Combining Snow-cover data with Crop-type data
for Food Security
Datasets
• 3 data layers that cover Austria:
o Administrative, Snow cover, Crop type data
o each layer is partitioned geospatially
• Each dataset contains a single thematic layer and
refers to a specific polygonal area.
• 4.5 million triples, ~4GB of data in N-triples
• 34 GeoSPARQL endpoints.
We envisage that Austrian state governments publish crop datasets
for their own area of responsibility; and a further (different) entity
publishes a snow cover dataset that ignores state boundaries and
publishes its datasets according to a geographical grid.
Example: All snow-covered crops within a specific
area of interest (shown in red) appear only in 2 of the
total 12 datasets.
21. 21
Combining Snow-cover data with Crop-type data
for Food Security
Queries
Q1 municipalities intersecting a given polygon
Q2 snow-covered potato fields intersecting a given polygon
Q3 potato fields within 5km from snow and intersecting a given polygon
Q4 snow area within 5km from a given municipality
Q5 potato fields within a given municipality
Q6 snow-covered potato fields within a given municipality
Q7 potato fields within 5km from snow and within a given municipality
22. 22
Combining Snow-cover data with Crop-type data
for Food Security
Experimental results
Query  #layers  Query processing time
                geo-poly  geo-appr  them
Q1     1         0.200     0.205    0.180
Q2     2         0.985     0.525    0.755
Q3     2         5.245     1.215    1.810
Q4     2         8.785     7.940    9.025
Q5     2         0.605     0.445    0.520
Q6     3        15.535     n/a      n/a
Q7     3        39.670     n/a      n/a
• them uses no geospatial metadata; geo-poly and geo-appr use geospatial
metadata, and geo-poly has more precise boundaries than geo-appr.
• Q1 is the easiest (1 data layer); them is the fastest.
• Q6-Q7 are the most difficult (3 data layers);
only geo-poly can evaluate these queries.
• Q2-Q5 are of intermediate difficulty (2 data layers);
geo-appr is the preferred setup (geo-poly spends too much
time in source selection, them spends more
time in planning and execution).
23. 23
Validating Land-Usage Data
Datasets
• Austrian Land Parcel Identification System (INVEKOS)
o crop parcels in Austria and the owners' self-declaration about the
crops grown in each parcel
• Land Use and Cover Area Survey (LUCAS)
o agro-environmental and soil data by field observation of
geographically referenced points
• Task: Validate crop-type data of INVEKOS using LUCAS
• Crop-type map provided by UNITN
• 14.1 million triples, ~4GB of data in N-triples format
24. 24
Validating Land-Usage Data
Queries
Q1 (positive validation): given a LUCAS instance, return the closest
INVEKOS instance if it is within 10 meters and their crop types match
Q2 (negative validation): given a LUCAS instance, return the closest
INVEKOS instance if it is within 10 meters and their crop types do not match
Q3 (irrelevant): given a LUCAS instance, return it if there is
no INVEKOS instance within 10 meters
Example: 3 ground observations located in
the roads adjacent to field parcels, used for
crop-type validation of the field dataset. 2
of them (the green ones) provide a positive
and the other one provides a negative
validation.
25. 25
Validating Land-Usage Data
Experimental results
• PostGIS: all data in a single PostGIS database. semagrow-std: Semagrow without the within-distance optimization.
semagrow-opt: Semagrow with the optimization.
• Semagrow without the optimization is slower than, but comparable to, standalone PostGIS.
• Optimized Semagrow is faster by more than an order of magnitude (roughly 30 to 65 times, going by the total times reported).
Query execution time
Query  #queries  PostGIS              semagrow-std         semagrow-opt
                 total     average    total     average    total     average
Q1     2488      54 hours  78.6 sec   83 hours  120 sec    106 mins  2.6 sec
Q2     2488      54 hours  78.4 sec   82 hours  119 sec    99 mins   2.4 sec
Q3     2488      54 hours  78.6 sec   81 hours  117 sec    74 mins   1.8 sec
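The speedup of the optimized setup can be worked out directly from the total times in the table. The short calculation below uses only those reported totals, converted to minutes; the dictionary layout and the function name `speedups` are illustrative choices.

```python
# Total execution times from the table above, converted to minutes.
TOTAL_MINUTES = {
    "Q1": {"PostGIS": 54 * 60, "semagrow-std": 83 * 60, "semagrow-opt": 106},
    "Q2": {"PostGIS": 54 * 60, "semagrow-std": 82 * 60, "semagrow-opt": 99},
    "Q3": {"PostGIS": 54 * 60, "semagrow-std": 81 * 60, "semagrow-opt": 74},
}

def speedups(times):
    """Return how many times faster semagrow-opt is than each other setup."""
    opt = times["semagrow-opt"]
    return {name: total / opt for name, total in times.items() if name != "semagrow-opt"}

for query, times in TOTAL_MINUTES.items():
    ratios = speedups(times)
    print(query, {name: round(ratio, 1) for name, ratio in ratios.items()})
```

The ratios range from about 30 (Q1 against PostGIS) to about 65 (Q3 against semagrow-std).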
26. 26
Combining EE data with public endpoints
Datasets and Queries
• Semagrow endpoint in Hopsworks-TEP infrastructure.
o Endpoint 1:
▪ Strabo2 endpoint already deployed in Hopsworks-TEP infrastructure
▪ Contains ExtremeEarth data
o Endpoint 2:
▪ Public Strabon endpoint
▪ Contains GADM data for Germany
• Federated query for demo (“Query1” for a specific administrative region)
o Regions where precipitation in Quarter 2 of 2021 was more than 15% below the
normal rainfall, that are equipped with irrigation, and that intersect the state of Brandenburg
o Semagrow operates as follows:
▪ Retrieve the WKT of the state of Brandenburg from Endpoint 2
▪ Retrieve from Endpoint 1 all relevant EE data that are found within this WKT
o Returns 447 results.
27. 27
Conclusions
• We have developed a new version of Semagrow. Semagrow is now the first federation engine
able to federate multiple big linked geospatial data sources.
• We have developed a new version of the KOBE benchmarking engine, which is a useful tool for
benchmarking federated query processors.
• We have applied Semagrow to several exercises and use cases from the ExtremeEarth project
(Land-usage data validation, Combination of snow-cover and crop-type data for food security, etc.)