The tutorial will be presented on May 27, 2012 at the 9th Extended Semantic Web Conference (ESWC 2012).
Short description of the tutorial:
The tutorial describes the traditional optimize-then-execute paradigm implemented in existing RDF engines and its main drawbacks when a large volume of data needs to be remotely accessed. As a solution to overcome limitations of current query processing approaches, we will present existing adaptive query processing techniques defined in the context of database management systems, and their applicability to the Semantic Web. Also, we will describe current solutions that have been proposed in the context of the Semantic Web to access remote data. The target audience includes researchers and practitioners who develop or use query engines to consume Linked and Big Data through SPARQL endpoints. Participants will learn the limitations of existing RDF query engines and how current techniques can be extended to access remote data from Linked Data sets and to hide delays caused by unpredictable data transfers and dataset availability. A hands-on session will allow attendees to evaluate the performance and robustness of existing approaches.
Adaptive Semantic Data Management Techniques for
Federations of Endpoints - Tutorial Description
Maria-Esther Vidal1, Edna Ruckhaus1, Maribel Acosta1,2, Cosmin Basca3, Gabriela Montoya1
1 Universidad Simón Bolívar, Venezuela
{mvidal, ruckhaus, macosta, gmontoya}@ldc.usb.ve
2 Institute AIFB, Karlsruhe Institute of Technology, Germany
Maribel.Acosta@aifb.uni-karlsruhe.de
3 Department of Informatics, University of Zurich, Switzerland
basca@ifi.uzh.ch
January 20, 2012
Abstract
Emerging technologies that support networks of sensors or mobile smartphones are making
available an extremely large volume of data or Big Data; additionally, in the context of the
Cloud of Linked Data, a large number of huge RDF linked datasets have become available, and
this number keeps growing. Simultaneously, although scalable and efficient RDF engines that
follow the traditional optimize-then-execute paradigm have been developed to locally access
RDF data, SPARQL endpoints have been implemented for remote query processing. Given
the size of existing datasets, lack of statistics to describe available sources, and unpredictable
conditions of remote queries, existing solutions are still insufficient. First, the most efficient
RDF engines base their query processing algorithms on physical access and storage structures
that are locally stored; however, because of the size of existing linked datasets, loading the
data and their links is not always feasible. Second, remote linked data query processing can
be extremely costly because of the lack of query planning; also, current techniques cannot
adapt to unpredictable data transfers or data availability, so executions can fail.
To overcome these limitations, physical query operators and execution engines need to be
able to access remote data and adapt query execution schedulers to data availability. In this
tutorial we present the basis of adaptive query processing frameworks defined in the database
area, and their applicability in the Linked and Big Data context where data can be accessed
through SPARQL endpoints. This tutorial targets any conference attendee who wants to learn about
the limitations of existing RDF engines, adaptive query processing techniques, and how traditional
RDF data management approaches can be adapted to runtime conditions and extended to
access a large volume of data distributed in federations of SPARQL endpoints. The first edition
of this tutorial was presented at ESWC 2011.
1 Tutorial Description
1.1 Aims and Target Audience
The tutorial describes the traditional optimize-then-execute paradigm implemented in existing RDF
engines and its main drawbacks when a large volume of data needs to be remotely accessed. As a
solution to overcome limitations of current query processing approaches, we will present existing
adaptive query processing techniques defined in the context of database management systems, and
their applicability to the Semantic Web. Also, we will describe current solutions that have been
proposed in the context of the Semantic Web to access remote data. The target audience includes
researchers and practitioners that develop or use query engines to consume Linked and Big Data
through SPARQL endpoints. Participants will learn the limitations of existing RDF query engines
and how current techniques can be extended to access remote data from Linked datasets and to hide
delays caused by unpredictable data transfers and dataset availability. A hands-on session will
allow attendees to evaluate the performance and robustness of existing approaches.
1.2 Presentation Method and Technical Requirements
We propose a full-day tutorial. Theoretical issues will be presented first; then, a hands-on session
will allow attendees to evaluate existing query processing approaches and determine the pros and cons
of each one. The morning session will comprise a short introduction, three lectures, and one
fifteen-minute coffee break. In the introduction, the core concepts of a data management engine will
be presented. Next, in the first and second lectures, the query execution and optimization techniques
of the classical optimize-then-execute paradigm will be described, and the limitations of existing
SPARQL endpoints and of existing approaches to query Linked and Big Data will be illustrated.
Then, adaptive query processing techniques proposed in the context of Databases and the Semantic
Web will be presented in the third lecture. In the afternoon session, the applicability of existing
approaches to consume Linked Data will be described and an evaluation of state-of-the-art engines
will be conducted. We expect participants to have only a basic understanding of RDF and SPARQL.
2 Justification for the tutorial in ESWC 2012
In the context of the Cloud of Linked Data, a large number of diverse datasets have become
available, and the published data and links have grown exponentially in recent years.
Billions of triples from life science research groups, government agencies, Wikipedia, or
entertainment organizations currently comprise the Cloud.
Following the guidelines to publish and link data on the Cloud, a great number of
SPARQL endpoints that support remote query processing over linked data have become available,
and this number keeps growing. Additionally, to scale up to the size of existing datasets, RDF
engines have implemented storage and access structures and query processing techniques for local
query processing. However, although the semantic data management community actively works
on more suitable linked data query processing techniques, access to the Cloud of Linked datasets
is still limited and insufficient because data have to be locally stored or some SPARQL endpoints
only support very lightweight use. To successfully execute real-world queries, in addition to accessing
remote data, existing query solutions have to be able to adapt query execution schedulers to data
availability. This tutorial aims to illustrate the limitations of existing approaches and how they can be
extended to be well suited to remote query processing and runtime conditions. We consider that
this tutorial is ideally co-located with ESWC 2012 because research institutions that traditionally
attend ESWC actively contribute to the domain of RDF data management. In particular,
one of the conference research tracks is on semantic data management, and query processing of
semantic data is one of its topics of interest. Thus, many of the conference attendees could see the
tutorial as a place to discuss possible solutions to current semantic data management limitations.
3 Outline of the Tutorial
The goal of the tutorial is to highlight the limitations of existing RDF query engines and to introduce
the basic concepts of existing adaptive query processing techniques and how they can be used to
effectively and efficiently access SPARQL endpoints.
3.1 Content
The tutorial will cover traditional data management solutions that implement the
optimize-then-execute paradigm and their pros and cons for Linked Data query processing; novel storage and
access data structures, as well as query optimization and execution techniques implemented by
state-of-the-art RDF engines, will be described. Then, adaptive frameworks defined in the database area to
manage remote query processing will be analyzed; adaptive operators such as symmetric hash joins
(binary and n-ary), routing operators, and adaptive engines will be studied. Finally, the applicability of
adaptive techniques will be illustrated with existing query processing engines for federations of
SPARQL endpoints. Attendees will evaluate the performance and robustness of state-of-the-art
approaches during a hands-on session; observed results will be discussed with the attendees.
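As an illustration of the adaptive operators mentioned above, the following is a minimal sketch of a binary symmetric hash join, not the implementation of any engine covered in the tutorial. It assumes tuples from both inputs arrive interleaved in a single stream; the function name `symmetric_hash_join`, the `"L"`/`"R"` side tags, and the `ex:` sample rows are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals, key_left, key_right):
    """Join two inputs whose tuples arrive interleaved, in any order.

    arrivals: iterable of (side, row) pairs, side being "L" or "R".
    Yields (left_row, right_row) as soon as both halves have arrived.
    """
    # One hash table per input; neither input has to be read completely first.
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    key_of = {"L": key_left, "R": key_right}
    for side, row in arrivals:
        other = "R" if side == "L" else "L"
        key = key_of[side](row)
        # Insert into the table of the side the row came from ...
        tables[side][key].append(row)
        # ... then probe the table of the opposite side for matches.
        for match in tables[other].get(key, ()):
            yield (row, match) if side == "L" else (match, row)


# Hypothetical sample data: rows from two sources arriving out of order.
arrivals = [
    ("L", ("ex:a", "label-a")),
    ("R", ("ex:b", 20)),
    ("R", ("ex:a", 10)),
    ("L", ("ex:b", "label-b")),
    ("R", ("ex:c", 30)),
]
first = lambda row: row[0]
results = list(symmetric_hash_join(arrivals, first, first))
```

Because every arriving tuple is both inserted and probed, a result is produced as soon as its second half arrives, regardless of which source delivers first; this is what lets the operator keep producing answers when one source is delayed.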
3.2 Schedule
Morning Session
Introduction (20 minutes):
• Traditional data management system architecture and its main components.
• Basic terminology.
Lecture 1: The Optimize-then-Execute Paradigm (50 minutes):
• Cost-based optimization techniques.
• Traditional iterator model architecture.
• Centralized data management physical operators.
• Centralized data management query engines.
Lecture 2: Existing RDF Engines (50 minutes):
• Query optimization and execution techniques in existing RDF engines such as RDF-3X [3].
• SPARQL endpoints and their execution model.
• The SPARQL 1.1 Federation extension [6].
• RDF engines for query processing against federations of SPARQL endpoints;
approaches such as FedX [5] and ARQ [7] will be studied.
Coffee-Break (15 minutes)
Lecture 3: Adaptive Query Processing Techniques (100 minutes):
• Intra-operator solutions: adaptive physical operators such as symmetric hash joins
and n-ary joins.
• Inter-operator solutions: Eddy operators, query processing schedulers, and routing
policies.
• Adaptive query engines.
Lunch (120 minutes)
Afternoon Session
Lecture 4: Adaptive Approaches for Federations of SPARQL Endpoints (50 minutes):
• Requirements for query processing in Federations of SPARQL endpoints.
• Existing benchmarks for evaluating query processing engines for Federations of
SPARQL endpoints, e.g., FedBench [4].
• Adaptive query processing engines for Federations of endpoints; approaches such as
ANAPSID [1] and Avalanche [2] will be studied.
Coffee-Break (15 minutes)
Hands-on Session: RDF Storage Systems Evaluation (100 minutes): existing benchmarks
will be used to evaluate performance and robustness of state-of-the-art solutions; ARQ,
FedX, ANAPSID and Avalanche will be analyzed.
Analysis and Discussion of the Evaluation Results (30 minutes): results of the evaluation
will be analyzed and discussed with the attendees.
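The inter-operator solutions listed under Lecture 3 (Eddy operators and routing policies) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm of any engine named in this proposal: the operators are plain selection predicates, routing uses a lottery-style ticket policy, and the names `eddy`, `is_even`, and `above_90` are hypothetical.

```python
import random


def eddy(rows, predicates, seed=0):
    """Route each row through all predicates in a per-row, adaptive order.

    An operator is credited one ticket when it receives a row and debited
    one when it returns the row, so operators that drop many rows collect
    tickets and tend to be tried earlier.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    tickets = [1] * len(predicates)
    output = []
    for row in rows:
        pending = set(range(len(predicates)))
        alive = True
        while pending and alive:
            candidates = sorted(pending)
            weights = [tickets[i] for i in candidates]
            chosen = rng.choices(candidates, weights=weights, k=1)[0]
            tickets[chosen] += 1  # credit: operator received a row
            if predicates[chosen](row):
                # debit: the row was returned to the eddy (floor of one ticket)
                tickets[chosen] = max(1, tickets[chosen] - 1)
                pending.remove(chosen)
            else:
                alive = False  # row dropped; the operator keeps its ticket
        if alive:
            output.append(row)
    return output, tickets


def is_even(x):
    return x % 2 == 0


def above_90(x):
    return x > 90


output, tickets = eddy(range(100), [is_even, above_90])
```

The routing order affects only how much work is done per row, never which rows are produced: a row reaches the output exactly when it satisfies every predicate.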
4 Tutorial Former Editions
The first edition of the tutorial, named Adaptive Semantic Data Management Techniques for Linked
Data, was held at ESWC 2011 (http://www.eswc2011.org/content/tutorials); it was a half-day
tutorial that included neither a hands-on session nor the evaluation of state-of-the-art approaches
such as Avalanche, ARQ, ANAPSID, and FedX.
5 Information of Presenters
Edna Ruckhaus is a Full Professor of the Computer Science department at the Universidad Simón
Bolívar, Venezuela, since 1998, where she has taught several Database courses at the
undergraduate level. Visiting scholar of the research group Mindswap (Maryland Information and
Network Dynamic Lab Semantic Web Agents Project), 2004-2005. Over 20 publications in
international and national conferences and journals. She has been a reviewer and has participated
in the Program Committee of several International Conferences. Member of the Organizing
Committee of the Workshop on Applications of Logic Programming to the Semantic Web
and Semantic Web Services (ALPSWS2007), co-located with the International Conference on
Logic Programming. Co-Chair of the Organizing Committee of the ESWC 2011 and 2012
Workshops on Resource Discovery; she co-organized and co-lectured the tutorial on Adaptive
Semantic Data Management Techniques for Linked Data at ESWC 2011.
Maria-Esther Vidal is a Full Professor of the Computer Science department at the
Universidad Simón Bolívar, Venezuela, where she has taught several Database and Semantic
Web courses at the undergraduate and graduate levels. Prof. Vidal has also been a Research
Associate and Visiting Researcher at the Institute of Advanced Computer Studies of the University
of Maryland, and Visiting Professor at Universidad Politécnica de Catalunya, University of
Laguna, Spain, and Leipzig, Germany. She has participated in several international projects
supported by NSF (USA), AECI (Spain) and CNRS (France), and advised six PhD students
and more than 55 master and undergraduate students. Professor Vidal has published more
than 60 papers in International Conferences and Journals of the Database and Semantic
Web areas. She has been a reviewer and has participated in the Program Committee of
several International Journals and Conferences. Co-chair of the Workshop on Resource Discovery
(RED2010) and accompanying professor of On the Move Academy (OTMa). Co-Chair of
the Organizing Committee of the ESWC 2011 and 2012 Workshops on Resource
Discovery; she co-organized and co-lectured the tutorial on Adaptive Semantic Data Management
Techniques for Linked Data at ESWC 2011.
Maribel Acosta is a PhD student at Institute AIFB, Karlsruhe Institute of Technology, Germany.
She holds a Master's degree in Computer Science from the Universidad Simón Bolívar, where she was a
Teaching Assistant and has taught Logic, Discrete Math, and Databases labs at the
undergraduate level. She has seven publications in international conferences and workshops.
Her topics of interest are Adaptive Query Execution techniques for Linked and Big Data.
Cosmin Basca is a PhD student at the University of Zurich, Department of Informatics,
Switzerland. He holds a Master's degree in Computer Science from “Lucian Blaga” University of Sibiu,
Romania, where he did research in image processing and computer vision. Later, as
part of the Digital Enterprise Research Institute in Galway, Ireland, he focused his research on the
Semantic Web, specifically Semantic Data Management. His research interests include, among
others, large-scale distributed graph data management systems and algorithms, and Linked
Data.
Gabriela Montoya is a Lecturer of the Computer Science Department at the Universidad Simón
Bolívar, where she has taught Logic, Algorithms and Programming Languages courses and
labs at the undergraduate level. She holds a Master's degree in Computer Science from the Universidad
Simón Bolívar and is currently a doctoral student at the same university; her topics of
interest are Data Integration and Query Processing techniques in Emerging Infrastructures.
References
[1] M. Acosta, M.-E. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. ANAPSID: AN Adaptive
query ProcesSing engIne for sparql enDpoints. In Proceedings of the International Semantic
Web Conference (ISWC), 2011.
[2] C. Basca and A. Bernstein. Avalanche: Putting the Spirit of the Web back into Semantic Web
Querying. In SSWS2010 Workshop, Shanghai, China, 2010.
[3] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB, 1(1), 2008.
[4] M. Schmidt, O. Görlitz, P. Haase, A. Schwarte, G. Ladwig, and T. Tran. FedBench: A
benchmark suite for federated semantic data query processing. In International Semantic Web
Conference (ISWC), 2011.
[5] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization techniques
for federated query processing on linked data. In International Semantic Web Conference (1),
pages 601–616, 2011.
[6] E. P. Steve Harris, Andy Seaborne. SPARQL 1.1 Query Language, June 2010.
[7] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph
pattern optimization using selectivity estimation. In International Semantic Web Conference
(ISWC), Beijing, China, 2008. ACM.