A presentation on my early work on the Mastro system. Some of this research is now part of the ontop system, some evolved into more optimised forms (also in ontop).
1. Mastro at Work: Experiences on Ontology-based Data Access
Domenico Fabio Savo (1), Domenico Lembo (1), Maurizio Lenzerini (1), Antonella Poggi (1),
Mariano Rodriguez-Muro (2), Vittorio Romagnoli (3), Marco Ruzzi (1), Gabriele Stella (3)
(1) Sapienza Università di Roma, lastname@dis.uniroma1.it
(2) Free University of Bozen-Bolzano, rodriguez@inf.unibz.it
(3) Banca Monte dei Paschi di Siena, firstname.lastname@banca.mps.it
May 2010
Mastro at Work Savo et. al.
2. Motivations
DL-Lite OBDA framework
OBDA: an integrated view and a semantically rich description of the domain, with mappings between the conceptual level and the data sources. Reasoning is exploited to overcome incompleteness.
[Figure: OBDA architecture. Queries are posed over the Ontology (Semantic Layer), which is linked through Mappings to the Data Sources (Data Layer).]
3. Motivations
DL-Lite OBDA framework
DL-Lite framework for OBDA
Components:
• A family of ontology languages: DL-Lite.
• A mapping technique for relational databases: Virtual ABoxes.
• A promising proposal.
• However, never evaluated in ‘the field’.
4. Motivations
The domain
• Joint project on OBDA by Banca Monte dei Paschi di Siena (MPS), Free University of Bozen-Bolzano, and Sapienza Università di Roma.
• Clusters of Connected Customers (CCCs).
• The data is used for risk estimation in the process of granting credit to bank customers.
6. Motivations
Problems and Solutions
• Problem: management of the data is now completely entrusted to the application experts rather than to the domain experts.
• Solution: OBDA has been used to answer queries posed over the CCCs ontology, aimed not only at easily extracting relevant information but also at localizing inconsistencies and incompleteness in the data, and at devising new data governance tasks.
9. Mastro
The Mastro-OBDA plugin
A DL-Lite reasoner for the OBDA context that takes an ontology with mappings to a relational database (defining a ‘virtual ABox’) and provides the following services:
Features
• Conjunctive Query Answering
• Epistemic Query Answering (EQL)
• Identification Constraints
• Epistemic Constraints
10. Protege, OBDA and Mastro plugins
Protege 4 and the OBDA Plugin
Features
• Ontology definition
• Datasource and mapping definition
• Interaction with the OBDA reasoner (CQs, epistemic queries, etc.)
12. MPS
Methodology
• The ontology was developed independently from the data sources.
• Tools used: interviews, questionnaires, and existing documentation.
• Developed over a period of 6 months.
15. Constraints
IDCs to impose complex business constraints
(id JuridicalCCC timestamp, relativeTo⁻ ∘ inGrouping⁻ ∘ inMembership ∘ ?Holding ∘ hasMembership⁻)
• At any given time, two distinct juridical CCCs cannot both comprise customers that are lead members (i.e., the holdings) of the same company group.
A total of 30 identification constraints.
17. Constraints
EQLCs to impose complex business constraints
EQLC( verify not exists (
  SELECT jurCCC.jccc
  FROM sparqltable(SELECT ?jccc
                   WHERE { ?jccc rdf:type 'JuridicalCCC' }) jurCCC
  WHERE jurCCC.jccc NOT IN (
    SELECT withGroupLeader.jccc
    FROM sparqltable(SELECT ?jccc, ?mem
                     WHERE { ?cus rdf:type 'Customer'.
                             ?cus :inMembership ?mem. ?mem rdf:type 'Holding'.
                             ?cus :inGrouping ?gr. ?gr :relativeTo ?jccc.
                             ?jccc rdf:type 'JuridicalCCC' }) withGroupLeader ) ) )
• Every juridical CCC must comprise a customer that is the holding member of a company group.
A total of 27 epistemic constraints.
18. OBDA Mappings
The Data Source
• Currently, the MPS applications that manage CCCs rely on a database of about 15 million tuples, stored in 12 relational tables under the IBM DB2 RDBMS.
Source name | Source description                                  | Source size (tuples)
GZ0001      | Data on customers                                   | 3,463,083
GZ0002      | Data on juridical connections between customers     | 157,280
GZ0003      | Data on guarantee connections between customers     | 1,270,333
GZ0004      | Data on economical connections between customers    | 104,033
GZ0005      | Data on corporation connections between customers   | 1,021,779
GZ0006      | Data on patrimonial connections between customers   | 809,321
GZ0007      | Data on company groups                              | 55,362
GZ0012      | Customers loan information                          | 5,966,948
GZ0015      | Data on monitoring and reporting procedures         | 1,243
GZ0101      | Data on membership of customers into CCCs           | 2,225,466
GZ0102      | Information on CCCs                                 | 663,656
GZ0104      | Data on bank credit coordinators for juridical CCCs | 38,457
20. OBDA Mappings
OBDA Mappings: Example
SELECT id_cluster, timestamp_val FROM GZ0102, GZ0007
WHERE GZ0102.validity_code = 'T' AND GZ0102.id_cluster <> 0
AND GZ0007.validity_code = 'T' AND GZ0007.id_group <> 0
AND GZ0102.id_cluster = GZ0007.id_group
⇝
JuridicalCCC(ccc(id_cluster, timestamp_val)),
timestamp(ccc(id_cluster, timestamp_val), timestamp_val)
If the tuple (243, 24052009112341) is in ans(body), i.e., in the answer to the SQL query above, then we have the following virtual ABox assertions:
JuridicalCCC(ccc(243, 24052009112341))
timestamp(ccc(243, 24052009112341), 24052009112341)
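The way a mapping turns SQL answer tuples into virtual ABox assertions can be sketched in a few lines of Python. This is only an illustration of the idea, not Mastro's implementation: the helper names `term` and `virtual_abox` are hypothetical, and the assertions are built as plain strings following the head of the mapping on this slide.

```python
from typing import Iterable, List, Tuple


def term(functor: str, *args: object) -> str:
    """Build an object term such as ccc(243, 24052009112341)."""
    return f"{functor}({', '.join(str(a) for a in args)})"


def virtual_abox(rows: Iterable[Tuple[int, int]]) -> List[str]:
    """Instantiate both head atoms of the mapping for every SQL answer tuple."""
    assertions = []
    for id_cluster, timestamp_val in rows:
        obj = term("ccc", id_cluster, timestamp_val)
        assertions.append(f"JuridicalCCC({obj})")
        assertions.append(f"timestamp({obj}, {timestamp_val})")
    return assertions


# The answer tuple used on the slide.
abox = virtual_abox([(243, 24052009112341)])
for assertion in abox:
    print(assertion)
```

The assertions are never stored: in a virtual ABox they exist only implicitly, which is why query answering later has to go through the mappings.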
22. Ontology usage
Verifying incompleteness in the data through query answering
Incompleteness of the data
Querying the ontology provides more answers than querying the database directly.
• To retrieve the identification codes of all company groups, DB operations use id_code from GZ0007.
• Over the ontology, we ask q(y) ← CompanyGroup(x), id_code(x, y).
• Mastro's answers indicate that GZ0007 is not the only relevant table.
23. Ontology usage
Verifying inconsistencies in the data through query answering
Inconsistency of the data
Using epistemic query answering to locate inconsistent tuples.
• Constraint: (functional inGrouping⁻)
• We can detect the violating tuples using:
SELECT testview.l, testview.c1, testview.c2
FROM sparqltable(SELECT ?l ?c1 ?c2
                 WHERE { ?c1 :inGrouping ?l. ?c2 :inGrouping ?l }) testview
WHERE testview.c1 <> testview.c2
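The same self-join idea can be tried out on a toy relation. The sketch below uses Python's `sqlite3` module rather than Mastro's `sparqltable` construct; the table `in_grouping` and its columns `customer` and `grp` are hypothetical stand-ins for the retrieved instances of inGrouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the pairs (customer, grouping) of inGrouping.
conn.execute("CREATE TABLE in_grouping (customer TEXT, grp TEXT)")
conn.executemany(
    "INSERT INTO in_grouping VALUES (?, ?)",
    [("c1", "g1"), ("c2", "g1"), ("c3", "g2")],
)

# Two different customers sharing a grouping violate (functional inGrouping⁻).
violations = conn.execute(
    """
    SELECT a.grp, a.customer, b.customer
    FROM in_grouping a JOIN in_grouping b ON a.grp = b.grp
    WHERE a.customer <> b.customer
    ORDER BY a.customer
    """
).fetchall()
print(violations)
```

Each violation is reported twice, once per ordering of the two customers, exactly as in the query on the slide.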
28. Query structure
Query Performance
Query answering in DL-Lite for OBDA in a nutshell
• Reformulate the query w.r.t. the TBox T
• Unfold the reformulation w.r.t. the mappings M
• Evaluate the resulting SQL over the data sources
Sources of complexity
• Reformulation: size of the reformulation
• Unfolding: size of the unfolding and query structure
Most critical aspect in the MPS scenario: query structure.
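The three steps can be illustrated with a deliberately tiny sketch. It handles only atomic concept inclusions and single-atom queries, which is far less than DL-Lite reformulation covers, and everything in it is an assumption made for illustration: the inclusion Holding ⊑ Customer, the tables `customers` and `holdings`, and the function names.

```python
import sqlite3

# Hypothetical TBox: each pair (sub, sup) states the inclusion sub ⊑ sup.
TBOX = [("Holding", "Customer")]

# Hypothetical mappings M: concept name -> SQL query returning its instances.
MAPPINGS = {
    "Customer": "SELECT id FROM customers",
    "Holding": "SELECT id FROM holdings",
}


def reformulate(concept, tbox):
    """Step 1: collect the concept and everything the TBox places below it."""
    result, frontier = [concept], [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in tbox:
            if sup == current and sub not in result:
                result.append(sub)
                frontier.append(sub)
    return result


def unfold(concepts, mappings):
    """Step 2: replace each concept by its mapping SQL and union the results."""
    return " UNION ".join(mappings[c] for c in concepts if c in mappings)


def answer(conn, concept):
    """Step 3: evaluate the unfolded SQL over the sources."""
    sql = unfold(reformulate(concept, TBOX), MAPPINGS)
    return sorted(row[0] for row in conn.execute(sql))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE holdings (id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO holdings VALUES (?)", [(3,)])

direct = sorted(row[0] for row in conn.execute("SELECT id FROM customers"))
via_ontology = answer(conn, "Customer")
print(direct, via_ontology)
```

Even in this toy setting the reformulation grows with the TBox and the unfolding grows with the number of mappings per concept, which is where the two sources of complexity come from.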
33. Query structure
Query Structure
In Mastro, query unfolding is done by means of partial evaluation and SQL views.
Given a virtual ABox defined by a database DB and the mappings M, and a query Q to be evaluated, we:
• Define a set of auxiliary predicates and SQL views.
• Associate these to T by means of a logic program P.
• Compute the partial evaluation (PE) of Q with respect to P.
• Translate the PE into SQL by means of the views.
35. Query structure
T -views
Example:
The mappings
m1: SELECT .... WHERE cd_tp = 503 ⇝ linkedTo(cus(idcus), link(linkid))
m2: SELECT .... WHERE cd_tp = 501 ⇝ linkedTo(cus(idcus), link(linkid))
The view for AuxlinkedTo
SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
FROM (SELECT .... WHERE cd_tp = 503) view_m1
UNION
SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
FROM (SELECT .... WHERE cd_tp = 501) view_m2
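To see what such a T-view returns, the view can be created over a toy table with Python's `sqlite3` module. The slide elides the bodies of m1 and m2, so the table `links(idcus, linkid, cd_tp)` and the view name `aux_linkedTo` below are assumptions made only to obtain something runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table standing in for the elided mapping bodies.
conn.execute("CREATE TABLE links (idcus INTEGER, linkid INTEGER, cd_tp INTEGER)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [(1, 10, 503), (2, 10, 501), (3, 11, 999)],
)

# One view per ontology predicate: the union of all mappings for linkedTo,
# with the object terms already built as strings.
conn.execute(
    """
    CREATE VIEW aux_linkedTo AS
    SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
    FROM (SELECT idcus, linkid FROM links WHERE cd_tp = 503) view_m1
    UNION
    SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
    FROM (SELECT idcus, linkid FROM links WHERE cd_tp = 501) view_m2
    """
)

rows = conn.execute(
    "SELECT term1, term2 FROM aux_linkedTo ORDER BY term1"
).fetchall()
print(rows)
```

The row with `cd_tp = 999` is covered by neither mapping, so it contributes nothing to the view.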
36. Query structure
T -views, unfolding
Program
linkedTo(x, y) ← AuxlinkedTo(x, y)
The query
q(x, y) ← linkedTo(x, z), linkedTo(y, z)
The partial evaluation
q(x, y) ← AuxlinkedTo(x, z), AuxlinkedTo(y, z)
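A minimal sketch of this partial evaluation step follows. It assumes, as in the program on this slide, that every rule simply renames a predicate while keeping its arguments, so resolution reduces to choosing one rule per body atom; general partial evaluation also has to unify terms, which is omitted here. The names `PROGRAM`, `partial_evaluation`, and `show` are hypothetical.

```python
from itertools import product

# Program P: ontology predicate -> auxiliary predicates that define it.
PROGRAM = {"linkedTo": ["AuxlinkedTo"]}

# Body of q(x, y) ← linkedTo(x, z), linkedTo(y, z)
QUERY_BODY = [("linkedTo", ("x", "z")), ("linkedTo", ("y", "z"))]


def partial_evaluation(body, program):
    """Return one rewritten body per combination of rule choices."""
    choices = [
        [(aux, args) for aux in program.get(pred, [pred])]
        for pred, args in body
    ]
    return [list(combo) for combo in product(*choices)]


def show(head, body):
    """Render a rule in the notation used on the slides."""
    atoms = ", ".join(f"{pred}({', '.join(args)})" for pred, args in body)
    return f"{head} ← {atoms}"


for rewritten in partial_evaluation(QUERY_BODY, PROGRAM):
    print(show("q(x, y)", rewritten))
```

With one rule per predicate the result is a single query; with two rules for linkedTo, the same function returns four bodies, one per combination.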
37. Query structure
T -views, unfolding
SELECT leadsto1.term1, leadsto2.term1 FROM (
  SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
  FROM (SELECT .... WHERE cd_tp = 503) view_m1
  UNION
  SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
  FROM (SELECT .... WHERE cd_tp = 501) view_m2
) AS leadsto1,
(
  SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
  FROM (SELECT .... WHERE cd_tp = 503) view_m1
  UNION
  SELECT 'cus(' || idcus || ')' AS term1, 'link(' || linkid || ')' AS term2
  FROM (SELECT .... WHERE cd_tp = 501) view_m2
) AS leadsto2
WHERE leadsto1.term2 = leadsto2.term2
40. Query structure
Performance of T -views
Poor performance, on the order of hours, even for trivial queries.
Culprit
Materialization of partial results in the DBMS query plans.
Solution
For relational DBMS queries, simpler is better.
41. Query structure
M-views
Example:
Mappings
m1: SELECT .... WHERE cd_tp = 503 ⇝ linkedTo(cus(idcus), link(linkid))
m2: SELECT .... WHERE cd_tp = 501 ⇝ linkedTo(cus(idcus), link(linkid))
The views:
Auxm1 = SELECT .... WHERE cd_tp = 503
Auxm2 = SELECT .... WHERE cd_tp = 501
45. Query structure
M-views, unfolding
SELECT 'cus(' || auxm11.idcus || ')' AS x, 'cus(' || auxm12.idcus || ')' AS y
FROM (SELECT .... WHERE cd_tp = 503) AS auxm11,
     (SELECT .... WHERE cd_tp = 503) AS auxm12
WHERE auxm11.linkid = auxm12.linkid
UNION
SELECT 'cus(' || auxm11.idcus || ')' AS x, 'cus(' || auxm21.idcus || ')' AS y
FROM (SELECT .... WHERE cd_tp = 503) AS auxm11,
     (SELECT .... WHERE cd_tp = 501) AS auxm21
WHERE auxm11.linkid = auxm21.linkid
UNION
SELECT 'cus(' || auxm21.idcus || ')' AS x, 'cus(' || auxm11.idcus || ')' AS y
FROM (SELECT .... WHERE cd_tp = 501) AS auxm21,
     (SELECT .... WHERE cd_tp = 503) AS auxm11
WHERE auxm21.linkid = auxm11.linkid
UNION
SELECT 'cus(' || auxm21.idcus || ')' AS x, 'cus(' || auxm22.idcus || ')' AS y
FROM (SELECT .... WHERE cd_tp = 501) AS auxm21,
     (SELECT .... WHERE cd_tp = 501) AS auxm22
WHERE auxm21.linkid = auxm22.linkid
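The shape of an M-view unfolding can be generated mechanically: one flat select-project-join block per ordered pair of mappings, with the object terms built only in the outermost SELECT. The sketch below does this for the query q(x, y) ← linkedTo(x, z), linkedTo(y, z) and runs the result with Python's `sqlite3` module. The table `links(idcus, linkid, cd_tp)`, the concrete mapping bodies, and the function name `unfold_m_views` are assumptions, since the slides elide the mapping SQL.

```python
import sqlite3
from itertools import product

# Hypothetical mapping bodies over an assumed table links(idcus, linkid, cd_tp).
M_VIEWS = {
    "m1": "SELECT idcus, linkid FROM links WHERE cd_tp = 503",
    "m2": "SELECT idcus, linkid FROM links WHERE cd_tp = 501",
}


def unfold_m_views(views):
    """Build one flat join block per ordered pair of mappings, joined by UNION."""
    blocks = []
    for sql_a, sql_b in product(views.values(), repeat=2):
        blocks.append(
            "SELECT 'cus(' || a.idcus || ')' AS x, 'cus(' || b.idcus || ')' AS y "
            f"FROM ({sql_a}) AS a, ({sql_b}) AS b "
            "WHERE a.linkid = b.linkid"
        )
    return "\nUNION\n".join(blocks)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (idcus INTEGER, linkid INTEGER, cd_tp INTEGER)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [(1, 10, 503), (2, 10, 501), (3, 11, 501)],
)

sql = unfold_m_views(M_VIEWS)
answers = sorted(conn.execute(sql).fetchall())
print(sql)
print(answers)
```

Because every block joins on the raw `linkid` column instead of on concatenated term strings, the DBMS can use its ordinary join machinery and indexes; the price is that the number of blocks grows with the product of the mappings per atom.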
48. Conclusions
MPS feedback
Useful results from the MPS point of view:
• Data integration
• Data quality
• Knowledge sharing
From the technical point of view:
• DBMS-level performance for on-the-fly OBDA is possible.
• Query tuning is mandatory.
• We pinpointed the query features needed for good performance and those that trigger bad performance.
49. Conclusions
Current and Future work
• Experiment with live access to the sources
• Extend the current experimentation to other data domains in MPS
Preview of the Mastro OBDA plugin and the OBDA plugin for Protege 4.0:
• http://www.dis.uniroma1.it/quonto/
• http://obda.inf.unibz.it