A brief introduction about ontology-based data access (shortly OBDA) and its core implementation. I presented too a recent simple benchmark between -ontop- and Semantika---two most available software for OBDA framework---in term of query performance (including details in the appendix section). The slides were presented for Friday Research Meeting in Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons by Attribution 3.0
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Ontology-based data access: why it is so cool!
1. Ontology-Based Data
Access: Why It is So Cool!
Josef Hardi
josef.hardi@stanford.edu
September 4, 2015
Ontology-Based Data Access is a concept developed by Diego Calvanese and
Mariano Rodriguez-Muro in KRDB Research Centre at Free University of Bozen-
Bolzano
2. Outline
● What is Ontology-based Data Access, or OBDA?
○ Motivation
○ System Black Box
○ Process Illustration
● Project -ontop- and Quest
● Experiment
○ Query Answering Performance
○ -ontop- vs Semantika
● Conclusion
● Q&A
3. Acknowledgement
Parts of the slides in this presentation are taken from
tutorial or lecture slides by:
Diego Calvanese,
Mariano Rodriguez-Muro, and
Martin Rezk
5. Think a scenario
Data Layer
Data Service
conceptual view
Image source: (various sources)
What is Ontology-based Data Access?
6. Data Access Bottleneck
Image source: Rezk, Martin. Ontologies Ontop Databases http://www.slideshare.net/MartnRezk/slides-swat4-ls
What is Ontology-based Data Access?
7. Query Answering
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Cancer type is:
● NSCLC is when Cell_type is
false,
● SCLC is when Cell_type is
true.
Cancer stage is:
● I, II, III, IIIa, IIIb, IV for
NSCLC, corr. cStage: 1 - 6,
● Limited and Extensive for
SCLC, corr. cStage: 7 and 8.
There is “hidden logic” inside
the table that is specifically
used by the application. Not
for querying the data!
8. Query Answering
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Name cStage
John 6
Bill 4
RESULT
select Name, cStage
from tbl_patient+2015
where Cell_type = false
and cStage >= 4;
9. Can we do it better?
Show me all the patients’ name and stage
status that have large tumor with at least in
a minimum stage IIIa.
Query Answering
10. Bridge the semantics
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Cancer type is:
● NSCLC is when Cell_type is
false,
● SCLC is when Cell_type is
true.
Cancer stage is:
● I, II, III, IIIa, IIIb, IV for
NSCLC,
● Limited and Extensive for
SCLC.
hasStage
ISA
name
ISA
ISA
hasNeoplasm
SNOMED-CT
*SCLC = Small Cell Lung Cancer, NSCLC = Non-Small Cell Lung Cancer
Query Answering
11. OBDA Answering
● (Data) Sources: represents the external and independent
resources. Existing organization assets.
● Ontology: provides a unified common vocabulary. The
conceptual view of the underlying data
● Mappings: relates the terms in ontology to a set of SQL
views.
Image source: Rezk, Martin. Ontologies Ontop Databases http://www.slideshare.net/MartnRezk/slides-swat4-ls
Query Answering
12. OBDA Answering Black Box
● Rewriting: Create a new query which is the expanded
version of the original query, using all the defined
inclusion assertions in the ontology.
● Unfolding: Substitute each part in the expanded query
with corresponding SQL views from the given mappings.
● Evaluation: Execute the complete SQL to a target RDBMS.
Image source: Kontchakov, Roman, et.al. Ontology-based Data Access: Ontop of Databases. http://www.dcs.bbk.ac.uk/~roman/papers/ISWC13.pdf
Query Answering
13. OBDA Answering Illustration
Q: Show me all the Person in the hospital?
Q’: Show me
all the Person UNION
all the Nurse UNION
all the Doctor UNION
all the Patient UNION
anyone who has
Neoplasm in the hospital?
Rewritten
14. Look where is the source(s)
(No source)
Q’: Show me
all the Person UNION
all the Nurse UNION
all the Doctor UNION
all the Patient UNION
anyone who has
Neoplasm
in the hospital?
Get the list from table Nurse
Get the list from table Doctor
Get the list from table Patient
Get the list from table Cancer
Patient 2015
M
M
M
M
M
OBDA Answering Illustration
15. Substitute with SQL views
Q’: Show me
all the Person UNION
select NurseId from tbl_nurse UNION
select doc_id from tbl_doctor UNION
select pid from tbl_patient UNION
select PatientId from tbl_patient+2015
in the hospital?
OBDA Answering Illustration
Unfolded
16. Execute the SQL
select NurseId from tbl_nurse
UNION
select doc_id from tbl_doctor
UNION
select pid from tbl_patient
UNION
select PatientId from tbl_patient+2015
OBDA Answering Illustration
Evaluated
17. 42!
(Computational) Price to Pay
Query answering in OBDA setting:
● PTIME in the size of ontology (efficiently
tractable)
● AC0
in the size of the data (very efficiently
tractable)
● NP-Complete in the size of query
(exponential)
*Tractable problem: there exists an algorithm that will eventually terminate in a
reasonable amount of time and return you the result.
OBDA Answering Illustration
18.
19. -ontop- Project
● A platform to query relational databases using
SPARQL language,
● The implementation started in 2010,
● Supports several database systems, like: MySQL,
PostgreSQL, H2, SQL Server, Oracle, IBM DB2.
● Distributed under open-source license.
● It is currently being developed within the context of
EU Optique project.
● Fantastic add-ons: Efficient rewriting, Query
optimization, Transitive query, Rules entailment,
Cross-linked datasets.
-ontop-
23. Berlin SPARQL Benchmark (BSBM)
● A benchmark suite built around e-commerce
domain.
○ A set of products is offered by different vendors and
customers are posting product reviews.
● Consists of 12 different queries, emulating
the search and navigation pattern of a
consumer looking for a product.
● A Query-Mix consists of 25 querying actions
that simulate a product search scenario.
● No inference.
Experiment
24. BSBM-100
● Dataset of 100 million triples,
● Transformed into relational db schema:
offer > 5.7 million rows
person > 147 thousand rows
producer > 5 thousand rows
product > 288 thousand rows
productfeature > 47 thousand rows
productfeatureproduct > 5.5 million rows
producttype > 2 thousand rows
producttypeproduct > 1.4 million rows
review > 2.8 million rows
vendor > 2 thousand rows
Experiment
25. Test Databases
● MySQL - v5.6
○ Vanilla
○ Optimized
■ CREATE INDEX
■ OPTIMIZE TABLE - ANALYZE
● PostgreSQL - v9.4.4
○ Vanilla
○ Optimized
■ CREATE INDEX
■ VACUUM TABLE - ANALYZE
Experiment
26. Test Machine
● MacBook Pro
○ OS X Yosemite 64-bit
○ Java 8 (build 1.8.0_51-b16)
○ Intel Core i7 3 GHz
○ Memory 16 GB
○ Flash storage
○ Direct connection - no network cost
Experiment
27. Benchmark Flow
for each obda-endpoint do:
for each dbms do:
for each dbms-variant do:
start endpoint;
start dbms;
loop 2:
run ‘benchmark -runs 100 -w 10’;
stop dbms;
stop endpoint;
Experiment
29. Conclusion
● OBDA offers a non-invasive solution to
existing (legacy) database system for
better data access service.
● A lot of interesting topics can be harvested
from OBDA use case scenarios.
○ Health and clinical domain perhaps?
● OBDA performance relies heavily on the
efficiency of the underlying data
infrastructure (both HW and SW).
32. Query Answering over Database
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
34. Query Answering over Ontology
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
36. Query Answering via Rewriting
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
52. Ontop SQL Creation
SELECT
3 AS `titleQuestType`, NULL AS `titleLang`, QVIEW1.`title` AS `title`,
10 AS `publishDateQuestType`, NULL AS `publishDateLang`, CAST
(QVIEW1.`publishDate` AS CHAR(8000) CHARACTER SET utf8) AS
`publishDate`
FROM review QVIEW1
WHERE
(QVIEW1.`product` = '62033') AND
(QVIEW1.`producer` = '1245') AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL AND
QVIEW1.`title` IS NOT NULL AND
QVIEW1.`publishDate` IS NOT NULL
53. Semantika SQL Creation
SELECT `OBDA_VIEW1`.`title` AS `title`,
`OBDA_VIEW1`.`publishDate` AS `publishDate`
FROM `bsbm100`.`review` AS `OBDA_VIEW1`
WHERE `OBDA_VIEW1`.`publisher` IS NOT NULL AND
`OBDA_VIEW1`.`product` = 62033 AND
`OBDA_VIEW1`.`publishDate` IS NOT NULL AND
`OBDA_VIEW1`.`nr` IS NOT NULL AND
`OBDA_VIEW1`.`title` IS NOT NULL AND
`OBDA_VIEW1`.`producer` = 1245
55. Ontop SQL Creation
SELECT
1 AS `reviewQuestType`, NULL AS `reviewLang`, CONCAT('http://www4.wiwiss.fu-berlin.
de/bizer/bsbm/v01/instances/dataFromRatingSite', REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(CAST(QVIEW1.`publisher` AS CHAR
(8000) CHARACTER SET utf8),' ', '%20'),'!', '%21'),'@', '%40'),'#', '%23'),'$', '%24'),'&', '%26'),'*', '%42'), '(', '%28'), ')', '%29'), '[', '%5B'), ']', '%5D'),
',', '%2C'), ';', '%3B'), ':', '%3A'), '?', '%3F'), '=', '%3D'), '+', '%2B'), '''', '%22'), '/', '%2F'), '/Review', REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(CAST(QVIEW1.`nr` AS CHAR(8000) CHARACTER SET utf8),' ', '%20'),'!', '%21'),'@', '%40'),'#', '%23'),'$', '%24'),'&', '%26'),'*', '%42'), '(', '%28'),
')', '%29'), '[', '%5B'), ']', '%5D'), ',', '%2C'), ';', '%3B'), ':', '%3A'), '?', '%3F'), '=', '%3D'), '+', '%2B'), '''', '%22'), '/', '%2F')) AS `review`,
3 AS `titleQuestType`, NULL AS `titleLang`, QVIEW1.`title` AS `title`,
10 AS `publishDateQuestType`, NULL AS `publishDateLang`, CAST(QVIEW1.`publishDate` AS CHAR(8000) CHARACTER SET utf8) AS
`publishDate`,
4 AS `rating1QuestType`, NULL AS `rating1Lang`, CAST(QVIEW1.`rating1` AS CHAR(8000) CHARACTER SET utf8) AS `rating1`,
4 AS `rating2QuestType`, NULL AS `rating2Lang`, CAST(QVIEW2.`rating2` AS CHAR(8000) CHARACTER SET utf8) AS `rating2`
FROM (
review QVIEW1
LEFT OUTER JOIN review QVIEW2
ON (QVIEW1.`nr` = QVIEW2.`nr`) AND
(QVIEW1.`publisher` = QVIEW2.`publisher`) AND
QVIEW2.`rating2` IS NOT NULL AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL
)
WHERE
QVIEW1.`title` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL AND
QVIEW1.`publishDate` IS NOT NULL AND
(QVIEW1.`product` = '62033') AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`rating1` IS NOT NULL AND
(QVIEW1.`producer` = '1245')
56. Semantika SQL Creation
SELECT CONCAT('http://www4.wiwiss.fu-berlin.
de/bizer/bsbm/v01/instances/dataFromRatingSite{1}/Review{2}',' : ','"',
`OBDA_VIEW1`.`publisher`,'" "',`OBDA_VIEW1`.`nr`,'"') AS `review`,
`OBDA_VIEW1`.`title` AS `title`,
`OBDA_VIEW1`.`publishDate` AS `publishDate`,
`OBDA_VIEW1`.`rating1` AS `rating1`,
`OBDA_VIEW1`.`rating2` AS `rating2`
FROM `bsbm100_optimized`.`review` AS `OBDA_VIEW1`
WHERE `OBDA_VIEW1`.`publisher` IS NOT NULL AND
`OBDA_VIEW1`.`product` = 62033 AND
`OBDA_VIEW1`.`publishDate` IS NOT NULL AND
`OBDA_VIEW1`.`nr` IS NOT NULL AND
`OBDA_VIEW1`.`title` IS NOT NULL AND
`OBDA_VIEW1`.`rating1` IS NOT NULL AND
`OBDA_VIEW1`.`producer` = 1245