1. Knowledge Discovery in
Remote Access Databases
A thesis submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
at the Institute of Mathematics and Computer Science Informatics
University of Debrecen
By Zakaria Suliman Zubi
Supervised by Prof. Arató Mátyás and
Prof. Fazekas Gábor
2. 2
Overview of the Thesis
Part 1
Introduction to Knowledge Discovery in Databases (KDD) and Data
Mining (DM).
Goal of the Thesis Work.
Part 2
Remote Access KDD models.
Logical Foundation in Data Mining.
Mining the Discovered Association Rules.
Data Mining Query Languages.
Part 3
Knowledge Discovery Query Language (KDQL).
I-extended Databases (I-ED).
Implementation of KDQL.
Conclusion.
Appendix A, B.
3. 3
Introduction to KDD
and DM
KDD is the process of extracting interesting (non-trivial, implicit,
previously unknown and potentially useful) information or
patterns from data in large databases.
DM is a single step in the KDD process. It extracts trends or
patterns from raw databases and carefully and accurately
transforms them into useful and understandable information.
The introduction (chapter 1) of the thesis covers the history,
importance, appearances and tools of KDD and DM.
KDD process
Data cleaning: a phase in which noisy data and irrelevant data are removed from the collection.
Data integration: multiple data sources, often heterogeneous, may be combined in a common source.
Data selection: the data relevant to the analysis is decided on and retrieved from the data collection.
Data transformation: a phase in which the selected data is transformed into forms appropriate for the mining procedure.
Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
Pattern evaluation: strictly interesting patterns representing knowledge are identified based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user.
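The phases of the KDD process can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two data sources, the item names, and every function name below are hypothetical, and the "mining" step is reduced to counting item occurrences.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a noisy entry.
SOURCE_A = [("t1", "bread"), ("t1", "milk"), ("t2", "bread"), ("t2", None)]
SOURCE_B = [("t3", "bread"), ("t3", "milk"), ("t3", "receipt-footer")]


def clean(records):
    """Data cleaning: drop noisy (missing) entries."""
    return [(tid, item) for tid, item in records if item is not None]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def select(records, relevant_items):
    """Data selection: keep only the data relevant to the analysis."""
    return [(tid, item) for tid, item in records if item in relevant_items]


def transform(records):
    """Data transformation: build one item set per transaction."""
    baskets = {}
    for tid, item in records:
        baskets.setdefault(tid, set()).add(item)
    return baskets


def mine(baskets):
    """Data mining: count how many transactions contain each item."""
    return Counter(item for items in baskets.values() for item in items)


def evaluate(counts, n_transactions, min_support):
    """Pattern evaluation: keep items whose support reaches the threshold."""
    return {item: count / n_transactions
            for item, count in counts.items()
            if count / n_transactions >= min_support}


def run_pipeline(min_support=0.5):
    records = integrate(clean(SOURCE_A), clean(SOURCE_B))
    baskets = transform(select(records, {"bread", "milk"}))
    return evaluate(mine(baskets), len(baskets), min_support)


if __name__ == "__main__":
    # Knowledge representation: present the surviving patterns to the user.
    for item, support in sorted(run_pipeline().items()):
        print(f"{item}: support {support:.2f}")
```

In practice the phases are iterated rather than run once, but the one-way chain keeps the role of each phase visible.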
5. 5
Introduction to KDD
and DM
Access to the databases was established via Open Database
Connectivity (ODBC).
The databases are queried with Structured Query Language (SQL).
SQL allows users to define the data in a database and to
manipulate it (adding, deleting and retrieving records).
Data visualization is used to represent data mining results.
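The SQL operations named on this slide (defining data, then adding, deleting and retrieving it) can be illustrated briefly. The sketch below assumes Python's built-in sqlite3 module as a stand-in for an ODBC connection, and the purchase table with its columns is hypothetical.

```python
import sqlite3


def demo():
    """Define a table, add and delete rows, then retrieve what is left."""
    conn = sqlite3.connect(":memory:")  # stand-in for an ODBC data source
    try:
        # Data definition
        conn.execute(
            "CREATE TABLE purchase (transaction_id INTEGER, item TEXT)")
        # Adding
        conn.executemany(
            "INSERT INTO purchase VALUES (?, ?)",
            [(1, "bread"), (1, "milk"), (2, "bread")])
        # Deleting
        conn.execute("DELETE FROM purchase WHERE transaction_id = 2")
        # Retrieving
        return conn.execute(
            "SELECT transaction_id, item FROM purchase ORDER BY item"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```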
7. 7
Goal of the Thesis Work
In this thesis work, we investigated the problem of matching DM
problems with the set of DM algorithms that are suitable for solving them.
We studied the use of visualization, integrated with algorithmic
approaches, to tune the parameters of DM algorithms. The aim is to
support the parameter selection process, currently explored only by
algorithmic approaches, in a more systematic form than using default
values or setting parameter values without clues.
We introduced visualization to provide expressive information about
induced models and statistical entities, and to support the interactive and
dynamic exploration of induced models for DM.
13. 13
Logical Foundation in Data
Mining (LFDM)
Advantages of Logical Foundation in Data Mining
Expressiveness: First-order logic can represent more complex concepts than
traditional attribute-value languages.
Readability: Formulae are easier to read than decision trees or sets of linear
equations.
Background knowledge: Background knowledge can grow during
discovery time, for example in time series.
Multiple tables: Multiple database tables can be handled without explicit and
expensive joins.
Deductive databases: Logical discovery engines can be transparently linked to
relational databases via deductive databases.
Disadvantages of Logical Foundation in Data Mining
Language complexity: First-order hypotheses are usually constructed through heavy
search (which puts the feasibility of discovery in question).
Database access times: Checking a single candidate might involve heavy querying.
Number handling: Logical approaches to discovery usually suffer from poor number
handling capabilities.
14. 14
Logical Foundation in
Data Mining (LFDM)
Translating first order queries into SQL
A natural-language question such as “find all employees who earn more
than 1000000 HUF a year and more than their manager” can be written as
the first-order rule:
expensive_employee(Name) ← employee(Name, Salary1, Manager),
    Salary1 > 1000000, employee(Manager, Salary2, _),
    Salary1 > Salary2
which translates into the SQL query:
SELECT employee_0.NAME
FROM employee employee_0, employee employee_1
WHERE employee_0.SALARY > 1000000 AND
      employee_1.NAME = employee_0.MANAGER AND
      employee_0.SALARY > employee_1.SALARY
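The translated query can be tried on a small table. The following is a minimal sketch that assumes Python's built-in sqlite3 module in place of an ODBC source; the column types and the sample rows are invented for illustration.

```python
import sqlite3

# The SQL translation of the expensive_employee rule, as on the slide.
QUERY = """
SELECT employee_0.NAME
FROM employee employee_0, employee employee_1
WHERE employee_0.SALARY > 1000000 AND
      employee_1.NAME = employee_0.MANAGER AND
      employee_0.SALARY > employee_1.SALARY
"""

# Hypothetical sample data: (NAME, SALARY in HUF per year, MANAGER).
SAMPLE = [
    ("Anna", 1500000, "Bela"),   # above 1000000 and above her manager
    ("Bela", 1200000, "Csaba"),  # above 1000000 but below his manager
    ("Csaba", 2000000, None),    # no manager, so the self-join finds no match
    ("Dora", 900000, "Bela"),    # below the salary threshold
]


def expensive_employees(rows):
    """Load rows into an in-memory table and run the translated query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE employee (NAME TEXT, SALARY INTEGER, MANAGER TEXT)")
        conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
        return sorted(name for (name,) in conn.execute(QUERY))
    finally:
        conn.close()


if __name__ == "__main__":
    print(expensive_employees(SAMPLE))
```

Each literal of the rule body becomes one alias of the employee table, and each shared variable (Manager) becomes a join condition.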
16. 16
Mining the Discovered
Association Rules
Association Rules
What is an association rule? Let I = {i1, i2, …, in} be the set of all
possible items, and let D, the task-relevant data, be a set of
transactions. Each transaction T = {ia, ib, …, it} is a set of items with
T ⊂ I.
An association rule is of the form:
P → Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = Ø.
P → Q holds in D with support s and
P → Q has confidence c in the transaction set D, where
Support(P → Q) = Probability(P ∪ Q)
Confidence(P → Q) = Probability(Q | P)
Example: “In 80% of the cases when people buy bread, they also
buy milk”
Bread ==> milk / 80%
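The two measures can be computed directly from a list of transactions. The sketch below uses hypothetical basket data chosen so that the bread and milk rule reaches 80% confidence; the function names are illustrative and a non-empty transaction list is assumed.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, body, head):
    """Estimate Probability(head | body) as support(body ∪ head) / support(body)."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0  # the body never occurs, so the rule has no evidence
    return support(transactions, set(body) | set(head)) / body_support


# Hypothetical baskets: bread occurs in 5 of 6, bread with milk in 4 of 6.
BASKETS = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread"},
    {"milk"},
]

if __name__ == "__main__":
    s = support(BASKETS, {"bread", "milk"})
    c = confidence(BASKETS, {"bread"}, {"milk"})
    print(f"bread ==> milk: support {s:.0%}, confidence {c:.0%}")
```

On this data the rule bread ==> milk has support 4/6 and confidence 4/5.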
17. 17
Mining the Discovered
Association Rules
Mining the Association Rules
What is mining association rules? Finding frequent patterns,
associations, correlations, or causal structures among sets of items or
objects in transaction databases, relational databases, and other
information repositories, and selecting the most "interesting" rules based on
their confidence factors. A rule P → Q holds in D with support s and has
confidence c in the transaction set D.
Applications: Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, etc.
Examples:
“Body → Head [support, confidence]”
buys(x, “bread”) → buys(x, “milk”) [6%, 65%]
major(x, “CS”) ∧ takes(x, “Database”) → grade(x, “5”) [1%, 75%]
18. 18
Mining the Discovered
Association Rules
Mining the Association Rules cont.
How do we Mine Association Rules?
Input:
A database of transactions.
Each transaction is a list of items (e.g. the items purchased by a
customer in one visit).
Find all rules that associate the presence of one set of items with
that of another set of items.
Example: “98% of people who purchase tires and auto
accessories also get automotive services done”
There are no restrictions on the number of items in the body of the
rule.
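One common way to find such rules is a level-wise (Apriori-style) search: first collect the frequent item sets, then split each of them into a body and a head. The slides do not say which algorithm the thesis uses, so the sketch below is an assumption, and the transactions are invented to echo the tires example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for all item sets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while level:
        current = {}
        for candidate in level:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        size += 1
        # Join frequent sets into candidates that are one item larger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune candidates that have an infrequent subset.
        level = [c for c in candidates
                 if all(frozenset(sub) in current
                        for sub in combinations(c, size - 1))]
    return frequent


def generate_rules(frequent, min_confidence):
    """Split each frequent item set into (body, head, support, confidence)."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for body in combinations(sorted(itemset), r):
                body = frozenset(body)
                conf = s / frequent[body]  # every subset is frequent too
                if conf >= min_confidence:
                    found.append((body, itemset - body, s, conf))
    return found


# Hypothetical transactions echoing the tires example.
TRANSACTIONS = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "service"},
    {"accessories"},
    {"tires", "accessories", "service"},
]

if __name__ == "__main__":
    freq = frequent_itemsets(TRANSACTIONS, min_support=0.5)
    for body, head, s, conf in generate_rules(freq, min_confidence=0.9):
        print(sorted(body), "==>", sorted(head), f"[{s:.0%}, {conf:.0%}]")
```

Because the body may contain any number of items, the search space grows quickly; the pruning step is what keeps the level-wise search tractable on larger data.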
20. 20
Data Mining Query
Language (DMQL)
What is Data Mining Query Language?
Data Mining Query Language (DMQL): supports the iterative nature of the
KDD process. Once the discovered knowledge has been presented to the
user, the evaluation measures can be enhanced, the
mining can be further refined, new data can be selected or further
transformed, or new data sources can be integrated, in order to get
different, more appropriate results.
21. 21
Data Mining Query
Language (DMQL)
Types of patterns discovered by DMQL
Characterization: Data characterization is a summarization of the general
features of objects in a target class, and produces what are called characteristic
rules.
Discrimination: Data discrimination produces what are called discriminant
rules and is basically the comparison of the general features of objects
between two classes, referred to as the target class and the contrasting class.
Association analysis: Association analysis is the discovery of what are
commonly called association rules.
Classification: Classification analysis is the organization of data in given
classes.
Prediction: Prediction has attracted considerable attention given the potential
implications of successful forecasting in a business context.
Clustering: Clustering is the organization of data in classes whose labels are
not known in advance.
Outlier analysis: Outliers are data elements that cannot be grouped in a given
class or cluster.
Evolution and deviation analysis: Evolution and deviation analysis pertain
to the study of time-related data that changes over time.
23. 23
Knowledge Discovery Query
Language (KDQL)
What is KDQL in principle?
Knowledge Discovery Query Language (KDQL) is a KDD query language proposed for the ODBC_KDD (2)
model. It mines association rules in databases (e.g. a DBMS or relational database) and then visualizes
the discovered results in different chart forms (2D and 3D). KDQL has not yet been implemented as such.
KDQL combines KDD technology and data visualization in response to the need for a query
language for DM tasks. This leads us to develop a language tool that can handle both approaches in one session.
[Figure: diagram with the labels Request, Data, Data to Visualize, Visualization Tool, and Database Management System (DBMS)]
24. 24
Knowledge Discovery
Query Language (KDQL)
Visualization techniques for DMQL
[Figure: diagram with the labels Data Mining Query Language (DMQL), Visualization Tools, and Database Management System (DBMS)]
26. 26
I-Extended Databases (I-ED)
Motivation
I-extended database: a database that, in addition to data, also
contains intensionally defined generalizations about the data. An
I-extended database has properties similar to those of an
inductive database. We formalize this concept and show how it can be
used throughout the whole DM process, thanks to the closure property
of the framework.
The basic message of the I-extended database is as follows:
An I-extended database consists of a normal database associated with a
subset of patterns from a class of patterns, and an evaluation
function that tells how the patterns occur in the data.
An I-extended database can be queried (in principle) using just
normal relational algebra or SQL, with the added property of being
able to refer to the values of the evaluation function on the
patterns.
Modeling KDD processes as a sequence of queries on an I-extended
database opens up opportunities for reasoning about and optimizing these
processes.
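The idea of querying patterns together with their evaluation values can be sketched with ordinary SQL. The example below is only one possible reading, not the thesis implementation: it assumes sqlite3 as the database, single-item bodies and heads, and hypothetical table and column names, with the evaluation function materialized as support and confidence columns.

```python
import sqlite3


def build_i_extended_db(purchases, patterns):
    """Store the data plus a pattern table carrying evaluation values.

    purchases: non-empty list of (tid, item); patterns: list of (body, head).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchase (tid INTEGER, item TEXT)")
    conn.execute(
        "CREATE TABLE pattern "
        "(body TEXT, head TEXT, support REAL, confidence REAL)")
    conn.executemany("INSERT INTO purchase VALUES (?, ?)", purchases)

    baskets = {}
    for tid, item in purchases:
        baskets.setdefault(tid, set()).add(item)
    n = len(baskets)

    # Evaluation function: how each pattern occurs in the data.
    for body, head in patterns:
        body_hits = sum(1 for b in baskets.values() if body in b)
        both_hits = sum(1 for b in baskets.values()
                        if body in b and head in b)
        conf = both_hits / body_hits if body_hits else 0.0
        conn.execute("INSERT INTO pattern VALUES (?, ?, ?, ?)",
                     (body, head, both_hits / n, conf))
    return conn


def strong_patterns(conn, min_support, min_confidence):
    """Plain SQL that refers to the evaluation values of the patterns."""
    return conn.execute(
        "SELECT body, head FROM pattern "
        "WHERE support >= ? AND confidence >= ? ORDER BY body, head",
        (min_support, min_confidence)).fetchall()


# Hypothetical data and candidate patterns.
PURCHASES = [(1, "bread"), (1, "milk"), (2, "bread"), (2, "milk"),
             (3, "bread"), (4, "milk")]
PATTERNS = [("bread", "milk"), ("milk", "bread"), ("bread", "jam")]

if __name__ == "__main__":
    db = build_i_extended_db(PURCHASES, PATTERNS)
    print(strong_patterns(db, 0.5, 0.6))
    db.close()
```

Since the result of such a query is again a relation over patterns, further queries can be applied to it, which is the closure property the slide refers to.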
28. 28
Implementation of KDQL
Motivation of KDQL
The background of KDQL comes from the Structured Query Language
(SQL), since several extensions to SQL have been proposed to
serve as a Data Mining Query Language (DMQL).
SQL + DM (rules) is the appropriate form for this task at the user
interface.
DM (rules) is based on association rules for interacting with the
I-extended database. The association rules are obtained by the use of KDQL
rules, and the results are represented graphically in 2D and 3D
charts.
30. 30
Implementation of KDQL
Example of KDQL
For example, the rule {cheese, coke} ==> bread
states that if cheese and coke are bought together in a
transaction, bread is also bought in the same transaction. In
this association rule, the body is a set of items and the head is a
single item. The rule {cheese, coke} ==> cheese is not
interesting because it is a tautology: if the head is
implied by the body, the rule does not provide new
information. This problem has the following formulation:
KDQL RULE Associations AS
SELECT DISTINCT 1..n item AS BODY,
                1..1 item AS HEAD,
                SUPPORT, CONFIDENCE
FROM Purchase
GROUP BY transaction
EXTRACTING RULES WITH SUPPORT: 0.1,
                      CONFIDENCE: 0.2
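What the statement above asks for can be emulated in a few lines. This is a hedged sketch of the intended semantics only: it is written in Python rather than KDQL, it enumerates bodies by brute force, and the Purchase rows and the function name are hypothetical.

```python
from itertools import combinations


def extract_rules(purchase_rows, min_support=0.1, min_confidence=0.2):
    """Emulate the statement: bodies of 1..n items, heads of exactly 1 item."""
    # GROUP BY transaction: one item set per transaction.
    groups = {}
    for transaction, item in purchase_rows:
        groups.setdefault(transaction, set()).add(item)
    n = len(groups)
    items = sorted({item for _, item in purchase_rows})

    def support(itemset):
        return sum(1 for g in groups.values() if itemset <= g) / n

    rules = []
    for size in range(1, len(items)):            # BODY cardinality 1..n
        for body in combinations(items, size):
            body = frozenset(body)
            body_support = support(body)
            if body_support == 0:
                continue
            for head in items:                   # HEAD cardinality 1..1
                if head in body:
                    continue                     # skip tautologies
                rule_support = support(body | {head})
                conf = rule_support / body_support
                if rule_support >= min_support and conf >= min_confidence:
                    rules.append(
                        (tuple(sorted(body)), head, rule_support, conf))
    return rules


# Hypothetical Purchase table: (transaction, item).
PURCHASE = [
    (1, "cheese"), (1, "coke"), (1, "bread"),
    (2, "cheese"), (2, "coke"), (2, "bread"),
    (3, "cheese"), (3, "coke"),
    (4, "bread"),
]

if __name__ == "__main__":
    for body, head, s, c in extract_rules(PURCHASE):
        print(set(body), "==>", head, f"[{s:.2f}, {c:.2f}]")
```

The check that skips heads already present in the body is what excludes tautologies such as {cheese, coke} ==> cheese.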
31. 31
Implementation of
KDQL
KDQL rules operator
< KDQL_RULES_OP > := KDD RULES < TableName > AS
SELECT DISTINCT < BodyDescr >, < HeadDescr >
[,SUPPORT] [,CONFIDENCE]
[WHERE < WhereClause >]
FROM < FromList > [WHERE < WhereClause >]
GROUP BY < Attribute > < AttributeList >
[HAVING < HavingClause >]
[CLUSTER BY < Attribute > < AttributeList >
[HAVING < HavingClause >]]
EXTRACTING RULES WITH SUPPORT: < real >,
CONFIDENCE: < real >
< BodyDescr > := [< Cardinality_Shape >] < AttrName > < AttrList > AS BODY
/* default cardinality shape for the Body: 1..n */
< HeadDescr > := [< Cardinality_Shape >] < AttrName > < AttrList > AS HEAD
/* default cardinality shape for the Head: 1..1 */
< Cardinality_Shape > := < Number > .. (< Number > | n)
< AttributeList > := {< AttributeName >, < AttributeName >, … < AttributeName >}
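The cardinality part of the grammar is small enough to illustrate with a parser. The sketch below handles only the cardinality production, <Number> .. (<Number> | n), and the stated defaults; the class and function names are hypothetical and it is not a parser for the whole KDQL operator.

```python
import re
from typing import NamedTuple, Optional


class Cardinality(NamedTuple):
    lower: int
    upper: Optional[int]  # None stands for the unbounded "n"


BODY_DEFAULT = "1..n"  # default cardinality for the body
HEAD_DEFAULT = "1..1"  # default cardinality for the head

_SHAPE = re.compile(r"^\s*(\d+)\s*\.\.\s*(\d+|n)\s*$")


def parse_cardinality(text, default):
    """Parse '<Number> .. (<Number> | n)', using default if text is empty."""
    if text is None or not text.strip():
        text = default
    match = _SHAPE.match(text)
    if match is None:
        raise ValueError(f"not a cardinality: {text!r}")
    lower = int(match.group(1))
    upper = None if match.group(2) == "n" else int(match.group(2))
    if upper is not None and upper < lower:
        raise ValueError(f"upper bound below lower bound: {text!r}")
    return Cardinality(lower, upper)


if __name__ == "__main__":
    print(parse_cardinality("1..n", BODY_DEFAULT))
    print(parse_cardinality(None, HEAD_DEFAULT))
```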
33. 33
Conclusion
KDQL is a part of the
ODBC_KDD (2) model.
KDQL calls the I-extended
database via an ODBC connection.
The I-extended database retrieves all the
requested information from
traditional databases via
ODBC.
KDQL was implemented to
handle DM tasks with
visualization.
Visualization techniques can be
used to visualize
interesting association rules
discovered from the databases.
34. 34
Results
The major results of the thesis work are summarized as follows.
Proposing a new remote access KDD model, called ODBC_KDD (2), that
delivers results with a more detailed description, such as
visualization, scripts, statistical inferences and more.
Proposing and implementing a database concept, called the I-extended
database (I-ED), to be maintained and accelerated by the use of the
Knowledge Discovery Query Language (KDQL).
Proposing, within the ODBC_KDD (2) model, a query language called
KDQL. KDQL was designed to interact with the conceptual database
called the I-extended database. KDQL is a new KDD query
language that can discover association rules.
Using visualization tools in KDQL to represent the retrieved
results in different 2D and 3D visual forms such as pies, points, lines
and bars.
Using the support and confidence of data items to locate the important
association rules in the databases, through an I-extended database
established by KDQL.
36. 36
Appendix A, B
Appendix A: the proposed syntax of the
KDQL statement rules.
Appendix B: images from the program.
37. 37
Dedications and Acknowledgments
• First I want to thank my wife Emaan Zubi for her understanding and for
making the last steps of writing this dissertation enjoyable, and also my kids
Yhaia, Mohamed and Suliman for being nice kids while I was doing this
work.
• My parents: my father, Suliman Zubi, and my mother, Memona Yousef.
• I would like to thank Dr. Fazekas Gábor for accepting me as a Ph.D.
student under his supervision. I would also like to thank him for his continuous
encouragement, confidence and support, for reviewing the text of this thesis,
and for sharing with me his knowledge and love of this field.
• My senior supervisor Prof. Dr. Arató Mátyás for his encouragement.
• Dr. Kormos Janos, my teacher and friend, for his insightful comments,
advice and help.
• Dr. Bajalinov Erik for the frequent constructive discussions regarding
programming in Delphi.
• My deepest thanks to Dr. Varga Katalin and Dr. Várterész Magdolna for
refereeing my Ph.D. dissertation work.
• Mr. Basheer Nassain, the Libyan student advisor, and Mr. Khalid Zintaney
of the financial office in the Libyan Embassy, Budapest, for their support.
• All people in this committee.
• Finally I want to thank all my friends and people in the Institute of
Mathematics and Informatics, Debrecen University.