Mining attributes

Mining Attribute Lifecycle
to Predict Faults
and Incompleteness
in Database Applications
Presented by:-
Sandra Alex
Roll no: 40

Outline
 INTRODUCTION
 ATTRIBUTE LIFECYCLE CHARACTERIZATION
 PROPOSED APPROACH
 EXPERIMENT
 PREDICTION
 RELATED WORK
 CONCLUSION
 REFERENCES
Page  2

Introduction
 Each attribute  a value created initially via
Page  3
insertion
 Referenced, updated or deleted
 These occurrences of events, associated with
the states  attribute lifecycle.
 Behaviour of an attribute value from its
insertion to final deletion
 Extract the attribute lifecycle out of a database
application

Introduction
 Our empirical studies discover,
faults and incompleteness in db applications 
highly associated with attribute lifecycle.
 Learned prediction model  applied in
development and maintenance of database
applications
 Experiments conducted on PHP systems
Page  4

Attribute Lifecycle Characterization
Page  5
 for each attribute, a value is
i. created -> insertion
ii. referenced -> selection
iii. updated -> updating
iv. deleted -> deletion
 These occurrences of events are associated
with states , to constitute the attribute
lifecycle.

Page  6
State transition diagram of the attribute lifecycle

 programs sustain attribute lifecycle by 4
database operations:
INSERT, SELECT, UPDATE and DELETE
 formulate the following attributes to
Page  7
characterize its lifecycle:
i. Create (C) -> value of attribute is inserted.
ii. Null Create (NC) -> inserted without
value
iii. Control Update (COU) -> not influenced
by existing attribute value & inputs from
user and database.

Page  8
iv. Overriding Update (OVU)
-> not influenced by existing value.
v. Cumulating Update (CMU)
-> influenced by existing value.
vi. Delete (D) : attribute is deleted as a result
of the deletion of the record
vii. Use (U): value is used to support the
insertion, updating or deletion of other
database attributes or output to the
external environment.

 Hence,we characterize the attribute lifecycle by
Page  9
a seven element vector
[m1, m2, m3, m4, m5, m6, m7],
where m1, m2,m3, m4, m5, m6, m7
denote whether there is database operation
performed on the attribute is of type
C, NC, COU, OVU, CMU, D and U respectively.

Proposed Approach
A. Mining Attribute Lifecycle
Page  10

Proposed Approach
B. Extracting Attribute Lifecycle
Page  11
Characterization Data
1) Query Extraction
<?php
function exec_query($q)
{ return mysql_query($q); }
$query = "SELECT username FROM users WHERE ";
if (isset($_POST[‘usertype’]))
{ $query .= "usertype =" .$_POST[‘usertype’];//use usertype }
else
{ $query .= "userid=" .$_POST[‘userid’]; //use userid }
exec_query($query);
?>
 query can be different in runtime.

Proposed Approach
 control flow graph(CFG) for the code
Page  12

generates a set of basis paths
encounter a query execution function like
“mysql_query”, -> definition of every variable used
is retrieved
literals -> replaced by their actual values
variables whose values are not statically known ->
replaced by placeholders
parts of query strings with replaced values ->
Page  13
connected
Proposed Approach

Proposed Approach
2) Analysis of Attribute Lifecycle
queries are extracted  analysed to obtain the
attribute lifecycle patterns
 by using an SQL grammar parser
CREATE TABLE : first parsed to collect the schema
of table
VIEW: mapping of attributes between the view &
backup table
Page  14

Proposed Approach
 SELECT :
o query is parsed, table aliases restored by the
actual table names, & attributes are identified
o “*” -> schema of table consulted to get all
attribute names
o “count(*)” -> not considered, characterized as
“USE”
Page  15

Proposed Approach
 INSERT:
o table name is identified first
o no column list -> all the attributes inserted.
o column list -> attributes are extracted
o “auto incremental” or have not null default values
-> treated as inserted by the query
o These attributes are characterized as “Create”
o explicitly assigned to null -> marked as “Null
Create”.
Page  16

 UPDATE :
o collect attribute names
o identify the update pattern
o attribute assignments in the SET clause are
Page  17
separated
o analyse the value string to determine the update
characteristic
o either COU, OVU or CMU
o attributes used in the WHERE clause -> marked
as “Use”

DELETE :
oidentify table name
omark all the attributes as “Delete
oattributes in the WHERE clause as “Use”
For each query,
attribute names in it -> put into a collection -> create
Page  18
attribute lifecycle vectors.

3) Generation of Attribute Lifecycle Vectors
For example,
if there is at least one “Create” characteristic for one
attribute,
Page  19
othe first element of the vector 1
o otherwise 0
no operation on an attribute, all elements set to 0
we generate vectors for all attributes in a database
application.

A. Data Collection
seed faults in open-source database applications to
train our model
we chose systems -> should have very few faults
associated with attribute lifecycle.
Page  20
• source code -> publicly available
• application size -> considerable (transaction
number and attribute number)
• mature enough -> very few faults associated
with attribute lifecycle.
Experiment

Experiment
 “batavi” a web-based e-commerce system;
 “webERP”, an accounting & business management system;
 “FrontAccounting”, a professional web based system
 “OpenBusinessNetwork”, application designed for business;
 “SchoolMate”, solution for school administrations.
Page  21

Experiment
attribute lifecycle have a number of common patterns
those which do not follow -> cause errors
we seeded the following common errors
1) Missing function: attributes are provided, function is not
Page  22
catered for during the program design
2) Inconsistency design: correcting the result of a transaction
that updates an attribute by “cumulative update” using
“overriding update”
3) Redundant function: new programs for different types of
operations
4) No Update: new attributes without any update functions

Experiment
B. Experimental Design
three classifiers to learn the prediction model
1) C4.5 classifier
decision tree classification algorithm
uses normalized information gain to split data
information gain of one attribute A
Page  23

Experiment
Info(D) is defined as:
pi : probability that one instance belongs to class i
 In training process,
Page  24
each time the classifier chooses one attribute
with the highest normalized information gain
to split the data until all attributes are used.

Experiment
2) Naïve Bayes classifier
 generative probabilistic model
 Bayes’ theorem:
 assumed that attributes are independent, we have
 For categorical value, the probability P(xi|Ci) is the
proportion of the instances in class Ci which have
attribute xi.
Page  25

Experiment
3) SVM classifier
 Support Vector Machine (SVM)
 based on the statistical learning theory
 trains the classification model by
Page  26
searching the hyper plane
which maximizes the margin between classes

Experiment
C. Model Training
 attributes from the five systems labelled to create the training
set
 manually checked, labelled each attribute as “missing
function” ,“inconsistency design” ,“redundant function, "no
update” or “normal”
Page  27

Experiment
model was trained by three classifiers
for evaluation of trained models  10-fold cross
validation on training set
set was randomly partitioned into 10 folds
each time 9 folds of them as training set
Page  28
and 1 fold was testing set
we computed the average measurements

Experiment
D. Assessing Performance
 probability of detection pd=(tp/(tp+fn))
 probability of false alarm pf=(fp/(fp+tn))
 precision pr=(tp/(tp+fp))
Page  29
 pd  1 pf  0

Page  30
• pd>87
• pf<1.81
• SVM>C4.5
• C4.5>naïve
Bayes
• SVM:
• pd>95%
• pf<0.07%

Prediction
applied prediction model on four database applications ->
to predict whether there are attributes with missing function,
inconsistency design, redundant function and no update.
applied our prediction model learned by SVM to these
systems and counted the attributes that were predicted
Page  31

Prediction
designers could take corresponding actions to
modify these design faults and incompleteness
further, we manually validate all the attributes
predicted
Of all the 107 attributes, 98 are confirmed to be real
prediction precision is 91.59%
Page  32

Conclusion
 For each attribute, we extract the set of attributes
Page  33
that can be extracted from code of database
applications to characterize its lifecycle.
 a characterization vector is formed
 Data mining technique is applied to mine the
attribute lifecycle using the data collected from
database open-source systems.
 We seed errors in mature systems and simulate
the design faults to train our dataset for our
classification method.
 Five types of labelled attributes are obtained.

Conclusion
 Fault and completeness prediction model is then built.
 In our experiment, the model achieved 98.04%
precision and 98.25% recall on average for SVM
 We also applied the model on four database open
source applications to predict
 conduct more comprehensive experiments on a larger
set of systems ,further validate the merits of the
proposed approach
Page  34

References
[1] N. Nagappan and T. Ball, “Static Analysis Tools as Early Indicators
of Pre-release Defect Density,” in Proceedings of the 27th International
Conference on Software Engineering. ACM, 2005, pp. 580–586.
[2] A. Nikora and J. Munson, “Developing Fault Predictors for Evolving
Software Systems,” in Proceedings of Ninth International Software
Metrics Symposium, 2003. IEEE, 2003, pp. 338–350.
[3] A. Watson, T. McCabe, and D. Wallace, “Structured testing: A testing
methodology using the cyclomatic complexity metric,” NIST special
Publication, vol. 500, no. 235, pp. 1–114, 1996.
[4] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan, “Using artificial
anomalies to detect unknown and known network intrusions,”
Knowledge and Information Systems, vol. 6, no. 5, pp. 507–527, 2004.
Page  35

Mining attributes

Recomendados

Recomendados

Más contenido relacionado

Similar a Mining attributes

Similar a Mining attributes (20)

Último

Último (20)

Mining attributes