3. Introduction
Each attribute value is created initially via
insertion,
then referenced, updated, or deleted.
These occurrences of events are associated with
the states of the attribute lifecycle:
the behaviour of an attribute value from its
insertion to its final deletion.
We extract the attribute lifecycle out of a database
application.
4. Introduction
Our empirical studies discover that
faults and incompleteness in database applications are
highly associated with the attribute lifecycle.
The learned prediction model is applied in the
development and maintenance of database
applications.
Experiments are conducted on PHP systems.
5. Attribute Lifecycle Characterization
For each attribute, a value is
i. created -> insertion
ii. referenced -> selection
iii. updated -> updating
iv. deleted -> deletion
These occurrences of events are associated
with states, to constitute the attribute
lifecycle.
7. Attribute Lifecycle Characterization
Programs sustain the attribute lifecycle via 4
database operations:
INSERT, SELECT, UPDATE and DELETE.
We formulate the following characteristics to
characterize an attribute's lifecycle:
i. Create (C) -> a value of the attribute is inserted.
ii. Null Create (NC) -> the attribute is inserted without
a value.
iii. Control Update (COU) -> the new value is influenced
neither by the existing attribute value nor by inputs
from the user or database.
8. Attribute Lifecycle Characterization
iv. Overriding Update (OVU)
-> not influenced by the existing value (though it may
depend on user or database inputs).
v. Cumulating Update (CMU)
-> influenced by the existing value.
vi. Delete (D) -> the attribute value is deleted as a result
of the deletion of the record.
vii. Use (U) -> the value is used to support the
insertion, updating or deletion of other
database attributes, or is output to the
external environment.
9. Attribute Lifecycle Characterization
Hence, we characterize the attribute lifecycle by
a seven-element vector
[m1, m2, m3, m4, m5, m6, m7],
where m1, ..., m7 denote whether a database
operation of type
C, NC, COU, OVU, CMU, D or U, respectively,
is performed on the attribute.
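As a concrete illustration, here is a minimal sketch (not the authors' implementation) of this encoding; the example attribute and its observed operations are hypothetical:

```python
# Map the set of observed characteristics to the 0/1 vector [m1, ..., m7].
CHARACTERISTICS = ["C", "NC", "COU", "OVU", "CMU", "D", "U"]

def lifecycle_vector(observed):
    """1 if a database operation of that type touches the attribute, else 0."""
    return [1 if c in observed else 0 for c in CHARACTERISTICS]

# e.g. a hypothetical attribute that is inserted, overridden by updates, and read:
print(lifecycle_vector({"C", "OVU", "U"}))  # [1, 0, 0, 1, 0, 0, 1]
```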
13. Proposed Approach
1) Extraction of Queries:
generate a set of basis paths;
when a query execution function like
“mysql_query” is encountered -> the definition of every
variable used is retrieved
literals -> replaced by their actual values
variables whose values are not statically known ->
replaced by placeholders
parts of query strings with replaced values ->
connected
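A toy illustration of this reconstruction step, assuming a simplified representation in which each query argument is a list of literal and variable parts; the function and example values are hypothetical:

```python
def reconstruct_query(parts, known_values):
    """Concatenate query parts; statically unknown variables become placeholders."""
    out = []
    for kind, value in parts:
        if kind == "literal":
            out.append(value)                # string literal, kept as-is
        elif value in known_values:          # definition retrieved on this path
            out.append(str(known_values[value]))
        else:                                # value not statically known
            out.append("?")                  # placeholder
    return "".join(out)

# e.g. mysql_query("SELECT price FROM items WHERE id=" . $id) with $id unknown:
parts = [("literal", "SELECT price FROM items WHERE id="), ("var", "$id")]
print(reconstruct_query(parts, {}))  # SELECT price FROM items WHERE id=?
```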
14. Proposed Approach
2) Analysis of Attribute Lifecycle
The extracted queries are analysed with an SQL grammar
parser to obtain the attribute lifecycle patterns.
CREATE TABLE -> first parsed to collect the schema
of the table
VIEW -> mapping of attributes between the view and
its base table
15. Proposed Approach
SELECT:
o the query is parsed, table aliases are restored to the
actual table names, and attributes are identified
o “*” -> the schema of the table is consulted to get all
attribute names
o “count(*)” -> not considered, i.e., not characterized
as “Use”
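A toy sketch of the SELECT handling (the actual approach uses an SQL grammar parser; the regex, schema and queries here are simplified assumptions):

```python
import re

SCHEMA = {"items": ["id", "name", "price"]}   # collected from CREATE TABLE

def select_attributes(query):
    """Return the attributes referenced by a simple one-table SELECT."""
    m = re.match(r"SELECT\s+(.+?)\s+FROM\s+(\w+)", query, re.IGNORECASE)
    cols, table = m.group(1).strip(), m.group(2)
    if cols == "*":                            # consult the schema for all names
        return SCHEMA[table]
    if re.fullmatch(r"count\(\*\)", cols, re.IGNORECASE):
        return []                              # count(*) is not considered
    return [c.strip() for c in cols.split(",")]

print(select_attributes("SELECT * FROM items"))        # ['id', 'name', 'price']
print(select_attributes("SELECT count(*) FROM items")) # []
```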
16. Proposed Approach
INSERT:
o the table name is identified first
o no column list -> all the attributes are inserted
o column list -> the listed attributes are extracted
o “auto increment” attributes or those with not-null
default values -> treated as inserted by the query
o these attributes are characterized as “Create”
o attributes explicitly assigned null -> marked as “Null
Create”
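A toy sketch of the INSERT handling under the same simplifying assumptions; the schema and column values are hypothetical:

```python
SCHEMA = {"items": ["id", "name", "price"]}   # from CREATE TABLE

def insert_characteristics(table, columns, values):
    """Mark inserted attributes as Create (C) or Null Create (NC)."""
    cols = columns or SCHEMA[table]           # no column list -> all attributes
    chars = {}
    for col, val in zip(cols, values):
        chars[col] = "NC" if val.upper() == "NULL" else "C"
    # A fuller sketch would also consult the schema here to mark
    # auto-increment / not-null-default attributes as inserted.
    return chars

print(insert_characteristics("items", ["id", "name"], ["1", "NULL"]))
# {'id': 'C', 'name': 'NC'}
```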
17. UPDATE:
o collect the attribute names
o identify the update pattern
o attribute assignments in the SET clause are
separated
o the value string is analysed to determine the update
characteristic: either COU, OVU or CMU (see the
sketch below)
o attributes used in the WHERE clause -> marked
as “Use”
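A toy sketch of classifying one SET assignment; the heuristics here (a substring test for the existing value, "?" for statically unknown inputs) are simplifications of the paper's analysis:

```python
def update_characteristic(attr, value_expr):
    """Classify one SET assignment as COU, OVU, or CMU."""
    if attr in value_expr:        # new value depends on the existing value
        return "CMU"              # cumulating update, e.g. price = price + 1
    if "?" in value_expr:         # depends on user/database input (placeholder)
        return "OVU"              # overriding update, e.g. price = ?
    return "COU"                  # control update, e.g. deleted = 1

print(update_characteristic("price", "price + 1"))  # CMU
print(update_characteristic("price", "?"))          # OVU
print(update_characteristic("deleted", "1"))        # COU
```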
18. DELETE:
o identify the table name
o mark all the attributes as “Delete”
o attributes in the WHERE clause -> marked as “Use”
For each query,
the attribute names in it are put into a collection, which is
used to create the attribute lifecycle vectors.
19. 3) Generation of Attribute Lifecycle Vectors
For example,
if there is at least one “Create” characteristic for one
attribute,
o the first element of the vector is set to 1
o otherwise it is set to 0
If there is no operation on an attribute, all elements are
set to 0.
We generate vectors for all attributes in a database
application.
20. Experiment
A. Data Collection
We seed faults in open-source database applications to
train our model.
We chose systems by the following criteria:
• source code -> publicly available
• application size -> considerable (transaction
number and attribute number)
• mature enough -> very few faults associated
with attribute lifecycle
21. Experiment
“batavi” -> a web-based e-commerce system
“webERP” -> an accounting & business management system
“FrontAccounting” -> a professional web-based system
“OpenBusinessNetwork” -> an application designed for business
“SchoolMate” -> a solution for school administrations
22. Experiment
Attribute lifecycles have a number of common patterns;
attributes which do not follow them -> may cause errors.
We seeded the following common errors:
1) Missing function: attributes are provided, but the function is not
catered for during the program design
2) Inconsistency design: correcting the result of a transaction
that updates an attribute by “cumulating update” using an
“overriding update”
3) Redundant function: new programs for different types of
operations
4) No Update: new attributes without any update functions
23. Experiment
B. Experimental Design
We use three classifiers to learn the prediction model.
1) C4.5 classifier
a decision tree classification algorithm
uses normalized information gain to split the data
the information gain of one attribute A:
Gain(A) = Info(D) − Info_A(D)
24. Experiment
Info(D) is defined as:
Info(D) = − Σ_i p_i log2(p_i)
pi : the probability that one instance belongs to class i
In the training process,
each time the classifier chooses the attribute
with the highest normalized information gain
to split the data, until all attributes are used.
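A minimal sketch of the entropy and information-gain computation that C4.5 relies on; the data values below are hypothetical:

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Information gain from splitting the labels on one attribute's values."""
    n = len(labels)
    split = {}
    for v, y in zip(values, labels):
        split.setdefault(v, []).append(y)
    return info(labels) - sum(len(p) / n * info(p) for p in split.values())

labels = ["normal", "normal", "faulty", "faulty"]
values = [0, 0, 1, 1]            # e.g. the "Create" element of the vector
print(gain(values, labels))      # 1.0: this split separates the classes fully
```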
25. Experiment
2) Naïve Bayes classifier
a generative probabilistic model
Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Assuming that attributes are independent, we have
P(X|Ci) = Π_k P(xk|Ci)
For a categorical value, the probability P(xi|Ci) is the
proportion of the instances in class Ci which have
attribute value xi.
26. Experiment
3) SVM classifier
Support Vector Machine (SVM)
based on statistical learning theory
trains the classification model by
searching for the hyperplane
which maximizes the margin between classes
27. Experiment
C. Model Training
Attributes from the five systems were labelled to create the
training set:
we manually checked and labelled each attribute as “missing
function”, “inconsistency design”, “redundant function”, “no
update” or “normal”.
28. Experiment
The model was trained by the three classifiers.
For evaluation of the trained models, 10-fold cross
validation was performed on the training set:
the set was randomly partitioned into 10 folds;
each time, 9 of the folds served as the training set
and 1 fold was the testing set;
we computed the average measurements (see the
sketch below).
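A minimal sketch of this evaluation setup using scikit-learn stand-ins (DecisionTreeClassifier with the entropy criterion as a C4.5 approximation); the vectors and labels are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = [[1, 0, 0, 1, 0, 0, 1],
     [1, 0, 1, 0, 0, 1, 1],
     [0, 1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 1, 0, 1]] * 10            # 40 hypothetical lifecycle vectors
y = ["normal", "normal", "no update", "normal"] * 10  # hypothetical labels

for name, clf in [("C4.5-like tree", DecisionTreeClassifier(criterion="entropy")),
                  ("Naive Bayes", BernoulliNB()),
                  ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=10)        # 10-fold cross validation
    print(name, scores.mean())                        # average measurement
```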
29. Experiment
D. Assessing Performance
probability of detection: pd = tp / (tp + fn)
probability of false alarm: pf = fp / (fp + tn)
precision: pr = tp / (tp + fp)
Ideally, pd should be close to 1 and pf close to 0.
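A small sketch computing the three measures from hypothetical confusion counts:

```python
def measures(tp, fn, fp, tn):
    pd = tp / (tp + fn)      # probability of detection (recall)
    pf = fp / (fp + tn)      # probability of false alarm
    pr = tp / (tp + fp)      # precision
    return pd, pf, pr

# hypothetical counts; ideally pd -> 1 and pf -> 0
print(measures(tp=45, fn=5, fp=3, tn=47))  # (0.9, 0.06, 0.9375)
```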
31. Prediction
We applied the prediction model to four database applications
to predict whether there are attributes with missing function,
inconsistency design, redundant function or no update.
We applied the model learned by SVM to these
systems and counted the attributes that were predicted as such.
32. Prediction
Designers could take corresponding actions to
fix these design faults and incompleteness.
Further, we manually validated all the predicted
attributes.
Of all the 107 attributes, 98 are confirmed to be real;
the prediction precision is 91.59%.
33. Conclusion
For each attribute, we extract the set of characteristics
from the code of a database
application to characterize its lifecycle;
a characterization vector is formed.
A data mining technique is applied to mine the
attribute lifecycle using the data collected from
open-source database systems.
We seed errors in mature systems to simulate
design faults and build the training dataset for our
classification method.
Five types of labelled attributes are obtained.
34. Conclusion
A fault and incompleteness prediction model is then built.
In our experiment, the model achieved 98.04%
precision and 98.25% recall on average for SVM.
We also applied the model to four open-source database
applications to predict faults and incompleteness.
Future work: conduct more comprehensive experiments on a
larger set of systems to further validate the merits of the
proposed approach.