Venkata Lavanya Korada, Avala Atchyuta Rao / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 6, November- December 2012, pp.142-147
Analyzing & Identifying CFD’s using the Concepts of Data Mining
Venkata Lavanya Korada*1, Avala Atchyuta Rao*2
*1 M.Tech Student, Gokul Institute of Technology & Science, Bobilli, INDIA
*2 Asst. Professor, CSE Dept, Gokul Institute of Technology & Science, Bobilli, INDIA
Abstract
Conditional functional dependencies (CFDs) are a recent extension of functional dependencies (FDs). They capture patterns of semantically related constants and can be used as rules for cleaning relational data. It is often unrealistic to rely solely on human experts to design CFDs via an expensive and long manual process. For CFD-based cleaning methods to be effective, techniques must be in place that can automatically discover or learn CFDs from sample data. The discovery problem is already quite difficult for traditional FDs, and it is more difficult for CFDs, because mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is for constant CFD discovery; it explores the connection between minimal constant CFDs and closed and free patterns. The other two algorithms are developed for discovering general CFDs. The second algorithm, referred to as CTANE, extends TANE to discover general CFDs; it is based on an attribute-set/pattern tuple lattice and discovers minimal CFDs only. The third algorithm, FastCFD, discovers general CFDs by applying a depth-first search strategy rather than the levelwise approach. Together, these algorithms provide a set of promising tools that help reduce manual effort in the design of data-quality rules, for users to choose from for different applications. They help make CFD-based cleaning a practical data quality tool.

Keywords – Privacy, Privelets, Data Publishing, and Range count Queries.

I. INTRODUCTION
Functional dependencies are the subject of much ongoing investigation, and conditional functional dependencies are a recent extension of functional dependencies. This paper investigates the discovery of conditional functional dependencies (CFDs), which support patterns of semantically related constants and can be used as rules for cleaning relational data. However, finding CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from sample relations. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed itemsets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. The other two algorithms are developed for discovering general CFDs. The first of these, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-itemset mining to reduce the search space. Our experimental results demonstrate the following.
(i) CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery.
(ii) CTANE works well when a given sample relation is large, but it does not scale well with the arity of the relation.
(iii) FastCFD is far more efficient than CTANE when the arity of the relation is large.
As mentioned, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery. Levelwise algorithms may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

II. PREVIOUS WORK
The discovery problem has been studied for FDs for two decades [1], [3] for database design, data archiving, OLAP and data mining. It was first
investigated in [2], which shows that the problem is inherently exponential in the arity |R| of the schema R of sample data r. One of the best-known methods for FD discovery is TANE [3], a levelwise algorithm [2] that searches an attribute-set containment lattice and derives FDs with k + 1 attributes from sets of k attributes, with pruning based on FDs generated in previous levels. TANE takes linear time in the size |r| of input sample r, and works well when the arity |R| is not very large. The algorithms of [6], [7], [8] follow a similar levelwise approach. However, the levelwise algorithms may take exponential time in |R| even if the output is not exponential in |R|. In light of this, another algorithm, referred to as FastFD [4], explores the connection between FD discovery and the problem of finding minimal covers of hypergraphs, and employs a depth-first strategy to search for minimal covers. It takes (almost) linear time in the size of the output, i.e., in the size of the FD cover. It scales better than TANE when the arity is large, but it is more sensitive to the size |r|. Indeed, it is in O(|r|^2 log |r|) time when considering data complexity (|R| is assumed constant). There has also been a bottom-up approach [5] based on techniques for learning general logical descriptions in a hypotheses space. As shown in [3], TANE outperforms the algorithm of [5].

Recently two sets of algorithms have been developed for discovering CFDs [1], [2]. For a fixed traditional FD fd, [1] showed that it is NP-complete to find useful patterns that, together with fd, make quality CFDs. They provide efficient heuristic algorithms for discovering patterns from samples w.r.t. a fixed FD. An algorithm for discovering CFDs, including both traditional FDs and their associated patterns, was presented in [2]; it is an extension of TANE.

Constant CFD discovery is closely related to association rule mining (e.g., [2]) and, in particular, closed and free itemset mining (e.g., [3], [24]). With 100% confidence, an association rule (X, tp) ⇒ (A, a) is a constant CFD (X → A, (tp ‖ a)), where tp is a constant pattern over attributes X and a is a value in the domain of attribute A. Better still, there is an intimate connection between left-reduced constant CFDs and non-redundant association rules, which can be found by computing closed itemsets and free itemsets.

The potential applications of CFDs in data cleaning highlight the need for further investigations of CFD discovery. (1) As remarked earlier, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery. (2) Levelwise algorithms [2] may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. (3) A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

III. SYSTEM ANALYSIS & DESCRIPTION
As discussed in Section II, levelwise algorithms may not perform well on sample relations of large arity, and the techniques developed for (non-redundant) association rule mining can be capitalized on for CFD discovery. In light of these considerations we provide the following modules for CFD discovery: one for discovering constant CFDs, and the other two for general CFDs.
(Module: 1) We propose a notion of minimal CFDs based on both the minimality of attributes and the minimality of patterns. Intuitively, minimal CFDs contain neither redundant attributes nor redundant patterns. Furthermore, we consider frequent CFDs that hold on a sample dataset r, namely, CFDs in which the pattern tuples have a support in r above a certain threshold. Frequent CFDs allow us to accommodate unreliable data with errors and noise. Our algorithms find minimal and frequent CFDs to help users identify quality cleaning rules from a possibly large set of CFDs that hold on the samples.
(Module: 2) Our first algorithm, referred to as CFDMiner, is for constant CFD discovery. We explore the connection between minimal constant CFDs and closed and free patterns. Based on this, CFDMiner finds constant CFDs by leveraging a recent mining technique, which mines closed itemsets and free itemsets in parallel following a depth-first search scheme.
(Module: 3) Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k + 1 of the lattice (i.e., when each set at the level consists of k + 1 attributes) with pruning based on those at level k. CTANE discovers minimal CFDs only.
(Module: 4) Our third algorithm, referred to as FastCFD, discovers general CFDs by employing a depth-first search strategy instead of the levelwise approach. It is a nontrivial extension of FastFD mentioned above, by mining pattern tuples. A novel pruning technique is introduced by FastCFD, by leveraging constant CFDs found by CFDMiner. As opposed to CTANE, FastCFD does not take exponential time in the arity of sample data when a canonical cover of CFDs is not exponentially large.
(Module: 5) Our fifth and final contribution is an experimental study of the effectiveness and efficiency of our algorithms, based on real-life data (Wisconsin breast cancer and chess datasets from UCI) and synthetic datasets generated from data scraped from the Web. We evaluate the scalability of these methods by varying the sample size, the arity of relation schema, the active domains of attributes, and the support threshold for frequent CFDs. We find that constant CFD discovery (using CFDMiner) is often 3 orders of magnitude faster than general CFD discovery (using CTANE or FastCFD). We also find that FastCFD scales well with the arity: it is up to 3 orders of magnitude faster than CTANE when the arity is between 10 and 15, and it performs well when the arity is greater than 30; in contrast, CTANE cannot run to completion when the arity is above 17. On the other hand, CTANE is more sensitive to the support threshold and outperforms FastCFD when the threshold is large and the arity is of a moderate size. We also find that our pruning techniques via itemset mining are effective: they improve the performance of FastCFD by 5-10 fold and make FastCFD scale well with the sample size. These results provide a guideline for when to use CFDMiner, CTANE or FastCFD in different applications. These modules provide a set of promising tools to help reduce manual effort in the design of data-quality rules, for users to choose from for different applications. They help make CFD-based cleaning a practical data quality tool.

Fig.1: Inter-operational Sequence Diagram for the Framework (labels: System, Provider, EB Authority; Add User Details; Add Password; Search User Details; View User Details; Enter Meter No or Area Code or Phone No)
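
The notions used by the modules above (a CFD with a pattern tuple, whether it holds on a sample, and its support) can be illustrated with a minimal sketch. The representation is an assumption made for illustration, not the paper's implementation: a relation is a list of dicts, a pattern maps attributes to a constant or to the wildcard `_`, and support is taken to be the number of tuples that match the pattern on both sides. The names `matches`, `holds`, `support` and the sample data are hypothetical.

```python
WILDCARD = "_"  # unnamed variable in a pattern tuple (assumed notation)


def matches(row, pattern, attrs):
    """True if row agrees with every constant of the pattern on attrs."""
    return all(
        pattern.get(a, WILDCARD) == WILDCARD or row[a] == pattern[a]
        for a in attrs
    )


def holds(rows, lhs, rhs, pattern):
    """Check whether the CFD (lhs -> rhs, pattern) holds on rows.

    Only tuples matching the pattern on lhs are constrained. Such tuples
    must match the pattern on rhs, and tuples that agree on lhs must
    agree on rhs.
    """
    seen = {}
    for row in rows:
        if not matches(row, pattern, lhs):
            continue
        if not matches(row, pattern, [rhs]):
            return False  # a single tuple violates a constant on rhs
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False  # two tuples agree on lhs but differ on rhs
    return True


def support(rows, lhs, rhs, pattern):
    """Number of tuples matching the pattern on lhs and rhs (assumed definition)."""
    return sum(1 for row in rows if matches(row, pattern, list(lhs) + [rhs]))


# Hypothetical sample: country code, area code, city.
SAMPLE = [
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "01", "AC": "908", "city": "MH"},
    {"CC": "01", "AC": "908", "city": "NYC"},
]

constant_cfd = {"CC": "44", "AC": "131", "city": "EDI"}
plain_fd = {"CC": WILDCARD, "AC": WILDCARD, "city": WILDCARD}
variable_cfd = {"CC": "44", "AC": WILDCARD, "city": WILDCARD}

print(holds(SAMPLE, ["CC", "AC"], "city", constant_cfd))
print(support(SAMPLE, ["CC", "AC"], "city", constant_cfd))
print(holds(SAMPLE, ["CC", "AC"], "city", plain_fd))
print(holds(SAMPLE, ["CC", "AC"], "city", variable_cfd))
```

In this sample the plain FD fails because of the two tuples with country code 01, while the CFDs conditioned on country code 44 still hold, which is the kind of rule that a pattern tuple makes expressible.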
IV. SYSTEM DESIGN & IMPLEMENTATION
The component design diagram helps to model the physical aspects of an object-oriented software system; for the proposed framework, it illustrates the architecture of the dependencies between service provider and consumer.
A sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur. This allows the specification of simple runtime scenarios in a graphical manner.

Fig.2: Inter-operational Use Activity diagram for the framework (labels: Provider Login; Check: Yes / No; Unauthorized Person; Add User Details; View Tables; Change Password; View User Full Details)
V. RESULTS
Fig.3: Inter-operational class diagram for Framework (classes: Provider, EB Authority; operations: View Users Details, Add Users Details, Change Password, Enter Meter Number or Area Code or Phone Number, View User Details (table wise), View User Full Details; methods: String Name(), getName(), setAttribute(), getAttribute(), Set Session(), Get Session(), String toString())
CTANE Algorithm
CTANE is a levelwise algorithm for discovering minimal, k-frequent (variable and constant) CFDs. It is an extension of the TANE algorithm [3] for discovering FDs.
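
The levelwise idea that CTANE inherits from TANE can be sketched for plain FDs. This is a deliberately simplified illustration, not CTANE or TANE: it ignores pattern tuples, support thresholds, and TANE's partition-based checks, and it skips FDs with an empty left-hand side. Candidates with k + 1 left-hand-side attributes are examined only after those with k, and a candidate is pruned when a smaller left-hand side already determines the same attribute, so only minimal FDs are reported. The names `fd_holds`, `levelwise_minimal_fds` and the sample data are hypothetical.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """True if tuples that agree on lhs always agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def levelwise_minimal_fds(rows, attributes, max_lhs=None):
    """Enumerate minimal FDs level by level (level k = k attributes on the left)."""
    found = []  # (frozenset of lhs attributes, rhs attribute)
    if max_lhs is None:
        max_lhs = len(attributes) - 1
    for k in range(1, max_lhs + 1):
        for lhs in combinations(attributes, k):
            lhs_set = frozenset(lhs)
            for rhs in attributes:
                if rhs in lhs_set:
                    continue  # trivial dependency
                # Prune with results of previous levels: a proper subset
                # of lhs already determines rhs, so lhs is not minimal.
                if any(r == rhs and l < lhs_set for l, r in found):
                    continue
                if fd_holds(rows, lhs, rhs):
                    found.append((lhs_set, rhs))
    return found


# Hypothetical sample: country code, area code, city.
ROWS = [
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "44", "AC": "141", "city": "GLA"},
    {"CC": "01", "AC": "908", "city": "MH"},
    {"CC": "01", "AC": "131", "city": "NYC"},
]

for lhs, rhs in levelwise_minimal_fds(ROWS, ["CC", "AC", "city"]):
    print(sorted(lhs), "->", rhs)
```

The number of candidate left-hand sides grows exponentially with the number of attributes, which is the behaviour the text attributes to levelwise algorithms on relations of large arity.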
Fig.4: To add the user Details
Fig.5: Welcome Screen for service provider
Fig.6: To find the information of the user
Fig.7: User complete information

CONCLUSION
We have developed and implemented three algorithms for discovering minimal CFDs: CFDMiner for mining minimal constant CFDs, a class of CFDs important for both data cleaning and data integration; CTANE for discovering general minimal CFDs based on the levelwise approach; and FastCFD for discovering general minimal CFDs based on a depth-first search strategy, and a novel optimization technique via closed-itemset mining. As suggested by our experimental results, these provide a set of tools for users to choose for different applications. When only constant CFDs are needed, one can simply use CFDMiner without paying the price of mining general CFDs. When the arity of a sample dataset is large, one should opt for FastCFD. When k-frequent CFDs are needed for a large k, one could use CTANE.

REFERENCES
[1] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Information and Computation, vol. 197, no. 1-2, pp. 90–121, 2005.
[2] J. Wijsen, “Database repairing using updates,” TODS, vol. 30, no. 3, pp. 722–768, 2005.
[3] L. Bravo, W. Fan, and S. Ma, “Extending dependencies with conditions,” in VLDB, 2007.
[4] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[5] S. Lopes, J.-M. Petit, and L. Lakhal, “Efficient discovery of functional dependencies and Armstrong relations,” in EDBT, 2000.
[6] T. Calders, R. T. Ng, and J. Wijsen, “Searching for dependencies at multiple abstraction levels,” TODS, vol. 27, no. 3, pp. 229–260, 2003.
[7] R. S. King and J. J. Legendre, “Discovery of functional and approximate functional dependencies in relational databases,” JAMDS, vol. 7, no. 1, pp. 49–59, 2003.
[8] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga, “Cords: Automatic discovery of correlations and soft functional dependencies,” in SIGMOD, 2004.
[9] H. Mannila and H. Toivonen, “Levelwise search and borders of theories in knowledge discovery,” Data Min. Knowl. Discov., vol. 1, no. 3, pp. 259–289, 1997.
[10] Gartner, “Forecast: Data quality tools, worldwide, 2006-2011,” 2007.
[11] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[12] R. Medina and N. Lhouari, “A unified hierarchy for functional dependencies, conditional functional dependencies and association rules,” in ICFCA, 2009.
Author List:
Venkata Lavanya Korada received her B.Tech in Computer Science and Engineering from Thandra Paparaya Institute of Science and Technology, affiliated to JNTUH, in 2005 and is pursuing an M.Tech in Computer Science at GOKUL Institute of Technology & Sciences, affiliated to JNTUK. Her research areas of interest are Data Mining and Computer Networks.
Avala Atchyuta Rao received his B.Tech in Computer Science and Engineering from Prakasam Engineering College, affiliated to JNTUH, in 2005 and his M.Tech in Neural Networks from GOKUL Institute of Technology & Sciences, affiliated to JNTUK, in 2010. He is a live student Member of CSR. His research area of interest is Software Engineering.