Venkata Lavanya Korada, Avala Atchyuta Rao / International Journal of Engineering
            Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                   Vol. 2, Issue 6, November- December 2012, pp.142-147
Analyzing & Identifying CFD’s using the Concepts of Data Mining
Venkata Lavanya Korada*1, Avala Atchyuta Rao*2
*1 M.Tech Student, Gokul Institute of Technology & Science, Bobilli, INDIA
*2 Asst. Professor, CSE Dept, Gokul Institute of Technology & Science, Bobilli, INDIA


Abstract
Conditional functional dependencies (CFDs) are a recent extension of functional dependencies (FDs). They capture patterns of semantically related constants and can be used as rules for cleaning relational data. It is often unrealistic to rely solely on human experts to design CFDs via an expensive and long manual process. For CFD-based cleaning methods to be effective, techniques must be in place that can automatically discover or learn CFDs from sample data. The discovery problem, already quite difficult for traditional FDs, is more difficult for CFDs, and mining the patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is for constant CFD discovery. It explores the connection between minimal constant CFDs and closed and free patterns. The other two algorithms are developed for discovering general CFDs. Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice and discovers minimal CFDs only. Our third algorithm, FastCFD, discovers general CFDs by applying a depth-first search strategy rather than the levelwise approach. Together, these algorithms provide a set of promising tools that help reduce manual effort in the design of data-quality rules, for users to choose for different applications. They help make CFD-based cleaning a practical data quality tool.

Keywords – Privacy, Privelets, Data Publishing, and Range count Queries.

I. INTRODUCTION
Functional dependencies (FDs) have been studied extensively, and conditional functional dependencies (CFDs) are a recent extension of them. This paper investigates the discovery of CFDs, which support patterns of semantically related constants and can be used as rules for cleaning relational data. However, finding CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from sample relations. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed itemsets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. The other two algorithms are developed for discovering general CFDs. The first algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-itemset mining to reduce the search space. Our experimental results demonstrate the following.
(i) CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery.
(ii) CTANE works well when a given sample relation is large, but it does not scale well with the arity of the relation.
(iii) FastCFD is far more efficient than CTANE when the arity of the relation is large.

Constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery. Levelwise algorithms may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

II. PREVIOUS WORK
The discovery problem has been studied for FDs for two decades [1], [3] for database design, data archiving, OLAP and data mining. It was first


investigated in [2], which shows that the problem is inherently exponential in the arity |R| of the schema R of sample data r. One of the best-known methods for FD discovery is TANE [3], a levelwise algorithm [2] that searches an attribute-set containment lattice and derives FDs with k + 1 attributes from sets of k attributes, with pruning based on FDs generated in previous levels. TANE takes linear time in the size |r| of input sample r, and works well when the arity |R| is not very large. The algorithms of [6], [7], [8] follow a similar levelwise approach. However, the levelwise algorithms may take exponential time in |R| even if the output is not exponential in |R|. In light of this, another algorithm, referred to as FastFD [4], explores the connection between FD discovery and the problem of finding minimal covers of hypergraphs, and employs a depth-first strategy to search for minimal covers. It takes (almost) linear time in the size of the output, i.e., in the size of the FD cover. It scales better than TANE when the arity is large, but it is more sensitive to the size |r|. Indeed, it is in O(|r|^2 log |r|) time, when considering data complexity (|R| is assumed constant). There has also been a bottom-up approach [5] based on techniques for learning general logical descriptions in a hypotheses space. As shown in [3], TANE outperforms the algorithm of [5]. Recently two sets of algorithms have been developed for discovering CFDs [1], [2]. For a fixed traditional FD fd, [1] showed that it is NP-complete to find useful patterns that, together with fd, make quality CFDs. They provide efficient heuristic algorithms for discovering patterns from samples w.r.t. a fixed FD. An algorithm for discovering CFDs, including both traditional FDs and their associated patterns, was presented in [2], which is an extension of TANE.

Constant CFD discovery is closely related to association rule mining (e.g., [2]) and, in particular, closed and free itemset mining (e.g., [3], [24]). With 100% confidence, an association rule (X, tp) ⇒ (A, a) is a constant CFD (X → A, (tp ‖ a)), where tp is a constant pattern over attributes X and a is a value in the domain of attribute A. Better still, there is an intimate connection between left-reduced constant CFDs and non-redundant association rules, which can be found by computing closed itemsets and free itemsets. The potential applications of CFDs in data cleaning highlight the need for further investigations of CFD discovery. As remarked earlier, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery.

Levelwise algorithms [2] may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

III. SYSTEM ANALYSIS & DESCRIPTION
In light of these considerations we provide the following modules for CFD discovery: one for discovering constant CFDs, and the other two for general CFDs.
(Module: 1) We propose a notion of minimal CFDs based on both the minimality of attributes and the minimality of patterns. Intuitively, minimal CFDs contain neither redundant attributes nor redundant patterns. Furthermore, we consider frequent CFDs that hold on a sample dataset r, namely, CFDs in which the pattern tuples have a support in r above a certain threshold. Frequent CFDs allow us to accommodate unreliable data with errors and noise. Our algorithms find minimal and frequent CFDs to help users identify quality cleaning rules from a possibly large set of CFDs that hold on the samples.
(Module: 2) Our first algorithm, referred to as CFDMiner, is for constant CFD discovery. We explore the connection between minimal constant CFDs and closed and free patterns. Based on this, CFDMiner finds constant CFDs by leveraging a recent mining technique, which mines closed itemsets and free itemsets in parallel following a depth-first search scheme.
(Module: 3) Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k + 1 of the lattice (i.e., when each set at the level consists of k + 1 attributes) with pruning based on those at level k. CTANE discovers minimal CFDs only.
(Module: 4) Our third algorithm, referred to as FastCFD, discovers general CFDs by employing a depth-first search strategy instead of the levelwise approach. It is a nontrivial extension of FastFD

mentioned above, by mining pattern tuples. A novel pruning technique is introduced by FastCFD, by leveraging constant CFDs found by CFDMiner. As opposed to CTANE, FastCFD does not take exponential time in the arity of sample data when a canonical cover of CFDs is not exponentially large.
(Module: 5) Our fifth and final contribution is an experimental study of the effectiveness and efficiency of our algorithms, based on real-life data (Wisconsin breast cancer and chess datasets from UCI) and synthetic datasets generated from data scraped from the Web. We evaluate the scalability of these methods by varying the sample size, the arity of the relation schema, the active domains of attributes, and the support threshold for frequent CFDs. We find that constant CFD discovery (using CFDMiner) is often 3 orders of magnitude faster than general CFD discovery (using CTANE or FastCFD). We also find that FastCFD scales well with the arity: it is up to 3 orders of magnitude faster than CTANE when the arity is between 10 and 15, and it performs well when the arity is greater than 30; in contrast, CTANE cannot run to completion when the arity is above 17. On the other hand, CTANE is more sensitive to the support threshold and outperforms FastCFD when the threshold is large and the arity is of a moderate size. We also find that our pruning techniques via itemset mining are effective: they improve the performance of FastCFD by 5 to 10 times and make FastCFD scale well with the sample size. These results provide a guideline for when to use CFDMiner, CTANE or FastCFD in different applications. These modules provide a set of promising tools to help reduce manual effort in the design of data-quality rules, for users to choose for different applications. They help make CFD-based cleaning a practical data quality tool.

Fig.1: Inter-operational Sequence Diagram for the Framework
[Sequence diagram; lifelines: Provider, System, EB Authority; messages: Add User Details, Add Password, Search User Details, View User details, Enter meter no or Area Code or Phone No, View User details.]
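The notions of a CFD with a pattern tuple and of its support, used in Module 1, can be made concrete with a small sketch. It is not the implementation used in this paper. It assumes the usual reading of a CFD (X → A, tp): every tuple that matches the pattern on X must match the pattern on A, and tuples that agree on X must agree on A. Support is assumed here to be the number of tuples matching the whole pattern. The wildcard symbol "_", the attribute names CC, AC, CT and the sample rows are hypothetical.

```python
# Sketch only: checks a single CFD (X -> A, tp) on a list of dict rows.
# "_" is an assumed wildcard meaning "any value" in the pattern tuple.
WILDCARD = "_"


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def satisfies(rows, lhs_pattern, rhs_attr, rhs_value):
    """True if the rows satisfy the CFD (X -> A, (lhs_pattern || rhs_value))."""
    seen = {}  # X-values -> the A-value first observed with them
    for row in rows:
        if not matches(row, lhs_pattern):
            continue  # the CFD only constrains tuples matching the LHS pattern
        if rhs_value != WILDCARD and row[rhs_attr] != rhs_value:
            return False  # constant on the right-hand side is violated
        key = tuple(row[a] for a in lhs_pattern)
        if seen.setdefault(key, row[rhs_attr]) != row[rhs_attr]:
            return False  # two tuples agree on X but differ on A
    return True


def support(rows, lhs_pattern, rhs_attr, rhs_value):
    """Number of rows matching the full pattern tuple (assumed definition)."""
    full = dict(lhs_pattern)
    full[rhs_attr] = rhs_value
    return sum(1 for row in rows if matches(row, full))


# Hypothetical sample relation (country code, area code, city).
rows = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "908", "CT": "EDI"},
]

print(satisfies(rows, {"CC": "44"}, "CT", "EDI"))  # True: a constant CFD
print(support(rows, {"CC": "44"}, "CT", "EDI"))    # 2
print(satisfies(rows, {"AC": "_"}, "CT", "_"))     # False: AC 908 maps to two cities
```

A CFD would be called frequent in this sketch when `support(...)` reaches a user-chosen threshold k, which is how noisy, rarely supported rules can be filtered out.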

IV. SYSTEM DESIGN & IMPLEMENTATION
The component design diagram helps to model the physical aspects of an object-oriented software system; for the proposed framework, it illustrates the architecture of the dependencies between service provider and consumer.

A sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur. This allows the specification of simple runtime scenarios in a graphical manner.

Fig.2: Inter-operational Use Activity diagram for the framework
[Activity diagram; recoverable node labels: Provider Login, Check (Yes / No), Unauthorized Person, Add User Details, View Tables, Change Password, View User Full Details.]




Fig.3: Inter-operational class diagram for the Framework
[Class diagram with two classes. Provider: Add Users Details, Change Password, View User Details (table-wise), View User Full details; String Name(), setAttribute(), Set Session(), String toString(). EB Authority: View Users Details, Change Password, Enter Meter Number or Area Code or Phone Number, View User Full details; getName(), getAttribute(), Get Session(), String toString().]

V. RESULTS
CTANE Algorithm
CTANE is a levelwise algorithm for discovering minimal, k-frequent (variable and constant) CFDs. It is an extension of the TANE algorithm [3] for discovering FDs.
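The levelwise search that CTANE inherits from TANE can be illustrated on plain FDs. The sketch below is a deliberately simplified stand-in, not CTANE itself: it handles no pattern tuples and no support threshold, it starts at level 1 (so constant columns are not treated specially), and it rescans the rows instead of maintaining partitions incrementally. The function names and sample rows are hypothetical. It walks the attribute-set lattice one level at a time and skips any candidate whose left-hand side contains one already found at a lower level.

```python
from itertools import combinations


def partition_size(rows, attrs):
    """Number of distinct value combinations of attrs among the rows."""
    return len({tuple(row[a] for a in attrs) for row in rows})


def holds(rows, lhs, rhs):
    """The FD lhs -> rhs holds iff adding rhs does not split any group."""
    return partition_size(rows, lhs) == partition_size(rows, lhs + (rhs,))


def levelwise_fds(rows, attributes):
    """Return minimal FDs as (lhs, rhs) pairs, scanning levels bottom-up."""
    found = []
    for k in range(1, len(attributes)):          # level k: left-hand sides of size k
        for lhs in combinations(attributes, k):
            for rhs in attributes:
                if rhs in lhs:
                    continue                     # trivial dependency
                # Prune: a smaller left-hand side already determines rhs.
                if any(set(l) <= set(lhs) and r == rhs for l, r in found):
                    continue
                if holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


# Hypothetical sample relation (country code, area code, city).
rows = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "908", "CT": "EDI"},
]

print(levelwise_fds(rows, ("CC", "AC", "CT")))
# [(('CT',), 'CC'), (('CC', 'AC'), 'CT')]
```

Because every level enumerates all attribute subsets of that size, the number of candidates grows exponentially with the arity, which is the scaling limitation of levelwise methods discussed in this paper.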

Fig.4: To add the user details
Fig.5: Welcome screen for service provider




CONCLUSION
We have developed and implemented three algorithms for discovering minimal CFDs: CFDMiner for mining minimal constant CFDs, a class of CFDs important for both data cleaning and data integration; CTANE for discovering general minimal CFDs based on the levelwise approach; and FastCFD for discovering general minimal CFDs based on a depth-first search strategy and a novel optimization technique via closed-itemset mining. As suggested by our experimental results, these provide a set of tools for users to choose from for different applications. When only constant CFDs are needed, one can simply use CFDMiner without paying the price of mining general CFDs. When the arity of a sample dataset is large, one should opt for FastCFD. When k-frequent CFDs are needed for a large k, one could use CTANE.
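The guideline above can be written down as a small decision helper. This is only a sketch of the stated rules of thumb. The function name, the cut-off `LARGE_ARITY` (taken from the reported observation that CTANE did not run to completion for arity above 17), the order in which the rules are applied, and the fallback choice are all assumptions, not part of the paper.

```python
# Assumed cut-off for "large arity", based on the reported experiments in
# which CTANE could not run to completion when the arity was above 17.
LARGE_ARITY = 17


def suggest_algorithm(constant_only, arity, large_support_threshold):
    """Suggest a discovery algorithm following the conclusion's guideline."""
    if constant_only:
        return "CFDMiner"   # only constant CFDs are needed
    if arity > LARGE_ARITY:
        return "FastCFD"    # levelwise search does not scale with arity
    if large_support_threshold:
        return "CTANE"      # k-frequent CFDs for a large k
    return "FastCFD"        # assumed default when no rule above applies


print(suggest_algorithm(constant_only=True, arity=30, large_support_threshold=False))
print(suggest_algorithm(constant_only=False, arity=10, large_support_threshold=True))
```

Checking the arity before the support threshold is a design choice in this sketch: it reflects the reported behaviour that CTANE becomes impractical at large arity regardless of the threshold.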

Fig.6: To find the information of the user
Fig.7: User complete information

REFERENCES
[1] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Information and Computation, vol. 197, no. 1-2, pp. 90–121, 2005.
[2] J. Wijsen, “Database repairing using updates,” TODS, vol. 30, no. 3, pp. 722–768, 2005.
[3] L. Bravo, W. Fan, and S. Ma, “Extending dependencies with conditions,” in VLDB, 2007.
[4] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[5] S. Lopes, J.-M. Petit, and L. Lakhal, “Efficient discovery of functional dependencies and armstrong relations,” in EDBT, 2000.
[6] T. Calders, R. T. Ng, and J. Wijsen, “Searching for dependencies at multiple abstraction levels,” TODS, vol. 27, no. 3, pp. 229–260, 2003.
[7] R. S. King and J. J. Legendre, “Discovery of functional and approximate functional dependencies in relational databases,” JAMDS, vol. 7, no. 1, pp. 49–59, 2003.
[8] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga, “Cords: Automatic discovery of correlations and soft functional dependencies,” in SIGMOD, 2004.
[9] H. Mannila and H. Toivonen, “Levelwise search and borders of theories in knowledge discovery,” Data Min. Knowl. Discov., vol. 1, no. 3, pp. 259–289, 1997.
[10] Gartner, “Forecast: Data quality tools, worldwide, 2006-2011,” 2007.
[11] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.

[12] R. Medina and N. Lhouari, “A unified hierarchy for functional dependencies, conditional functional dependencies and association rules,” in ICFCA, 2009.

Author List:

Venkata Lavanya Korada received a B.Tech in Computer Science and Engineering from Thandra Paparaya Institute of Science and Technology, affiliated to JNTUH, in 2005 and is pursuing an M.Tech in Computer Science at GOKUL Institute of Technology & Sciences, affiliated to JNTUK. Her research areas of interest are Data Mining and Computer Networks.

Avala Atchyuta Rao received a B.Tech in Computer Science and Engineering from Prakasam Engineering College, affiliated to JNTUH, in 2005 and an M.Tech in Neural Networks from GOKUL Institute of Technology & Sciences, affiliated to JNTUK, in 2010. He is a live student Member of CSR. His research area of interest is Software Engineering.




Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
 
factorization methods
factorization methodsfactorization methods
factorization methods
 

W26142147

Venkata Lavanya Korada, Avala Atchyuta Rao / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 6, November-December 2012, pp. 142-147

Analyzing & Identifying CFD’s using the Concepts of Data Mining

Venkata Lavanya Korada*1, Avala Atchyuta Rao*2
*1 M.Tech Student, Gokul Institute of Technology & Science, Bobilli, INDIA
*2 Asst. Professor, CSE Dept, Gokul Institute of Technology & Science, Bobilli, INDIA

Abstract
Conditional functional dependencies (CFDs) are a recent extension of functional dependencies (FDs). They capture patterns of semantically related constants and can be applied as rules for cleaning relational data. It is often unrealistic to rely completely on human experts to design CFDs via an expensive and long manual process. For CFD-based cleaning methods to be effective, techniques must be in place that can automatically discover or learn CFDs from sample data. The discovery problem is already quite difficult for traditional FDs, and it is more difficult for CFDs, because mining the patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first method, referred to as CFDMiner, is for constant CFD discovery. It explores the connection between minimal constant CFDs and closed and free patterns. The other two algorithms are developed for discovering general CFDs. Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice and explores minimal CFDs only. Our third algorithm, FastCFD, discovers general CFDs by applying a depth-first search strategy rather than the levelwise approach. Together, these algorithms provide a set of promising tools that help reduce manual effort in the design of data-quality rules, and from which users can choose for different applications. They help make CFD-based cleaning a practical data quality tool.

Keywords – Privacy, Privelets, Data Publishing, and Range count Queries.

I. INTRODUCTION
Functional dependencies have been investigated extensively, and conditional functional dependencies (CFDs) are a recent extension of them. This paper investigates the discovery of CFDs, which support patterns of semantically related constants and can be used as rules for cleaning relational data. However, finding CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from sample relations. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed itemsets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. The other two algorithms are developed for discovering general CFDs. The first of these, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-itemset mining to reduce the search space. Our experimental results demonstrate the following.
(i) CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery.
(ii) CTANE works well when a given sample relation is large, but it does not scale well with the arity of the relation.
(iii) FastCFD is far more efficient than CTANE when the arity of the relation is large.
As mentioned, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery. Levelwise algorithms may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

II. PREVIOUS WORK
The discovery problem has been studied for FDs for two decades [1], [3] for database design, data archiving, OLAP and data mining. It was first
investigated in [2], which shows that the problem is inherently exponential in the arity |R| of the schema R of sample data r. One of the best-known methods for FD discovery is TANE [3], a levelwise algorithm [2] that searches an attribute-set containment lattice and derives FDs with k + 1 attributes from sets of k attributes, with pruning based on FDs generated in previous levels. TANE takes linear time in the size |r| of input sample r, and works well when the arity |R| is not very large. The algorithms of [6], [7], [8] follow a similar levelwise approach. However, the levelwise algorithms may take exponential time in |R| even if the output is not exponential in |R|. In light of this, another algorithm, referred to as FastFD [4], explores the connection between FD discovery and the problem of finding minimal covers of hypergraphs, and employs a depth-first strategy to search for minimal covers. It takes (almost) linear time in the size of the output, i.e., in the size of the FD cover. It scales better than TANE when the arity is large, but it is more sensitive to the size |r|. Indeed, it is in O(|r|² log |r|) time, when considering data complexity (|R| is assumed constant). There has also been a bottom-up approach [5] based on techniques for learning general logical descriptions in a hypotheses space. As shown in [3], TANE outperforms the algorithm of [5]. Recently two sets of algorithms have been developed for discovering CFDs [1], [2]. For a fixed traditional FD fd, [1] showed that it is NP-complete to find useful patterns that, together with fd, make quality CFDs. They provide efficient heuristic algorithms for discovering patterns from samples w.r.t. a fixed FD. An algorithm for discovering CFDs, including both traditional FDs and their associated patterns, was presented in [2], which is an extension of TANE.

Constant CFD discovery is closely related to association rule mining (e.g., [2]) and, in particular, closed and free itemset mining (e.g., [3], [24]). With 100% confidence, an association rule (X, tp) ⇒ (A, a) is a constant CFD (X → A, (tp ‖ a)), where tp is a constant pattern over attributes X and a is a value in the domain of attribute A. Better still, there is an intimate connection between left-reduced constant CFDs and non-redundant association rules, which can be found by computing closed itemsets and free itemsets. The potential applications of CFDs in data cleaning highlight the need for further investigations of CFD discovery. As remarked earlier, constant CFDs are particularly important for object identification, and thus deserve a separate treatment. One wants efficient methods to discover constant CFDs alone, without paying the price of discovering all CFDs. Indeed, as will be seen later, constant CFD discovery is often several orders of magnitude faster than general CFD discovery.

Levelwise algorithms [2] may not perform well on sample relations of large arity, given their inherent exponential complexity. More effective methods have to be in place to deal with datasets with a large arity. A host of techniques have been developed for (non-redundant) association rule mining, and it is only natural to capitalize on these for CFD discovery. As we shall see, these techniques can not only be readily used in constant CFD discovery, but also significantly speed up general CFD discovery. To our knowledge, no previous work has considered these issues for CFD discovery.

III. System Analysis & description
In light of these considerations we provide the following modules for CFD discovery: one for discovering constant CFDs, and the other two for general CFDs.
(Module: 1) We propose a notion of minimal CFDs based on both the minimality of attributes and the minimality of patterns. Intuitively, minimal CFDs contain neither redundant attributes nor redundant patterns. Furthermore, we consider frequent CFDs that hold on a sample dataset r, namely, CFDs in which the pattern tuples have a support in r above a certain threshold. Frequent CFDs allow us to accommodate unreliable data with errors and noise. Our algorithms find minimal and frequent CFDs to help users identify quality cleaning rules from a possibly large set of CFDs that hold on the samples.
(Module: 2) Our first algorithm, referred to as CFDMiner, is for constant CFD discovery. We explore the connection between minimal constant CFDs and closed and free patterns. Based on this, CFDMiner finds constant CFDs by leveraging a recent mining technique, which mines closed itemsets and free itemsets in parallel following a depth-first search scheme.
(Module: 3) Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k + 1 of the lattice (i.e., when each set at the level consists of k + 1 attributes) with pruning based on those at level k. CTANE discovers minimal CFDs only.
(Module: 4) Our third algorithm, referred to as FastCFD, discovers general CFDs by employing a depth-first search strategy instead of the levelwise approach. It is a nontrivial extension of FastFD
mentioned above, by mining pattern tuples. FastCFD introduces a novel pruning technique that leverages the constant CFDs found by CFDMiner. As opposed to CTANE, FastCFD does not take exponential time in the arity of sample data when a canonical cover of CFDs is not exponentially large.
(Module: 5) Our fifth and final contribution is an experimental study of the effectiveness and efficiency of our algorithms, based on real-life data (Wisconsin breast cancer and chess datasets from UCI) and synthetic datasets generated from data scraped from the Web. We evaluate the scalability of these methods by varying the sample size, the arity of the relation schema, the active domains of attributes, and the support threshold for frequent CFDs. We find that constant CFD discovery (using CFDMiner) is often 3 orders of magnitude faster than general CFD discovery (using CTANE or FastCFD). We also find that FastCFD scales well with the arity: it is up to 3 orders of magnitude faster than CTANE when the arity is between 10 and 15, and it performs well when the arity is greater than 30; in contrast, CTANE cannot run to completion when the arity is above 17. On the other hand, CTANE is more sensitive to the support threshold and outperforms FastCFD when the threshold is large and the arity is of a moderate size. We also find that our pruning techniques via itemset mining are effective: they improve the performance of FastCFD by a factor of 5 to 10 and make FastCFD scale well with the sample size. These results provide a guideline for when to use CFDMiner, CTANE or FastCFD in different applications. These modules provide a set of promising tools that help reduce manual effort in the design of data-quality rules, and from which users can choose for different applications. They help make CFD-based cleaning a practical data quality tool.

IV. SYSTEM DESIGN & IMPLEMENTATION
The component design diagram helps to model the physical aspects of an object-oriented software system; for the proposed framework it illustrates the architecture of the dependencies between service provider and consumer.
A sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur. This allows the specification of simple runtime scenarios in a graphical manner.

[Figure labels: System, Provider, EB Authority; Add User Details, Add Password, Search User Details, View User Details, Enter Meter No or Area Code or Phone No, View User Details]
Fig.1: Inter-operational Sequence Diagram for the Framework

[Figure labels: Provider Login, Check (Yes / No), Unauthorized Person, Add User Details, View Tables, Change Password, View User Full Details]
Fig.2: Inter-operational Use Activity diagram for the framework
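To make the notions used in Module 1 and Module 2 concrete, the following minimal sketch checks a single constant CFD (X → A, (tp ‖ a)) against a sample relation and reports its support. It is an illustration only, not part of CFDMiner: the function name `constant_cfd_stats`, the attribute names and the sample rows are hypothetical, and support is assumed here to be the number of tuples that match the pattern on both X and A.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def constant_cfd_stats(rows: List[Row], lhs: Dict[str, str],
                       rhs: Tuple[str, str]) -> Tuple[int, bool]:
    """Return (support, holds) for the constant CFD (X -> A, (tp || a)).

    lhs maps each attribute in X to its constant in tp; rhs is (A, a).
    Assumption: support counts rows matching tp on X and a on A.
    """
    rhs_attr, rhs_value = rhs
    # Rows that match the constant pattern tp on every attribute in X.
    matching = [r for r in rows if all(r[a] == v for a, v in lhs.items())]
    support = sum(1 for r in matching if r[rhs_attr] == rhs_value)
    # The rule has 100% confidence when every matching row carries value a.
    holds = bool(matching) and support == len(matching)
    return support, holds


# Hypothetical sample relation: country code (CC), area code (AC), city (CT).
ROWS: List[Row] = [
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "131", "CT": "NYC"},
]

# Holds on every matching row, with support 2.
print(constant_cfd_stats(ROWS, {"CC": "44", "AC": "131"}, ("CT", "EDI")))
# Violated by the row with CC = 01, so it is not a CFD on this sample.
print(constant_cfd_stats(ROWS, {"AC": "131"}, ("CT", "EDI")))
```

Under these assumptions, a constant CFD would count as frequent when `holds` is true and `support` reaches the chosen threshold.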
V. RESULTS

[Figure labels: EB Authority, Provider; View Users Details, Add Users Details, Change Password, Enter Meter Number (or Area Code) or Phone Number, View User Details (table wise), View User Full Details; String Name(), getName(), setAttribute(), getAttribute(), Set Session(), Get Session(), String toString()]
Fig.3: Inter-operational class diagram for Framework

CTANE Algorithm
CTANE is a levelwise algorithm for discovering minimal, k-frequent (variable and constant) CFDs. It is an extension of the TANE algorithm [3] for discovering FDs.

Fig.4: To add the user details
Fig.5: Welcome screen for service provider
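The levelwise idea that CTANE inherits from TANE can be sketched as follows. This is a deliberately simplified illustration for plain FDs, not CTANE itself: it omits pattern tuples, support thresholds and TANE's partition-based pruning, and the names `fd_holds`, `levelwise_minimal_fds` and the sample relation are hypothetical. It only shows how candidates with k + 1 attributes are examined after those with k attributes, and how an FD found at an earlier level prunes non-minimal candidates.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple

Row = Dict[str, str]


def fd_holds(rows: List[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Check X -> A: rows that agree on X must agree on A."""
    seen: Dict[Tuple[str, ...], str] = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def levelwise_minimal_fds(rows: List[Row], attributes: List[str],
                          max_level: int) -> List[Tuple[FrozenSet[str], str]]:
    """Visit left-hand sides of size 1, 2, ..., max_level in turn."""
    found: List[Tuple[FrozenSet[str], str]] = []
    for k in range(1, max_level + 1):
        for lhs in combinations(sorted(attributes), k):
            lhs_set = frozenset(lhs)
            for rhs in attributes:
                if rhs in lhs_set:
                    continue  # skip trivial dependencies
                # Prune: a proper subset of lhs already determines rhs.
                if any(r == rhs and x < lhs_set for x, r in found):
                    continue
                if fd_holds(rows, lhs, rhs):
                    found.append((lhs_set, rhs))
    return found


# Hypothetical sample relation: country code (CC), area code (AC), city (CT).
ROWS: List[Row] = [
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "131", "CT": "NYC"},
]

FDS = levelwise_minimal_fds(ROWS, ["CC", "AC", "CT"], max_level=2)
for lhs_set, rhs in FDS:
    print(sorted(lhs_set), "->", rhs)
```

On this sample, level 1 yields CT → CC and CT → AC, so at level 2 only {AC, CC} → CT still needs to be tested; the other two-attribute candidates are pruned as non-minimal. The number of candidate sets grows exponentially with the arity, which is the scaling limitation of levelwise search noted in the text.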
Fig.6: To find the information of the user
Fig.7: User complete information

CONCLUSION
We have developed and implemented three algorithms for discovering minimal CFDs: CFDMiner for mining minimal constant CFDs, a class of CFDs important for both data cleaning and data integration; CTANE for discovering general minimal CFDs based on the levelwise approach; and FastCFD for discovering general minimal CFDs based on a depth-first search strategy and a novel optimization technique via closed-itemset mining. As suggested by our experimental results, these provide a set of tools for users to choose from for different applications. When only constant CFDs are needed, one can simply use CFDMiner without paying the price of mining general CFDs. When the arity of a sample dataset is large, one should opt for FastCFD. When k-frequent CFDs are needed for a large k, one could use CTANE.

REFERENCE
[1] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Information and Computation, vol. 197, no. 1-2, pp. 90-121, 2005.
[2] J. Wijsen, “Database repairing using updates,” TODS, vol. 30, no. 3, pp. 722-768, 2005.
[3] L. Bravo, W. Fan, and S. Ma, “Extending dependencies with conditions,” in VLDB, 2007.
[4] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[5] S. Lopes, J.-M. Petit, and L. Lakhal, “Efficient discovery of functional dependencies and armstrong relations,” in EDBT, 2000.
[6] T. Calders, R. T. Ng, and J. Wijsen, “Searching for dependencies at multiple abstraction levels,” TODS, vol. 27, no. 3, pp. 229-260, 2003.
[7] R. S. King and J. J. Legendre, “Discovery of functional and approximate functional dependencies in relational databases,” JAMDS, vol. 7, no. 1, pp. 49-59, 2003.
[8] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga, “Cords: Automatic discovery of correlations and soft functional dependencies,” in SIGMOD, 2004.
[9] H. Mannila and H. Toivonen, “Levelwise search and borders of theories in knowledge discovery,” Data Min. Knowl. Discov., vol. 1, no. 3, pp. 259-289, 1997.
[10] Gartner, “Forecast: Data quality tools, worldwide, 2006-2011,” 2007.
[11] B. Goethals, W. L. Page, and H. Mannila, “Mining association rules of simple conjunctive queries,” in SDM, 2008.
[12] R. Medina and N. Lhouari, “A unified hierarchy for functional dependencies, conditional functional dependencies and association rules,” in ICFCA, 2009.

Author List:
Venkata Lavanya Korada received a B.Tech in Computer Science and Engineering from Thandra Paparaya Institute of Science and Technology, affiliated to JNTUH, in 2005 and is pursuing an M.Tech in Computer Science at GOKUL Institute of Technology & Sciences, affiliated to JNTUK. Her research areas of interest are Data Mining and Computer Networks.

Avala Atchyuta Rao received a B.Tech in Computer Science and Engineering from Prakasam Engineering College, affiliated to JNTUH, in 2005 and an M.Tech in Neural Networks from GOKUL Institute of Technology & Sciences, affiliated to JNTUK, in 2010. He is a live student Member of CSR. His research area of interest is Software Engineering.