A frame work for clustering time evolving data
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Volume 3, Issue 3, October - December (2012), pp. 377-383
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com

A FRAME WORK FOR CLUSTERING TIME EVOLVING DATA USING SLIDING WINDOW TECHNIQUE

Y. Swapna1, S. Ravi Sankar2
1 (Faculty, CSE Department, National Institute of Technology, Goa, India, spr@nitgoa.ac.in)
2 (Faculty, CSE Department, National Institute of Technology, Goa, India, srs@nitgoa.ac.in)

ABSTRACT

Clustering is the process of dividing a data set into mutually exclusive groups such that the members of each group are as "close" as possible to one another and different groups are as "far" as possible from one another. Sampling represents a large data set by smaller random samples of the data and is used to improve the efficiency of clustering. When sampling is applied, however, the points that are not sampled have no cluster labels after the normal clustering process. This problem has been solved in the numerical domain, whereas clustering of time-evolving data in the categorical domain remains a challenging issue. In this paper, a sliding window of a specified size is used to form subsets of the data set, i.e., it collects data from the database and transfers it to the clustering module. A drifting concept detection (DCD) algorithm is proposed that finds the number of outliers, i.e., the points that cannot be assigned to any cluster. The objective of this algorithm is to compare the distribution of clusters and outliers between the last clustering result and the current temporal clustering result.
The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD provides high-quality clustering results with correctly detected drifting concepts.

Keywords: clustering, sampling, categorical domain, labels, sliding window, drifting concept detection.

I. INTRODUCTION

Our present information-age society thrives and evolves on knowledge. Knowledge is derived from information gleaned from a wide variety of reservoirs of data (databases). Clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a low density of points. Clustering is useful for classification and can reveal the structure of high-dimensional data spaces, in which outliers may also be of interest. It is used in a wide range of applications, including statistical pattern recognition, machine learning, and information retrieval. Cluster analysis is the assignment of a set of observations to subsets (called clusters) so that observations in the same cluster are similar in some sense. It helps us gain insight into the data distribution.

In real-world domains, the concept of interest may depend on some hidden context that is not given explicitly in the form of predictive features. This becomes a problem because such concepts drift with time. A typical example is the buying preferences of customers, which may change with time depending on their needs, climatic conditions, discounts, etc. Since the concepts behind the data evolve with time, the underlying clusters may also change significantly with time. Disregarding this drift not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results.

Many works have addressed the problem of clustering time-evolving data in the numerical domain. Categorical attributes are also prevalent in real data with drifting concepts: for example, Web logs that record the browsing history of users, stock market details, and buying records of customers often evolve with time. Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration. Consequently, the problem of clustering time-evolving data in the categorical domain remains a challenging issue. The objective is to propose a framework for performing clustering on categorical time-evolving data.
Instead of designing a specific clustering algorithm, the goal is to use a generalized clustering framework that utilizes existing clustering algorithms and detects whether or not there is a drifting concept in the incoming data. The sliding window technique is adopted to detect the drifting concepts.

II. RELATED WORK

Many numerical clustering algorithms that consider time-evolving data, as well as traditional categorical clustering algorithms, have been proposed [1]. An effective and efficient method called CluStream for clustering large evolving data streams was proposed by [5]. This method views the stream as a process that changes over time rather than clustering the whole stream at one time. A density-based method called DenStream was proposed in [2] for discovering clusters in an evolving data stream. Evolutionary clustering algorithms were proposed in [5] and [3]. They adopt the same approach, which performs data clustering over time and tries to optimize two potentially conflicting criteria: first, the present clustering should remain similar to the previous one when there is no drifting concept; second, the clustering should reflect the data that arrived at the current time step when the concept drifts. In [6], a generic framework for this problem used k-means and agglomerative hierarchical clustering algorithms that were extended according to the problem domain. In [5], a measure of temporal smoothness is integrated into the overall measure of clustering quality. As a result, the method produces stable and consistent clustering results that are less sensitive to short-term noise while remaining adaptive to long-term cluster drifts. The previously proposed methods have concentrated on the problem of clustering time-evolving data in the numerical domain. In [4], the problem of clustering categorical data is discussed; the method performs clustering on customer transaction data in a market database.
In [6], [4], a framework to perform clustering on categorical time-evolving data has been proposed. In particular, the rough membership function in rough set theory represents a concept that induces a fuzzy set. Several extensions of k-modes have been presented for different objectives, such as fuzzy k-modes [6] and initial points refinement [2]. These categorical algorithms focus on clustering the entire data set and do not consider time-evolving trends.

III. THE PROPOSED APPROACH

We propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether or not there is a drifting concept in the incoming data. To detect the drifting concepts at different sliding windows, we propose the DCD algorithm, which compares the cluster distributions between the last clustering result and the current temporal clustering result.

The input is a collection of data extracted from the database that is to be clustered; this data is time-evolving categorical data (it is not organized on a sequential basis). We used a synthetic data generator [5] to generate data sets with different numbers of data points and attributes. The number of data points varies from 10,000 to 100,000, and the dimensionality is in the range of 10-50. In all synthetic data sets, each dimension possesses 20 attribute values.

A sliding window of a specified size is used to form subsets of the data set, i.e., it collects data from the database and transfers it to the clustering module. In this paper, a practical categorical clustering representative named "Node Importance Representative" (abbreviated as NIR) is utilized. It represents clusters by measuring the importance of each attribute value in the clusters.
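The sliding-window step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `sliding_windows` is hypothetical, the windows are assumed to be consecutive and non-overlapping (as suggested by the subsets S1, S2, S3 of Fig. 1), and the toy values are the fifteen data points listed in Fig. 1.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[str, ...]  # one categorical data point, e.g. ("C", "W", "D")


def sliding_windows(data: Sequence[Point], size: int) -> Iterator[List[Point]]:
    """Yield consecutive, non-overlapping subsets of `size` points each.

    Hypothetical helper: the last subset may be shorter when len(data)
    is not a multiple of `size`.
    """
    if size <= 0:
        raise ValueError("window size must be positive")
    for start in range(0, len(data), size):
        yield list(data[start:start + size])


# Toy data: attribute values (A1, A2, A3) of p1..p15 as listed in Fig. 1.
data: List[Point] = [
    ("C", "W", "D"), ("I", "W", "M"), ("C", "W", "N"), ("S", "W", "M"), ("C", "W", "D"),
    ("B", "E", "F"), ("I", "T", "H"), ("B", "E", "G"), ("S", "I", "H"), ("B", "O", "G"),
    ("S", "W", "P"), ("I", "W", "P"), ("Z", "P", "T"), ("I", "W", "P"), ("S", "W", "P"),
]

windows = list(sliding_windows(data, size=5))  # [S1, S2, S3]
print(len(windows))
```

Each yielded subset would then be handed to the labeling and drift-detection step; a generator is used so that the whole data set never has to be copied at once.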
The Drifting Concept Detection (DCD) algorithm, listed below under "Algorithm Used", detects the difference in cluster distribution between the current data subset and the last clustering result. For proper evaluation, each data point is labeled with a cluster, and points that do not belong to any cluster are called outliers. Reclustering is performed if the difference between the cluster distributions is large enough.

A data point $p_j$ is assigned to the cluster $c_k$, $1 \le k \le l$, with which it has the maximal resemblance. The resemblance between a given data point $p_j$ and the NIR table of cluster $c_i$ is defined by the following equation:

$$R(p_j, c_i) = \sum_{r=1}^{q} w(c_i, I_{ir}) \qquad (1)$$

where $I_{ir}$ is one entry in the NIR table of cluster $c_i$ and $q$ is the number of nodes into which $p_j$ is decomposed. As shown in equation (1), resemblance is obtained directly by summing the importance of the data point's nodes in the NIR table of cluster $c_i$. The resemblance is larger if the data point contains nodes that are more important in one cluster than in another, and the point is assigned to the cluster with maximal resemblance. If the resemblance values for all clusters are small, the data point is treated as an outlier. Therefore, a threshold $\lambda_i$ is set for each cluster to identify outliers. The decision function is defined as follows:

$$\text{Label} = \begin{cases} c_i^{*}, & \text{if } \max R(p_j, c_i) \ge \lambda_i, \text{ where } 1 \le i \le l; \\ \text{outlier}, & \text{otherwise.} \end{cases}$$

As shown in Fig. 1, the data points in the second sliding window are labeled with thresholds $\lambda_1 = \lambda_2 = 0.5$. The first data point $p_6 = (B, E, F)$ in $S^2$ is decomposed into three nodes, i.e., {[A1=B]}, {[A2=E]}, {[A3=F]}. The resemblance of $p_6$ in $c_1^1$ is zero, and in $c_2^1$ it is also zero; since the maximal resemblance is not larger than the threshold, the data point is considered an outlier. For another data point, the resemblance in $c_1^1$ is 0.037 and in $c_2^1$ it is 1.537 (0.5 + 0.037 + 1). The maximal resemblance value is then $R(p_j, c_2^1)$, which is larger than the threshold $\lambda_2 = 0.5$; therefore the data point is labeled to cluster $c_2^1$.

Data set (sliding windows S1, S2, S3):

          S1                  S2                   S3
      p1 p2 p3 p4 p5      p6 p7 p8 p9 p10      p11 p12 p13 p14 p15
A1    C  I  C  S  C       B  I  B  S  B        S   I   Z   I   S
A2    W  W  W  W  W       E  T  E  I  O        W   W   P   W   W
A3    D  M  N  M  D       F  H  G  H  G        P   P   T   P   P

Clustering result on S1:

      c_1^1     c_2^1
A1    C C C     I S
A2    W W W     W W
A3    D N D     M M

Temporal clustering result on S2:

      c'_1^2    c'_2^2    outliers
A1              I S       B B B
A2              T T       E E O
A3              H H       F G G

Fig. 1: The temporal clustering result $C'^2$ that is obtained by data labeling.

Algorithm Used:

Let temp = $C^{[t_e, t-1]}$
DriftingConceptDetecting(temp, $S^t$)
  out = 0
  while there is a next tuple in $S^t$ do
    read in data point $p_j$ from $S^t$
    divide $p_j$ into nodes $I_1$ to $I_q$
    for all clusters temp_i in temp do
      calculate resemblance $R(p_j, \text{temp}_i)$
    end for
    find the cluster temp_m with maximal resemblance
    if $R(p_j, \text{temp}_m) \ge \lambda_m$ then
      $p_j$ is assigned to $c'^t_m$
    else
      out = out + 1
    end if
  end while
  outlier = out   {data labeling on the current sliding window is done}
  numdiffclusters = 0
  for all clusters temp_i in temp do
    if $\left| \dfrac{m_i^{[t_e,t-1]}}{\sum_{x=1}^{k^{[t_e,t-1]}} m_x^{[t_e,t-1]}} - \dfrac{m_i'^{t}}{\sum_{x=1}^{k^{[t_e,t-1]}} m_x'^{t}} \right| > \varepsilon$ then
      numdiffclusters = numdiffclusters + 1
    end if
  end for
  if $\dfrac{\text{out}}{N} > \theta$ or $\dfrac{\text{numdiffclusters}}{k^{[t_e,t-1]}} > \eta$ then
    {concept drifts}
    dump out temp
    call initial clustering on $S^t$
  else
    {concept does not drift}
    add $C'^t$ into temp
    update NIR as $C^{[t_e, t]}$
  end if

Since the similarity between data point $p_j$ and cluster $c_i$ is measured as $R(p_j, c_i)$, the cluster with the maximal resemblance is the most appropriate cluster for that data point. If the maximal resemblance is smaller than the threshold $\lambda_i$ of that cluster, the data point is regarded as an outlier.

To observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between them. It measures the similarity of clusters between the clustering results at different timestamps and links similar clusters.

            c_1^2    c_2^2
c_1^1       0.012    0.182
c_2^1       0.567    0

            c_1^3    c_2^3
c_1^2       1        0
c_2^2       0        0

Fig. 2: The similarity table between clustering results

The cosine measure $CM(\bar{c}_2^1, \bar{c}_1^2)$ is 0.567, which is larger than $CM(\bar{c}_1^1, \bar{c}_1^2)$. Therefore cluster $c_1^2$ is said to be more similar to $c_2^1$ than to cluster $c_1^1$.
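The labeling rule, the drift test, and the cosine measure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the NIR weights, and the sample window are hypothetical, with the weights chosen only so that one point reproduces the sums 0.037 and 1.537 quoted in the text. The cluster-variation test is read here as comparing each cluster's share of points in the last clustering result with its share among the labeled points of the current window, which is an assumed reading of that condition.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Node = Tuple[int, str]      # (attribute index, attribute value)
NIR = Dict[Node, float]     # node -> importance w(c_i, I_ir) in one cluster


def resemblance(point: Sequence[str], nir: NIR) -> float:
    """Equation (1): sum the importance of the point's nodes in one cluster."""
    return sum(nir.get(node, 0.0) for node in enumerate(point))


def label(point: Sequence[str], nirs: List[NIR],
          lambdas: List[float]) -> Optional[int]:
    """Return the index of the most resembling cluster, or None for an outlier."""
    scores = [resemblance(point, nir) for nir in nirs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= lambdas[best] else None


def concept_drifts(last_sizes: List[int], labels: List[Optional[int]],
                   theta: float, epsilon: float, eta: float) -> bool:
    """Drift if too many outliers or too many clusters changed their share."""
    k, n = len(last_sizes), len(labels)
    if k == 0 or n == 0:
        raise ValueError("need at least one cluster and one data point")
    out = sum(1 for lab in labels if lab is None)
    cur_sizes = [sum(1 for lab in labels if lab == i) for i in range(k)]
    last_total, cur_total = sum(last_sizes), sum(cur_sizes)
    changed = 0
    for last, cur in zip(last_sizes, cur_sizes):
        last_share = last / last_total if last_total else 0.0
        cur_share = cur / cur_total if cur_total else 0.0
        if abs(last_share - cur_share) > epsilon:
            changed += 1
    return out / n > theta or changed / k > eta


def cosine_measure(u: NIR, v: NIR) -> float:
    """Cosine of the angle between two sparse node importance vectors."""
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical NIR tables for two clusters (weights are assumptions).
nirs = [
    {(0, "C"): 1.0, (1, "W"): 0.037, (2, "D"): 0.667, (2, "N"): 0.333},
    {(0, "I"): 0.5, (0, "S"): 0.5, (1, "W"): 0.037, (2, "M"): 1.0},
]
lambdas = [0.5, 0.5]
# Hypothetical window of five points; (B, E, F) matches p6 in the text.
window = [("B", "E", "F"), ("I", "W", "M"), ("B", "E", "G"),
          ("S", "W", "M"), ("B", "O", "G")]
labels = [label(p, nirs, lambdas) for p in window]
drift = concept_drifts([3, 2], labels, theta=0.1, epsilon=0.1, eta=0.5)
print(labels, drift)
```

With these inputs three of the five points are outliers, so the outlier ratio alone exceeds θ = 0.1 and the window would trigger reclustering. Storing each NIR table as a dictionary keyed by node keeps both the resemblance sum and the cosine measure linear in the number of nodes.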
Table 1: Symbols used in Algorithm

$A_a$               The a-th attribute in the data set.
$C^{[t_1, t_2]}$    The clustering result from $t_1$ to $t_2$.
$C^t$               The clustering result on sliding window t.
$C'^t$              The temporal clustering result on sliding window t.
$c_j$               The j-th cluster in C.
$\bar{c}_i$         The node importance vector of $c_i$.
$I_{ir}$            The r-th node in $c_i$.
$|I_{ir}|$          The number of occurrences of $I_{ir}$.
k                   The number of clusters in C.
$m_i$               The number of data points in $c_i$.
N                   The size of the sliding window.
$S^t$               The sliding window t.
t                   The timestamp index of the sliding window.
$w(c_i, I_{ir})$    The importance of $I_{ir}$ in $c_i$.
θ                   The outlier threshold.
ε                   The cluster variation threshold.
η                   The cluster difference threshold.
$CM(c_i, c_j)$      The cosine measure between cluster vectors $\bar{c}_i$ and $\bar{c}_j$.

IV. RESULTS

The following table reports the precision and recall of DCD in detecting drifting concepts.

N = 1000
Settings    Drifting    Precision    Recall
D1          35.6        0.557        0.873
D2          39.2        0.825        0.992
D3          46          0.816        0.98
D4          44.5        0.443        0.97

Fig. 3: The precision and recall of the DCD

We change clustering pairs to obtain data sets with drifting concepts and then test the detection accuracy of DCD on those data sets. The outlier threshold θ is set to 0.1, the cluster variation threshold ε is set to 0.1, and the cluster difference threshold η is set to 0.5. The number of clusters k, a required parameter of the initial clustering and reclustering steps, is set to the maximum number of clusters in each setting, e.g., k = 10 in D1 and k = 20 in D3. In addition, each synthetic data set is generated by randomly combining 50 clustering results on that data set setting, and the precision and recall shown in Fig. 3 are averages over 20 experiments. The precision and recall are more than 80 percent when the size of the sliding window is larger than 2,000.
Precision is somewhat lower when the size of the sliding window is set to 1,000: because drifting concepts often cross two windows, we count only one window as a correct hit and the other as a miss. However, the detection recall is highest when the size of the sliding window is set to 1,000. Drifting concepts are unlikely to be missed when the data set is divided into finer windows. For two example bank data sets synthesized with settings D1 and D2, evaluating the clustering results on each sliding window shows that the framework generates a new clustering result when a drifting concept is detected and responds quickly to the trend of the evolving data set.

V. CONCLUSION

In this paper we have proposed a framework to perform clustering on categorical time-evolving data. To detect drifting concepts at different sliding windows, we proposed the algorithm DCD, which compares the cluster distributions of the last clustering result and the current temporal clustering result. If the results are quite different, the last clustering result is dumped out and the current data in the sliding window are reclustered. To observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between them. The experimental evaluation shows that performing DCD is faster than clustering the entire data set once and that DCD can provide high-quality clustering results with correctly detected drifting concepts. The results therefore demonstrate that our framework is practical for detecting drifting concepts in time-evolving categorical data.

VI. REFERENCES

[1] D. Barbara, Y. Li, and J. Couto, Coolcat: An Entropy-Based Algorithm for Categorical Clustering, Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.
[2] F. Cao, M. Ester, W. Qian, and A. Zhou, Density-Based Clustering over an Evolving Data Stream with Noise, Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.
[3] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values, Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[4] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites, IEEE Trans. Knowledge and Data Eng., vol. 20, no. 2, pp. 202-215, Feb. 2008.
[5] H.-L. Chen, M.-S. Chen, and S.-C. Lin, Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, May 2009.
[6] Z. Huang and M.K. Ng, A Fuzzy k-Modes Algorithm for Clustering Categorical Data, IEEE Trans. Fuzzy Systems, 1999.