Mining Regional Knowledge in Spatial Dataset

Beijing, China August 17, 2009 A Framework for Multi-objective Clustering and Its Application to Co-location Mining RachsudaJianthapthaksin, Christoph F. Eick and Ricardo Vilalta University of Houston, Texas, USA

Talk Outline What is unique about this work with respect to clustering? Multi-objective Clustering (MOC)—Objectives and an Architecture Clustering with Plug-in Fitness Functions Filling the Repository with Clusters Creating Final Clusterings Related Work Co-location Mining Case Study Conclusion and Future Work

1. What is unique about this work with respect to clustering? Clustering algorithms that support plug-in fitness function are used. Clustering algorithms are run multiple times to create clusters. Clusters are stored in a repository that is updated on the fly; cluster generation is separated from creating the final clustering. The final clustering is created from the clusters in the repository based on user preferences. Our approach needs to seeks for alternative, overlapping clusters.

2. Multi-Objective Clustering (MOC) The particular problem investigated in this work: ,[object Object]

Task: Find sets of clusters that a good with respect to two or more objectivesTexas Multi-Objective Clustering Dataset: (longitude,latitude,<concentrations>+)

Survey MOC Approach Clustering algorithms are run multiple times maximizing different subsets of objectives that are captured in compound fitness functions. Uses a repository to store promising candidates. Only clusters that satisfying two or more objectives are considered as candidates. After a sufficient number of clusters has been created, final clustering are generated based on user-preferences. 5

An Architecture for MOC S1 S2 Q’ Clustering Algorithm Goal-driven Fitness Function Generator A Spatial Dataset M X Cluster Summarization Unit Q’ M’ Storage Unit S3 S4 Steps in multi-run clustering: S1: Generate a compound fitness function. S2: Run a clustering algorithm. S3: Update the cluster repository M. S4: Summarize clusters discovered M’. 6

3. Clustering with Plug-in Fitness Functions Motivation: Finding subgroups in geo-referenced datasets has many applications. However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation. Domain or task knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup. Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions that allow domain experts to express desirable characteristics of subgroups they are looking for.

Current Suite of Spatial Clustering Algorithms Representative-based: SCEC, SRIDHCR, SPAM, CLEVER Grid-based: SCMRG Agglomerative: MOSAIC Density-based: SCDE, DCONTOUR (not really plug-in but some fitness functions can be simulated) Density-based Grid-based Agglomerative-based Representative-based Clustering Algorithms Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.

4. Filling the Repository with Clusters Plug-in Reward functions Rewardq(x) are used to assess to which extend an objective q is satisfied for a cluster x. User defined thresholds q are used to determine if an objective q is satisfied by a cluster x (Rewardq(x)>q). Only clusters that satisfy 2 or more objectives are stored in the repository. Only non-dominated clusters are stored in the repository. Dominance relations only apply to pairs of clusters that have a certain degree of agreement (overlap) sim.

Dominance and Multi-Objective Clusters Dominance between clusters x and y with respect to multiple objectives Q. Dominance Constraint with Respect to the Repository 10

Compound Fitness Functions The goal-driven fitness function generator selects a subset Q’(Q) of the objectives Q and creates a compound fitness function qQ’relying on a penalty function approach [Baeck et al. 2000]. CmpReward(x)= (qQ’Rewardq(x)) * Penalty(Q’,x) 11

Updating the Cluster Repository M:= clusters in the repository X:= “new” clusters generated by a single run of the clustering algorithm 12

5. Creating a Final Clustering Final clusterings are subsets of the clusters in the repository M. Inputs: The user provides her own individual objective function RewardU and a reward threshold U and cluster similarity threshold rem that indicates how much cluster overlap she likes to tolerate. Goal: Find XM that maximizes: subject to: 1. xXx’X (xx’  Similarity(x,x’)<rem) 2. xX(RewardU(x)>U) Our paper introduces MO-Dominance-guided Cluster Reduction algorithm (MO-DCR) to create the final clustering.

MO-Dominance-guided Cluster Reduction(MO-DCR) algorithm (MO-DCR) : a dominant cluster : dominated clusters A The algorithm loops over the following 2 steps until M is empty: Include dominant clusters D which are the highest reward clusters in M’ Remove D and their dominated clusters in the rem-proximity from M. B C Dominance graphs D Remark: AB  RewardU(A)>RewardU(B)  Similarity(A,B)> rem E F sim(A,B)=0.8 rem=0.5 M’ 0.7 0.6 A E A E F 14

6. Related Work Multi-objective clustering based on evolutionary algorithms (MOEA): VIENNA [Handl and Knowles 2004] , MOCLE [Faceli et al. 2007] ,[object Object],Moreover, MOC relies on cluster repositories that store individual clusters and not clusterings and summarization algorithms to create the final clustering. 15

7. Case Study: Co-location Mining Goal: Finding regional co-location patterns where high concentrations of Arsenic are co-located with a lot of other factors in Texas. Remark: Each binary co-location is treated as a single objective. Dataset: TWDB has monitored water quality and collected the data for 105,814 wells in Texas over last 25 years. we use a subset of Arsenic_10_avg data set: longitude and latitude, Arsenic (As), Molybdenum (Mo), Vanadium (V), Boron (B), Fluoride (F-), Chloride (Cl-), Sulfate (SO42-) and Total Dissolved Solids (TDS). 16

Objective Functions Used . Q’ Q Q = {q{As,Mo}, q{As,V}, q{As,B}, q{As,F-}, q{As,Cl-}, q{As,SO42-}, q{As,TDS}} RewardB(x) = (B,x)|x| 17

Steps of the Experiment ,[object Object],q{As,Mo}, q{As,V}, q{As,B}, q{As,F-}, q{As,Cl-}, q{As,SO42-}, q{As,TDS}. ,[object Object], q{As,Mo}=13, q{As,V}=15, q{As,B}=10, q{As,F-}=25, q{As,Cl-}=7, q{As, SO42-}=6, q{As,TDS}=8. MOC Users Queries Spatial dataset and fitness functions (Q) Regions M’ (M) with associated co-location pattern MOC Step 1-3 MOC Step 4 Regions (M) 18

Experimental Results MOC is able to identify: Multi-objective clusters Alternative clusters e.g. Rank1 regions of (a) and Rank2 regions of (b) Nested clusters e.g. in (b) Rank3-5 regions are sub-regions of Rank1 region. Particularly discriminate among companion elements such as Vanadium (Rank3 region), or Chloride, Sulfate and Total Dissolved Solids (Rank4 region). (a) (b) Fig. 7.6 The top 5 regions and patterns with respect to two queries: query1={As,Mo} and query2={As,B} are shown in Figure (a) and (b), respectively. 19

8. Conclusion and Future Work Building blocks for Future Multi-Objective Clustering Systems were provided in this work; namely: A dominance relation for problems in which only a subset of the objectives can be satisfied was introduced. Clustering algorithms with plug-in fitness functions and the capability to create compound fitness functions are excessively used in our approach. Initially, a repository of potentially useful clusters is generated based on a large set of objectives. Individualized, specific clusterings are then generated based on user preferences. The approach is highly generic and incorporates specific domain needs in form of single-objective fitness functions. The approach was evaluated in a case study and turned out more suitable than a single-objective clustering approach that was used for the same application in a previous paper [ACM-GIS 2008].

Challenges in Multi-objective Clustering (MOC) Find clusters that are individually good with respect to multiple objectives in an automated fashion. Provide search engine style capabilities to summarize final clustering obtained from multiple runs of clustering algorithms. 21

Traditional Clustering Algorithms & Fitness Functions Clustering Algorithms No Fitness Function Fixed Fitness Function Provides Plug-in Fitness Function Implicit Fitness Function DBSCAN Hierarchical Clustering CHAMELEON Our Work PAM K-Means Traditional clustering algorithms consider only domain independent and task independent characteristics to form a solution. Different domain tasks require different fitness functions. Traditional Clustering Algorithms 22

Challenges Cluster Summarization Original Clusters A X A B : Eliminated clusters X B A C C X B X C Typical Output DCR Output 24

Interestingness of a Pattern Interestingness of a pattern B (e.g. B= {C, D, E}) for an object o, Interestingness of a pattern B for a region c, Remark: Purity (i(B,o)>0) measures the percentage of objects that exhibit pattern B in region c.

Characteristics of the Top5 Regions Table 7.7 Top 5 Regions Ranked by Reward of the Query {As,Mo} Table 7.8 Top 5 Regions Ranked by Reward of the Query {As, B} 26

Representative-based Clustering 2 Attribute1 1 3 Attribute2 4 Objective: Find a set of objects OR such that the clustering X obtained by using the objects in OR as representatives minimizes q(X). Properties: Cluster shapes are convex polygons Popular Algorithms: K-means. K-medoids

Mining Regional Knowledge in Spatial Dataset

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (8)

Similar a Mining Regional Knowledge in Spatial Dataset

Similar a Mining Regional Knowledge in Spatial Dataset (20)

Más de butest

Más de butest (20)

Mining Regional Knowledge in Spatial Dataset

Notas del editor