Cluster Analysis




    Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary




What is Cluster Analysis?
Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Grouping a set of data objects into clusters
Clustering is unsupervised classification: no
predefined classes
Clustering is used:
  As a stand-alone tool to get insight into data distribution
     Visualization of clusters may unveil important information
  As a preprocessing step for other algorithms
     Efficient indexing or compression often relies on clustering




   General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
   create thematic maps in GIS by clustering feature spaces
   detect spatial clusters and explain them in spatial data
   mining
Image Processing
   cluster images based on their visual content
Economic Science (especially market research)
WWW and IR
   document classification
   cluster Weblog data to discover groups of similar access
   patterns




What Is Good Clustering?
A good clustering method will produce high
quality clusters with
   high intra-class similarity
   low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation.
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns.




Requirements of Clustering in Data
Mining
  Scalability
  Ability to deal with different types of attributes
  Discovery of clusters with arbitrary shape
  Minimal requirements for domain knowledge to
  determine input parameters
  Able to deal with noise and outliers
  Insensitive to order of input records
  High dimensionality
  Incorporation of user-specified constraints
  Interpretability and usability




Outliers
  Outliers are objects that do not belong to any
  cluster or form clusters of very small cardinality



[Figure: scatter plot showing a dense group of points labeled “cluster” and a few isolated points labeled “outliers”.]

  In some applications we are interested in
  discovering outliers, not clusters (outlier
  analysis)




      Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary




Data Structures
Data matrix (two modes): the “classic” data input; rows are
tuples/objects, columns are attributes/dimensions

        [ x_11  ...  x_1f  ...  x_1p ]
        [  ...  ...   ...  ...   ... ]
        [ x_i1  ...  x_if  ...  x_ip ]
        [  ...  ...   ...  ...   ... ]
        [ x_n1  ...  x_nf  ...  x_np ]

Dissimilarity or distance matrix (one mode): the desired data input to
some clustering algorithms; both rows and columns are objects

        [   0                                ]
        [ d(2,1)    0                        ]
        [ d(3,1)  d(3,2)    0                ]
        [   :       :       :                ]
        [ d(n,1)  d(n,2)   ...    ...    0   ]
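The two structures are directly related: once a distance function is chosen, a data matrix can be turned into a dissimilarity matrix. The sketch below is my own illustration (not from the slides); the sample values, the Euclidean choice, and the function name are assumptions.

```python
import numpy as np

# Hypothetical 4x2 data matrix: 4 objects (rows) x 2 attributes (columns).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [8.0, 9.0],
              [9.0, 8.0]])

def dissimilarity_matrix(X):
    """Turn an n x p data matrix into an n x n Euclidean distance matrix."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):          # only the lower triangle is needed (one mode)
            D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            D[j, i] = D[i, j]       # symmetry: d(i, j) = d(j, i)
    return D

print(dissimilarity_matrix(X))
```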




Measuring Similarity in Clustering
 Dissimilarity/Similarity metric:

     The dissimilarity d(i, j) between two objects i and j is
     expressed in terms of a distance function, which is
     typically a metric:
     d(i, j) ≥ 0 (non-negativity)
     d(i, i) = 0 (isolation)
     d(i, j) = d(j, i) (symmetry)
     d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

 The definitions of distance functions are usually
 different for interval-scaled, boolean, categorical,
 ordinal and ratio-scaled variables.

 Weights may be associated with different variables
 based on applications and data semantics.
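The four metric properties listed above can be checked mechanically. The sketch below is my own illustration (not part of the slides) and exercises them for the Euclidean distance on a few made-up 2-D objects.

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# A few made-up 2-D objects used only to exercise the four properties.
objects = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (6.0, 2.0)]

for i, j, h in itertools.permutations(objects, 3):
    assert euclidean(i, j) >= 0                                        # non-negativity
    assert euclidean(i, i) == 0                                        # isolation
    assert math.isclose(euclidean(i, j), euclidean(j, i))              # symmetry
    assert euclidean(i, j) <= euclidean(i, h) + euclidean(h, j) + 1e-9  # triangle inequality
print("Euclidean distance satisfies the metric properties on this sample.")
```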




Types of data in cluster analysis
     Interval-scaled variables
        e.g., salary, height

     Binary variables
        e.g., gender (M/F), has_cancer(T/F)

     Nominal (categorical) variables
        e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

     Ordinal variables
        e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

     Ratio-scaled variables
        population growth (1,10,100,1000,...)

     Variables of mixed types
        multiple attributes with various types




Similarity and Dissimilarity Between Objects
  Distance metrics are normally used to measure
  the similarity or dissimilarity between two data
  objects
  The most popular conform to Minkowski distance:
         Lp(i, j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + ... + |xin − xjn|^p)^(1/p)

   where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two
    n-dimensional data objects, and p is a positive integer
  If p = 1, L1 is the Manhattan (or city block) distance:

         L1(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xin − xjn|




Similarity and Dissimilarity Between
Objects (Cont.)
  If p = 2, L2 is the Euclidean distance:
         d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xin − xjn|^2)
      Properties
         d(i,j) ≥ 0
         d(i,i) = 0
         d(i,j) = d(j,i)
         d(i,j) ≤ d(i,k) + d(k,j)
  Also one can use a weighted distance:
         d(i, j) = sqrt(w1 |xi1 − xj1|^2 + w2 |xi2 − xj2|^2 + ... + wn |xin − xjn|^2)
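A minimal sketch of these distances (my own illustration, not from the slides; the vectors and weights are made up) covering the Manhattan (p = 1), Euclidean (p = 2), and weighted cases:

```python
def minkowski(i, j, p=2, w=None):
    """L_p distance between two equal-length numeric vectors.
    w: optional per-attribute weights (the slide defines the weighted form for p = 2)."""
    if w is None:
        w = [1.0] * len(i)
    return sum(wf * abs(xi - xj) ** p for wf, xi, xj in zip(w, i, j)) ** (1.0 / p)

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, p=1))                      # Manhattan: 3 + 2 + 0 = 5
print(minkowski(a, b, p=2))                      # Euclidean: sqrt(9 + 4 + 0) ≈ 3.61
print(minkowski(a, b, p=2, w=[0.5, 2.0, 1.0]))   # weighted Euclidean
```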




 Binary Variables
  A binary variable has two states: 0 absent, 1 present
  A contingency table for binary data (object i vs. object j):

                         object j
                       1      0     sum
    object i    1      a      b     a+b
                0      c      d     c+d
               sum    a+c    b+d     p

  Simple matching coefficient distance (invariant, if the binary
  variable is symmetric):
          d(i, j) = (b + c) / (a + b + c + d)
  Jaccard coefficient distance (noninvariant if the binary
  variable is asymmetric):
          d(i, j) = (b + c) / (a + b + c)




Binary Variables
  Another approach is to define the similarity of two
  objects and not their distance.
  In that case we have the following:
    Simple matching coefficient similarity:

                s(i, j) = (a + d) / (a + b + c + d)

    Jaccard coefficient similarity:

                s(i, j) = a / (a + b + c)

   Note that: s(i,j) = 1 – d(i,j)




Dissimilarity between Binary Variables
    Example (Jaccard coefficient)
    Name    Fever   Cough   Test-1   Test-2   Test-3   Test-4
    Jack    1       0       1        0        0        0
    Mary    1       0       1        0        1        0
    Jim     1       1       0        0        0        0
      all attributes are asymmetric binary
      1 denotes presence or positive test
      0 denotes absence or negative test
            d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
            d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
            d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
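A small sketch that reproduces these values by counting a, b, and c directly from the binary vectors (my own illustration; the function name is arbitrary):

```python
def jaccard_distance(x, y):
    """Jaccard coefficient distance for asymmetric binary vectors: (b + c) / (a + b + c)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # 1-1 matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_distance(jack, mary), 2))   # 0.33
print(round(jaccard_distance(jack, jim), 2))    # 0.67
print(round(jaccard_distance(jim, mary), 2))    # 0.75
```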




A simpler definition
 Each variable is mapped to a bitmap (binary vector)
    Name   Fever    Cough        Test-1   Test-2    Test-3    Test-4
    Jack   1        0            1        0         0         0
    Mary   1        0            1        0         1         0
    Jim    1        1            0        0         0         0
   Jack:   101000
   Mary:   101010
   Jim:    110000
 Simple match distance:

              d(i, j) = (number of non-common bit positions) / (total number of bits)

 Jaccard coefficient:

              d(i, j) = 1 − (number of 1's in i ∧ j) / (number of 1's in i ∨ j)
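A sketch of this bitmap view using Python integers as bit vectors (my own illustration, assuming 6-bit bitmaps as in the example above):

```python
def simple_match_distance(x, y, nbits):
    """Fraction of bit positions at which the two bitmaps differ."""
    return bin(x ^ y).count("1") / nbits

def jaccard_distance_bits(x, y):
    """1 - (number of 1's in x AND y) / (number of 1's in x OR y)."""
    return 1 - bin(x & y).count("1") / bin(x | y).count("1")

jack = 0b101000
mary = 0b101010
jim  = 0b110000
print(simple_match_distance(jack, mary, 6))   # 1/6
print(jaccard_distance_bits(jack, mary))      # 1 - 2/3 = 0.33...
print(jaccard_distance_bits(jack, jim))       # 1 - 1/3 = 0.67...
```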




 Variables of Mixed Types
  A database may contain all six types of variables
    symmetric binary, asymmetric binary, nominal, ordinal,
    interval and ratio-scaled.
  One may use a weighted formula to combine their
  effects.
                 d(i, j) = ( Σ_{f=1..p} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
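A sketch of how this weighted combination might be computed, assuming the per-variable dissimilarities d_ij^(f) have already been scaled to [0, 1]; the indicator convention described in the comment follows the usual textbook treatment and is an assumption here, as are the sample values.

```python
def mixed_dissimilarity(d_f, delta_f):
    """d(i, j) = sum_f delta_f * d_f / sum_f delta_f.
    d_f:     per-variable dissimilarities d_ij^(f), each scaled to [0, 1]
    delta_f: indicator delta_ij^(f); conventionally 0 if variable f is missing
             for i or j (or is an asymmetric binary variable with a 0-0 match),
             and 1 otherwise."""
    num = sum(d * delta for d, delta in zip(d_f, delta_f))
    den = sum(delta_f)
    return num / den if den > 0 else 0.0

# Hypothetical object pair described by three variables of different types.
print(mixed_dissimilarity(d_f=[0.2, 1.0, 0.5], delta_f=[1, 1, 0]))  # (0.2 + 1.0) / 2 = 0.6
```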




Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary




Major Clustering Approaches
Partitioning algorithms: Construct random partitions and
then iteratively refine them by some criterion
Hierarchical algorithms: Create a hierarchical
decomposition of the set of data (or objects) using some
criterion
Density-based: based on connectivity and density
functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of the data to
the given model




Cluster Analysis
   What is Cluster Analysis?
   Types of Data in Cluster Analysis
   A Categorization of Major Clustering Methods
   Partitioning Methods
   Hierarchical Methods
   Density-Based Methods
   Grid-Based Methods
   Model-Based Clustering Methods
   Outlier Analysis
   Summary




Partitioning Algorithms: Basic Concepts
 Partitioning method: Construct a partition of a
 database D of n objects into a set of k clusters
 Given a k, find a partition of k clusters that
 optimizes the chosen partitioning criterion
   Global optimal: exhaustively enumerate all partitions
   Heuristic methods: k-means and k-medoids algorithms
   k-means (MacQueen’67): Each cluster is represented by
   the center of the cluster
    k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
   Rousseeuw’87): Each cluster is represented by one of the
   objects in the cluster




The k-means Clustering Method
 Given k, the k-means algorithm is
 implemented in 4 steps:
 1.   Partition objects into k nonempty subsets
 2.   Compute seed points as the centroids of
      the clusters of the current partition. The
      centroid is the center (mean point) of the
      cluster.
 3.   Assign each object to the cluster with the
      nearest seed point.
  4.   Go back to Step 2; stop when the assignments no
       longer change.
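A minimal sketch of these four steps (my own illustration, not the slides' code; it seeds the partition with k random points rather than a random partition, and uses squared Euclidean distance):

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch for small 2-D data sets given as (x, y) tuples."""
    # Step 1: start from k arbitrary seed points.
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[c])))
            clusters[idx].append(p)
        # Step 2: recompute the centroid (mean point) of each cluster.
        new_centroids = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when the assignments (equivalently, the centroids) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2))
```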




The k-means Clustering Method
Example
[Figure: four scatter plots (axes 0–10) showing successive k-means iterations on a small 2-D data set.]
Comments on the k-means Method
Strength
  Relatively efficient: O(tkn), where n is # objects, k is #
  clusters, and t is # iterations. Normally, k, t << n.
  Often terminates at a local optimum.
Weaknesses
  Applicable only when mean is defined, then what about
  categorical data?
  Need to specify k, the number of clusters, in advance
  Unable to handle noisy data and outliers
  Not suitable to discover clusters with non-convex shapes




The K-Medoids Clustering Method
 Find representative objects, called medoids, in
 clusters
 PAM (Partitioning Around Medoids, 1987)
   starts from an initial set of medoids and iteratively
   replaces one of the medoids by one of the non-medoids if
   it improves the total distance of the resulting clustering
   PAM works effectively for small data sets, but does not
   scale well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized
 sampling




PAM (Partitioning Around Medoids)
(1987)
   PAM (Kaufman and Rousseeuw, 1987), built into the
   statistical package S+
   Use real object to represent the cluster
  1.   Select k representative objects arbitrarily
  2.   For each pair of non-selected object h and selected
       object i, calculate the total swapping cost TCih
  3.   For each pair of i and h,
          If TCih < 0, i is replaced by h
          Then assign each non-selected object to the most
          similar representative object
  4.   repeat steps 2-3 until there is no change




PAM Clustering: Total swapping cost
TCih=∑jCjih
  i is a current medoid, h is a non-
  selected object
  Assume that i is replaced by h in the
  set of medoids
  TCih = 0;
  For each non-selected object j ≠ h:
       TCih += d(j,new_medj)-d(j,prev_medj):
         new_medj = the closest medoid to j after i is
         replaced by h
         prev_medj = the closest medoid to j before i
         is replaced by h
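A sketch of this cost computation (my own illustration; the `dist` function, the sample points, and the medoid choices are assumptions):

```python
import math

def total_swapping_cost(objects, medoids, i, h, dist):
    """TCih: change in total distance if medoid i is replaced by candidate h.
    Follows the slide: sum over every non-selected object j of
    d(j, new closest medoid) - d(j, previous closest medoid)."""
    new_medoids = [m for m in medoids if m != i] + [h]
    tc = 0.0
    for j in objects:
        if j in medoids or j == h:
            continue
        prev_d = min(dist(j, m) for m in medoids)      # d(j, prev_med_j)
        new_d = min(dist(j, m) for m in new_medoids)   # d(j, new_med_j)
        tc += new_d - prev_d
    return tc

euclid = lambda a, b: math.dist(a, b)
objs = [(1, 1), (2, 1), (8, 8), (9, 8), (5, 5)]
print(total_swapping_cost(objs, medoids=[(1, 1), (8, 8)], i=(8, 8), h=(9, 8), dist=euclid))
```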




PAM Clustering: Total swapping cost TCih = ∑j Cjih
[Figure: four scatter plots (axes 0–10) illustrating the four cases for a non-selected object j when medoid i is replaced by candidate h; t denotes another current medoid.]
   Case 1 (j belongs to i and is reassigned to h):                 Cjih = d(j, h) - d(j, i)
   Case 2 (j belongs to another medoid t and stays with t):        Cjih = 0
   Case 3 (j belongs to i and is reassigned to another medoid t):  Cjih = d(j, t) - d(j, i)
   Case 4 (j belongs to another medoid t and is reassigned to h):  Cjih = d(j, h) - d(j, t)

CLARA (Clustering Large Applications)
 CLARA (Kaufmann and Rousseeuw in 1990)
    Built into statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies
 PAM on each sample, and gives the best
 clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
   Efficiency depends on the sample size
   A good clustering based on samples will not necessarily
   represent a good clustering of the whole data set if the
   sample is biased
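A sketch of the CLARA loop (my own illustration; it assumes a `pam` routine is supplied by the caller and uses arbitrary default sampling settings):

```python
import random

def clara(objects, k, dist, pam, num_samples=5, sample_size=40):
    """CLARA sketch: run PAM on several random samples and keep the medoid set
    with the lowest total distance over the *whole* data set.
    pam: user-supplied function (sample, k, dist) -> list of medoids (assumed here,
    not implemented)."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(objects, min(sample_size, len(objects)))
        medoids = pam(sample, k, dist)
        # Evaluate the candidate medoids on all objects, not just the sample.
        cost = sum(min(dist(o, m) for m in medoids) for o in objects)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```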




CLARANS (“Randomized” CLARA)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a graph
where every node is a potential solution, that is, a set of k
medoids
If a local optimum is found, CLARANS starts from a new
randomly selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
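A sketch of the CLARANS search (my own illustration; the parameter names `numlocal` and `maxneighbor` follow Ng and Han's description, the rest of the code is an assumption):

```python
import random

def clarans(objects, k, dist, numlocal=2, maxneighbor=50):
    """CLARANS sketch: each graph node is a set of k medoids; a neighbor differs
    in exactly one medoid. Examine at most maxneighbor random neighbors per node,
    restarting numlocal times from a randomly selected node."""
    def cost(medoids):
        return sum(min(dist(o, m) for m in medoids) for o in objects)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = random.sample(objects, k)           # random starting node
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:
            # Random neighbor: swap one medoid with one non-medoid.
            out = random.choice(current)
            inn = random.choice([o for o in objects if o not in current])
            neighbor = [m for m in current if m != out] + [inn]
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                tried = 0                             # moved to a better node; restart the count
            else:
                tried += 1
        if current_cost < best_cost:                  # local optimum for this restart
            best, best_cost = current, current_cost
    return best, best_cost
```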



