DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011

Lecture 9: Clustering

by Kritsada Sriphaew (sriphaew.k AT gmail.com)
1
Topics
 What  is Cluster Analysis?
 Types of Attributes in Cluster Analysis
 Major Clustering Approaches
      Partitioning Algorithms
      Hierarchical Algorithms




 2                           Data Warehousing and Data Mining by Kritsada Sriphaew
Classification vs. Clustering
               Classification: Supervised learning
               Learns a method for predicting the
                 instance class from pre-labeled
                 (classified) instances




3                                         Clustering Analysis
Clustering

             Unsupervised learning:
             Finds “natural” grouping of
               instances given un-labeled data




4                                  Clustering Analysis
Clustering Methods
   Many different methods and algorithms:
        For numeric and/or symbolic data
        Deterministic vs. probabilistic
        Exclusive vs. overlapping
        Hierarchical vs. flat
        Top-down vs. bottom-up




5                                           Clustering Analysis
What is Cluster Analysis ?
       Cluster: a collection of data objects
         High similarity of objects within a cluster
         Low similarity of objects across clusters
       Cluster analysis
         Grouping a set of data objects into clusters
       Clustering is an unsupervised classification: no predefined
        classes
       Typical applications
         As a stand-alone tool to get insight into data distribution
         As a preprocessing step for other algorithms



    6                                                       Clustering Analysis
General Applications of Clustering
       Pattern Recognition
          In biology, derive plant and animal taxonomies, categorize genes
       Spatial Data Analysis
         create thematic maps in GIS by clustering feature spaces
         detect spatial clusters and explain them in spatial data mining
       Image Processing
       Economic Science (e.g. market research)
           discover distinct groups in customer bases and characterize customer
            groups based on purchasing patterns
       WWW
         Document classification
         Cluster Weblog data to discover groups of similar access patterns



    7                                                                          Clustering Analysis
Examples of Clustering Applications
       Marketing: Help marketers discover distinct groups in their
        customer bases, and then use this knowledge to develop targeted
        marketing programs
       Land use: Identification of areas of similar land use in an earth
        observation database
       Insurance: Identifying groups of motor insurance policy holders
        with a high average claim cost
       City-planning: Identifying groups of houses according to their
        house type, value, and geographical location
       Earth-quake studies: Observed earth quake epicenters should be
        clustered along continent faults

    8                                                        Clustering Analysis
Clustering Evaluation
 Manual inspection
 Benchmarking on existing labels
 Cluster quality measures
        distance measures
        high similarity within a cluster, low across clusters




 9                                                           Clustering Analysis
Criteria for Clustering
    A good clustering method will produce high quality clusters
     with
      high intra-class similarity
      low inter-class similarity
    The quality of a clustering result depends on:
        both the similarity measure used by the method and its implementation
        the method's ability to discover some or all of the hidden patterns




    10                                                       Clustering Analysis
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
  determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability


 11                                              Clustering Analysis
The distance function
   Simplest case: one numeric attribute A
         Distance(X,Y) = |A(X) – A(Y)|
   Several numeric attributes:
         Distance(X,Y) = Euclidean distance between X,Y
 Nominal attributes: distance is set to 1 if values are
  different, 0 if they are equal
 Are all attributes equally important?
         Weighting the attributes might be necessary


 12                                                     Clustering Analysis
From Data Matrix to Similarity or Dissimilarity Matrices
    Data matrix (or object-by-attribute structure)
        m objects with n attributes, e.g., relational data

            $X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1n} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{in} \\ \vdots & & \vdots & & \vdots \\ x_{m1} & \cdots & x_{mj} & \cdots & x_{mn} \end{pmatrix}$

    Similarity and dissimilarity matrices
        a collection of proximities for all pairs of m objects:

            $S = \begin{pmatrix} 1 & & & \\ s_{i1} & 1 & & \\ \vdots & \vdots & \ddots & \\ s_{m1} & \cdots & s_{mj} & 1 \end{pmatrix} \quad (s_{ij} = s_{ji},\; 0 \le s_{ij} \le 1)$

            $D = \begin{pmatrix} 0 & & & \\ d_{i1} & 0 & & \\ \vdots & \vdots & \ddots & \\ d_{m1} & \cdots & d_{mj} & 0 \end{pmatrix} \quad (d_{ij} = d_{ji},\; 0 \le d_{ij} \le 1)$

    13                                                      Clustering Analysis
Distance Functions (Overview)
   To transform a data matrix to similarity or dissimilarity
    matrices, we need a definition of distance.
   Some definitions of distance functions depend on the type of
    attributes
        interval-scaled attributes
        Boolean attributes
        nominal, ordinal and ratio attributes.
   Weights should be associated with different attributes based
    on applications and data semantics.
   It is hard to define “similar enough” or “good enough”
        the answer is typically highly subjective.


    14                                                 Clustering Analysis
Similarity and Dissimilarity Between Objects (I)
 Distances are normally used to measure the similarity
  or dissimilarity between two data objects
 Some popular ones include: Minkowski distance:
           $d_{ij} = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q}$

    where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data
    objects, and q is a positive integer
   If q = 1, d is Manhattan distance
           $d_{ij} = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$


 15                                                                        Clustering Analysis
Similarity and Dissimilarity Between Objects (II)
    If q = 2, d is Euclidean distance:

           $d_{ij} = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$

    Properties
     d(i,j) >= 0
     d(i,i) = 0
     d(i,j) = d(j,i)
     d(i,j) <= d(i,k) + d(k,j)
    Also one can use weighted distance, parametric Pearson product
     moment correlation, or other dissimilarity measures.
             $d_{ij} = \sqrt{w_1 |x_{i1}-x_{j1}|^2 + w_2 |x_{i2}-x_{j2}|^2 + \cdots + w_p |x_{ip}-x_{jp}|^2}$



    16                                                                  Clustering Analysis
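To make the Minkowski family concrete, here is a small illustrative sketch in Python (not from the slides; the function name, example vectors, and optional weights argument are my own):

```python
def minkowski(x, y, q=2, weights=None):
    """Minkowski distance between two p-dimensional points.
    q = 1 gives Manhattan distance, q = 2 gives Euclidean distance.
    Optional per-attribute weights implement the weighted variant."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * abs(a - b) ** q for w, a, b in zip(weights, x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(i, j, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5
```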
Types of Attributes in Clustering
    Interval-scaled attributes
        Continuous measures of a roughly linear scale
    Binary attributes
        Two-state measures: 0 or 1
    Nominal, ordinal, and ratio attributes
        More than two states, nominal or ordinal or nonlinear scale
    Mixed types
        Mixture of interval-scaled, symmetric binary, asymmetric binary,
         nominal, ordinal, or ratio-scaled attributes


    17                                              Clustering Analysis: Types of Attributes
Interval-valued Attributes
    Standardize data
        Calculate the mean absolute deviation:
                  $s_f = \tfrac{1}{n}\left(|x_{1f}-m_f| + |x_{2f}-m_f| + \cdots + |x_{nf}-m_f|\right)$
     where        $m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

        Calculate the standardized measurement (mean-absolute-based z-score):
                  $z_{if} = \frac{x_{if} - m_f}{s_f}$
    Using mean absolute deviation is more robust than using standard
     deviation since the z-scores of outliers do not become too small.
     Hence, the outliers remain detectable.

    18                                                      Clustering Analysis: Types of Attributes
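A quick sketch of this standardization (illustrative Python, function name my own), assuming one numeric attribute stored as a list:

```python
def standardize(values):
    """Mean-absolute-deviation based z-scores for one interval-scaled attribute."""
    n = len(values)
    m_f = sum(values) / n                              # attribute mean
    s_f = sum(abs(x - m_f) for x in values) / n        # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

print(standardize([1, 3, 5, 7, 9]))   # symmetric data gives symmetric z-scores
```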
Binary Attributes
    A binary variable has two possible outcomes: 1 (positive/present) or 0
     (negative/absent).
        If there is no preference for which outcome should be coded as 0 and
         which as 1, the binary variable is called symmetric.
        If the outcomes are not equally important, the binary variable is called
         asymmetric, such as "is color-blind" for a human being. The most
         important outcome is usually coded as 1 (present) and the other as 0
         (absent).
    A contingency table for binary data (comparing objects i and j):

                                  Object j
                              1       0       sum
                      1       a       b       a+b
          Object i    0       c       d       c+d
                     sum     a+c     b+d       p

    Simple matching coefficient (invariant, if the binary variable is
     symmetric):
                          $d_{ij} = \frac{b+c}{a+b+c+d}$
    Jaccard coefficient (noninvariant if the binary variable is
     asymmetric):
                          $d_{ij} = \frac{b+c}{a+b+c}$

    19                                                 Clustering Analysis: Types of Attributes
Dissimilarity on Binary Attributes
    Example
          Name     Gender    Fever   Cough     Test-1   Test-2       Test-3       Test-4
          Jack     M         Y       N         P        N            N            N
          Mary     F         Y       N         P        N            P            N
          Jim      M         Y       P         N        N            N            N

    The gender is a symmetric attribute and the remaining attributes are asymmetric
     binary. Here, let the values Y and P be set to 1, and the value N be set to 0.
     Then calculate only the asymmetric binary attributes:
                       d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
                       d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
                       d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
    20                                                  Clustering Analysis: Types of Attributes
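As an illustration (my own sketch, not from the slides), the Jaccard-style dissimilarity over the asymmetric binary attributes can be reproduced as follows:

```python
def asymmetric_binary_dissimilarity(x, y):
    """Jaccard-style dissimilarity d = (b + c) / (a + b + c) over 0/1 vectors,
    ignoring the negative matches (cell d of the contingency table)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0 for the asymmetric attributes (fever, cough, test-1 .. test-4)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(asymmetric_binary_dissimilarity(jack, mary))  # 0.33
print(asymmetric_binary_dissimilarity(jack, jim))   # 0.67
print(asymmetric_binary_dissimilarity(jim, mary))   # 0.75
```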
Nominal Attributes
    A generalization of the binary variable in that it can take
     more than 2 states, e.g., red, yellow, blue, green
    Method 1: Simple matching
        m is the # of matches, p is the total # of nominal attributes
                                pm
                          dij 
                                 p
    Method 2: Use a large number of binary variables
        creating a new binary variable for each of the M nominal
         states


    21                                           Clustering Analysis: Types of Attributes
Ordinal Attributes
    An ordinal variable can be discrete or continuous
    Order is important, e.g., rank
    Can be treated like interval-scaled:
      replace $x_{if}$ by its rank  $r_{if} \in \{1, \ldots, M_f\}$
      map the range of each variable onto [0, 1] by replacing the rank of the
       i-th object in the f-th variable by
                            $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
      compute the dissimilarity using methods for interval-scaled
       variables
    22                                      Clustering Analysis: Types of Attributes
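A small sketch of this rank normalization (illustrative Python, names and example states my own):

```python
def normalize_ordinal(values, ordered_states):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    M_f = len(ordered_states)
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M_f - 1) for v in values]

print(normalize_ordinal(["bronze", "gold", "silver"],
                        ordered_states=["bronze", "silver", "gold"]))
# [0.0, 1.0, 0.5]
```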
Ratio-Scaled Attributes
  Ratio-scaled variable: a positive measurement on a nonlinear
   scale, approximately at exponential scale, such as $Ae^{Bt}$
   or $Ae^{-Bt}$
 Methods:
(1) treat them like interval-scaled attributes
        not a good choice!
(2) apply logarithmic transformation
         yif = log(xif)
(3) treat them as continuous ordinal data and
    treat their rank as interval-scaled.
    23                                   Clustering Analysis: Types of Attributes
Mixed Types
    A database may contain all the six types of attributes
      symmetric binary, asymmetric binary, nominal, ordinal,
       interval and ratio.
    One may use a weighted formula to combine their effects.
                               p 1 ij f ) dij f )
                                       (       (
                       dij    f p
                                 f 1 ij f )
                                           (
        f is binary or nominal:
        dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
        f is interval-based: use the normalized distance
        f is ordinal or ratio-scaled
               compute ranks rif and
               treat (normalized) zif as interval-scaled

    24                                            Clustering Analysis: Types of Attributes
Major Clustering Approaches
    Partitioning algorithms: Construct various partitions and then
     evaluate them by some criterion
    Hierarchy algorithms: Create a hierarchical decomposition of the
     set of data (or objects) using some criterion
    Density-based: based on connectivity and density functions
    Grid-based: based on a multiple-level granularity structure
    Model-based: A model is hypothesized for each of the clusters
     and the idea is to find the best fit of the data to the given model




    25                                      Clustering Analysis: Clustering Approaches
Partitioning Approach
    Construct a partition of a database D of n objects into a set of k
     clusters
    Given a k, find a partition of k clusters that optimizes the chosen
     partitioning criterion
    Global optimal: exhaustively enumerate all partitions
    Heuristic methods: k-means and k-medoids algorithms
    k-means (MacQueen’67): Each cluster is represented by the
     center of the cluster
    k-medoids or PAM (Partition around medoids) (Kaufman &
     Rousseeuw’87): Each cluster is represented by one of the objects
     in the cluster

    26                                       Clustering Analysis: Partitioning Algorithms
The K-Means Clustering Method
(Overview)
    Given k, the k-means algorithm is implemented in 4 steps:
      Partition objects into k nonempty subsets
      Compute seed points as the centroids of the clusters of
       the current partition. The centroid is the center (mean
       point) of the cluster.
      Assign each object to the cluster with the nearest seed
       point.
       Go back to Step 2; stop when no new assignments are made.




    27                                   Clustering Analysis: Partitioning Algorithms
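The four steps above map directly onto a short implementation. The following is a minimal illustrative sketch in plain Python (my own function and variable names, Euclidean distance assumed), not the exact code used in the lecture:

```python
import random

def kmeans(points, k, max_iter=100):
    """Basic k-means: points is a list of numeric tuples, k the number of clusters."""
    centroids = random.sample(points, k)            # pick k initial seed points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assign each point to the nearest seed
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)        # recompute centroids (cluster means)
        ]
        if new_centroids == centroids:              # stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```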
The K-Means Clustering Method
(A Graphical Example)

     (Figure: four scatter plots on a 0–10 × 0–10 grid showing successive
      k-means iterations: initial partition, centroid computation,
      reassignment, and convergence.)

28                                                     Clustering Analysis: Partitioning Algorithms
K-means example, step 1

Given k = 3, pick 3 initial cluster centers (randomly).

     (Figure: data points in the X–Y plane with randomly chosen initial
      centers k1, k2, k3.)

 29                            Clustering Analysis: Partitioning Algorithms
K-means example, step 2

Assign each point to the closest cluster center.

     (Figure: points grouped around their nearest center k1, k2, or k3.)

  30                           Clustering Analysis: Partitioning Algorithms
K-means example, step 3

Move each cluster center to the mean of its cluster.

     (Figure: centers k1, k2, k3 shifted to the means of their clusters.)

   31                            Clustering Analysis: Partitioning Algorithms
K-means example, step 4

Reassign points closest to a different new cluster center.
Q: Which points are reassigned?

     (Figure: the same points shown with the updated centers k1, k2, k3.)

   32                         Clustering Analysis: Partitioning Algorithms
K-means example, step 4 …

A: three points, shown with animation.

     (Figure: the three reassigned points near the boundaries between
      k1, k2, and k3.)

  33                        Clustering Analysis: Partitioning Algorithms
K-means example, step 4b

Re-compute the cluster means.

     (Figure: clusters after reassignment, before the centers move.)

  34                        Clustering Analysis: Partitioning Algorithms
K-means example, step 5

Move the cluster centers to the cluster means.

     (Figure: final centers k1, k2, k3 located at the cluster means.)

   35                        Clustering Analysis: Partitioning Algorithms
Problems to be considered
        What can be the problems with K-means clustering?
        Result can vary significantly depending on initial choice of seeds
         (number and position)
        Can get trapped in a local minimum
            Example:

            (Figure: example instances and initial cluster centers
             illustrating a local-minimum trap.)

        Q: What can be done?
        A: To increase chance of finding global optimum: restart with
         different random seeds.
        What can be done about outliers?

    36                                            Clustering Analysis: Partitioning Algorithms
The K-Means Clustering Method
(Strength and Weakness)
    Strength
      Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
        is # iterations. Normally, k, t << n
      Good for finding clusters with spherical shapes
      Often terminates at a local optimum. The global optimum may be
        found using techniques such as: deterministic annealing and genetic
        algorithms
    Weakness
      Applicable only when mean is defined, then what about categorical
        data?
      Need to specify k, no. of clusters, in advance
      Unable to handle noisy data and outliers
      Not suitable to discover clusters with non-convex shapes

    37                                             Clustering Analysis: Partitioning Algorithms
The K-Means Clustering Method
(Variations – I)
    A few variants of the k-means differ in:
      Selection of the initial k means
      Dissimilarity calculations
      Strategies to calculate cluster means

      (Side note: the mean of 1, 3, 5, 7, 9 is 5, but the mean of 1, 3, 5, 7, 1009
       is 205, while the median of 1, 3, 5, 7, 1009 is 5. Median advantage: it is
       not affected by extreme values.)

      K-medoids – instead of mean, use medians of each cluster
      For large databases, use sampling
    Handling categorical data: k-modes (Huang’98)
      Replacing means of clusters with modes
      Using new dissimilarity measures to deal with categorical objects
      Using a frequency-based method to update modes of clusters
      A mixture of categorical/numerical data: k-prototype method



    38                                              Clustering Analysis: Partitioning Algorithms
The K-Medoids Clustering Method
(Overview)
    Find representative objects, called medoids, in clusters
    PAM (Partitioning Around Medoids, 1987)
        starts from an initial set of medoids and iteratively
         replaces one of the medoids by one of the non-medoids if
         it improves the total distance of the resulting clustering
        PAM works effectively for small data sets, but does not
         scale well for large data sets
    CLARA (Kaufmann & Rousseeuw, 1990)
    CLARANS (Ng & Han, 1994): Randomized sampling


    39                                       Clustering Analysis: Partitioning Algorithms
The K-Medoids Clustering Method
(PAM - Partitioning Around Medoids)

    PAM (Kaufman and Rousseeuw, 1987), built into the S-Plus statistical package
    Use real object to represent the cluster
        Select k representative objects arbitrarily
        For each pair of non-selected object h and selected object
         i, calculate the total swapping cost Sih
        For each pair of i and h,
             If Sih < 0, i is replaced by h
             Then assign each non-selected object to the most similar
              representative object
        repeat steps 2-3 until there is no change


    40                                       Clustering Analysis: Partitioning Algorithms
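A compact illustrative sketch of the PAM idea (my own simplified Python, using total distance to the nearest medoid as the cost; the real algorithm evaluates the swap costs S_ih more incrementally):

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                      # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        best = total_cost(points, medoids)
        for i in range(k):                          # try every medoid / non-medoid swap
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:                     # i.e. swap cost S_ih < 0
                    medoids, best, improved = candidate, cost, True
    return medoids

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4),
        (7, 3), (7, 4), (8, 5), (7, 6)]
print(pam(data, k=2))   # prints the two selected medoids
```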
PAM example
   Cluster the following data set of ten objects into two
    clusters, i.e., k = 2.

    Point    X    Y
    X1       2    6
    X2       3    4
    X3       3    8
    X4       4    7
    X5       6    2
    X6       6    4
    X7       7    3
    X8       7    4
    X9       8    5
    X10      7    6


41                                               Clustering Analysis
PAM example, step 1
     Initialise k medoids. Let us assume c1 = (3,4) and c2 = (7,4).
     Calculate the distances so as to associate each data object with its
      nearest medoid. Assume the cost is calculated using the Minkowski
      distance metric with r = 1 (Manhattan distance).

     Data object (Xi)   Cost (distance) to c1 = (3,4)   Cost (distance) to c2 = (7,4)
     (2,6)              3                               7
     (3,8)              4                               8
     (4,7)              4                               6
     (6,2)              5                               3
     (6,4)              3                               1
     (7,3)              5                               1
     (8,5)              6                               2
     (7,6)              6                               2

     42                                                  Clustering Analysis
PAM example, step 1b
     Then the clusters become:
      Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
      Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
     Each object is assigned to its nearest medoid (see the table above), so the
      total cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.

     43                                                  Clustering Analysis
PAM example, step 2
    Select a non-medoid O′ randomly. Let us assume O′ = (7,3), so the
     candidate medoids are now c1 = (3,4) and O′ = (7,3).
    Calculate the cost of the new medoid set using the formula in step 1.
     Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

     Data object (Xi)   Cost (distance) to c1 = (3,4)   Cost (distance) to O′ = (7,3)
     (2,6)              3                               8
     (3,8)              4                               9
     (4,7)              4                               7
     (6,2)              5                               2
     (6,4)              3                               2
     (7,4)              4                               1
     (8,5)              6                               3
     (7,6)              6                               3

     44                                                  Clustering Analysis
PAM example, step 2b
 So the cost of swapping the medoid from c2 to O′ is
        S = current total cost – past total cost = 22 – 20 = 2 > 0
 Since S > 0, moving to O′ would be a bad idea; the previous
  choice was good, and the algorithm terminates here (i.e.,
  there is no change in the medoids).
 It may happen that some data points shift from one
  cluster to another cluster depending upon their
  closeness to the medoids.

 45                                            Clustering Analysis
CLARA (Clustering Large Applications) (1990)
    CLARA (Kaufmann and Rousseeuw in 1990)
        Built in statistical analysis packages, such as S+
    It draws multiple samples of the data set, applies PAM on each
     sample, and gives the best clustering as the output
    Strength: deals with larger data sets than PAM
    Weakness:
        Efficiency depends on the sample size
        A good clustering based on samples will not necessarily represent a good
         clustering of the whole data set if the sample is biased


    46                                                  Clustering Analysis: Partitioning Algorithms
CLARANS (“Randomized” CLARA) (1994)
    CLARANS (A Clustering Algorithm based on Randomized Search) (Ng
     and Han’94)
    CLARANS draws sample of neighbors dynamically
    The clustering process can be presented as searching a graph where
     every node is a potential solution, that is, a set of k medoids
    If the local optimum is found, CLARANS starts with new randomly
     selected node in search for a new local optimum
    It is more efficient and scalable than both PAM and CLARA
    Focusing techniques and spatial access structures may further improve
     its performance (Ester et al.’95)


    47                                         Clustering Analysis: Partitioning Algorithms
The Partition-Based Clustering
(Discussion)

    Result can vary significantly based on initial choice of seeds
    Algorithm can get trapped in a local minimum
      Example: four instances at the vertices of a two-
       dimensional rectangle
           Local minimum: two cluster centers at the midpoints
             of the rectangle’s long sides
    Simple way to increase chance of finding a global optimum:
     restart with different random seeds



    48                                     Clustering Analysis: Hierarchical Algorithms
Hierarchical Clustering
    Use distance matrix as clustering criteria.
    This method does not require the number of clusters k as an
     input, but needs a termination condition
         (Figure: agglomerative (AGNES) merging, read left to right over steps 0–4:
          {a}, {b}, {c}, {d}, {e} → {a,b}, {d,e} → {c,d,e} → {a,b,c,d,e};
          divisive (DIANA) splitting reads the same diagram right to left,
          steps 4 down to 0.)
    49                                                Clustering Analysis: Hierarchical Algorithms
AGNES (Agglomerative Nesting)
    Introduced in Kaufmann and Rousseeuw (1990)
    Implemented in statistical analysis packages, e.g., Splus
    Use the Single-Link method and the dissimilarity matrix.
    Merge nodes that have the least dissimilarity
    Go on in a non-descending fashion
    Eventually all nodes belong to the same cluster
     (Figure: three scatter plots on a 0–10 × 0–10 grid showing clusters being
      merged step by step until all points belong to one cluster.)

    50                                                                                                   Clustering Analysis: Hierarchical Algorithms
Dendrogram for Hierarchical Clustering
    Decompose data objects into several levels of nested
     partitioning (a tree of clusters), called a dendrogram.
    A clustering of the data objects is obtained by cutting the
     dendrogram at the desired level, then each connected
     component forms a cluster.




    51                                       Clustering Analysis: Hierarchical Algorithms
DIANA - Divisive Analysis
    Introduced in Kaufmann and Rousseeuw (1990)
    Implemented in statistical analysis packages, e.g., Splus
    Inverse order of AGNES
    Eventually each node forms a cluster on its own


         (Figure: three scatter plots on a 0–10 × 0–10 grid showing one cluster
          being split step by step until each point forms its own cluster.)

    52                                                                                                   Clustering Analysis: Hierarchical Algorithms
Hierarchical Clustering
   Major weakness of agglomerative clustering methods
      do not scale well: time complexity of at least O(n²),
       where n is the number of total objects
       can never undo what was done previously




53                                      Clustering Analysis: Hierarchical Algorithms
Distances - Hierarchical Clustering
(Overview)

    Four measures for distance between clusters are:
        Single linkage (minimum distance):
                $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert$
        Complete linkage (maximum distance):
                $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert$
        Centroid comparison (mean distance):
                $d_{\mathrm{mean}}(C_i, C_j) = \lVert m_i - m_j \rVert$
        Element comparison (average distance):
                $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} \lVert p - p' \rVert$

    54                                                  Clustering Analysis: Hierarchical Algorithms
Distances - Hierarchical Clustering
 (Graphical Representation)
     Four measures for distance between clusters are (1) single linkage, (2) complete
      linkage, (3) centroid comparison and (4) element comparison

     (Figure: three clusters with centroids marked ×, illustrating (1) the single-link
      distance between the closest pair of points, (2) the complete-link distance
      between the farthest pair, (3) the centroid distance, and (4) the element
      comparison, i.e., the average distance among all elements in two clusters.)

     55                                                  Clustering Analysis: Hierarchical Algorithms
Practice
   Use single and complete link agglomerative clustering
    to group the data described by the following
    distance matrix. Show the dendrograms.
                 A       B        C       D
        A        0       1        4       5
        B                0        2       6
        C                         0       3
        D                                 0




56                                              Clustering Analysis
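For checking the practice answer, one could run the agglomeration programmatically; this is an optional sketch (not part of the slides) assuming scipy and matplotlib are available and using the distance matrix above:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Condensed form of the distance matrix above: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)
condensed = [1, 4, 5, 2, 6, 3]

for method in ("single", "complete"):
    Z = linkage(condensed, method=method)       # agglomerative clustering
    dendrogram(Z, labels=["A", "B", "C", "D"])
    plt.title(f"{method}-link dendrogram")
    plt.show()
```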
Other Clustering Methods
    Incremental Clustering
    Probability-based Clustering, Bayesian Clustering
    EM Algorithm

Cluster Schemes
    Partitioning Methods
    Hierarchical Methods
    Density-Based Methods
    Grid-Based Methods
    Model-Based Clustering Methods


    57                                                   Clustering Analysis
Advanced Method: BIRCH (Overview)
    Balanced Iterative Reducing and Clustering using Hierarchies [Tian Zhang,
     Raghu Ramakrishnan, Miron Livny, 1996]
    Incremental, hierarchical, one scan
    Save clustering information in a tree
    Each entry in the tree contains information about one cluster
    New nodes inserted in closest entry in tree
    Only works with "metric" attributes
      Must have Euclidean coordinates
    Designed for very large data sets
      Time and memory constraints are explicit
      Treats dense regions of data points as sub-clusters
         Not all data points are important for clustering
      Only one scan of data is necessary


    58                                                             Clustering Analysis
BIRCH (Merits)
    Incremental, distance-based approach
    Decisions are made without scanning all data points, or all
     currently existing clusters
    Does not need the whole data set in advance
    Unique approach: Distance-based algorithms generally need
     all the data points to work
    Make best use of available memory while minimizing I/O
     costs
    Does not assume that the probability distributions on
     attributes are independent


    59                                                 Clustering Analysis
BIRCH – Clustering Feature and Clustering Feature Tree
    BIRCH introduces two concepts, the clustering feature and the clustering
     feature tree (CF Tree), which are used to summarize cluster
     representations.
    These structures help the clustering method achieve good speed and
     scalability in large databases and make it effective for incremental and
     dynamic clustering of incoming objects.
    Given n d-dimensional data objects or points in a cluster, we can
     define the centroid x0, radius R and diameter D of the cluster as
     follows:




    60                                                           Clustering Analysis
BIRCH – Centroid, Radius and Diameter
• Given a cluster of instances          , we define:
  • Centroid: the center of a cluster


  • Radius: average distance from member points to centroid


  • Diameter: average pair-wise distance within a cluster




 61                                                 Clustering Analysis
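The formula images on this slide did not survive extraction. As a hedge, here is an illustrative Python sketch using the standard BIRCH definitions (centroid = mean vector; radius and diameter in their square-root-of-mean-squared form), which may differ slightly in presentation from the original slide:

```python
import math

def centroid(points):
    """Centroid x0: the mean of the points (each point is a tuple of coordinates)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def radius(points):
    """Radius R: root-mean-square distance from member points to the centroid."""
    x0 = centroid(points)
    n = len(points)
    return math.sqrt(sum(sum((a - b) ** 2 for a, b in zip(p, x0)) for p in points) / n)

def diameter(points):
    """Diameter D: root-mean-square pairwise distance within the cluster."""
    n = len(points)
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

pts = [(3, 4), (4, 7), (2, 6), (3, 8), (4, 5)]
print(centroid(pts), radius(pts), diameter(pts))
```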
BIRCH – Centroid Euclidean and Manhattan distances
• The centroid Euclidean distance and centroid
  Manhattan distance are defined between any two
  clusters.
  • Centroid Euclidean distance



  • Centroid Manhattan distance




 62                                                  Clustering Analysis
BIRCH
(Average inter-cluster, Average intra-cluster, Variance increase)
• The average inter-cluster, the average intra-cluster, and the
  variance increase distances are defined as follows
  • Average inter-cluster


  • Average intra-cluster


  • Variance increase distances



 63                                                          Clustering Analysis
Clustering Feature
        CF = (N,LS,SS)
         N: Number of points in cluster
         LS: Sum of points in the cluster
         SS: Sum of squares of points in the cluster
    CF Tree
         Balanced search tree
         Node has CF triple for each child
         Leaf node represents a cluster and has a CF value for each
          subcluster in it.
         Each subcluster must satisfy a maximum diameter (threshold T).


    64                                                   Clustering Analysis
Clustering Feature Vector
 Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: linear sum of the N data points, $\sum_{i=1}^{N} X_i$
 SS: sum of squares of the N data points, $\sum_{i=1}^{N} X_i^2$

     (Figure: two groups of points on a 0–10 × 0–10 grid.)
     Points (3,4), (4,7), (2,6), (3,8), (4,5):   CF = (5, (16,30), (54,190))
     Points (6,2), (8,4), (7,2), (8,5), (7,4):   CF = (5, (36,17), (262,65))
     Merged cluster of all 10 points:            CF = (10, (52,47), (316,255))

        $CF_1 + CF_2 = (N_1 + N_2,\; LS_1 + LS_2,\; SS_1 + SS_2)$
   65                                                                                      Clustering Analysis
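A tiny sketch (illustrative Python, my own naming) of the CF triple and its additivity, reproducing the numbers above:

```python
def cf(points):
    """Clustering feature (N, LS, SS) for a list of 2-D points."""
    N = len(points)
    LS = tuple(sum(c) for c in zip(*points))                 # linear sum per dimension
    SS = tuple(sum(v * v for v in c) for c in zip(*points))  # sum of squares per dimension
    return N, LS, SS

def cf_merge(cf1, cf2):
    """CF additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

c1 = cf([(3, 4), (4, 7), (2, 6), (3, 8), (4, 5)])   # (5, (16, 30), (54, 190))
c2 = cf([(6, 2), (8, 4), (7, 2), (8, 5), (7, 4)])   # (5, (36, 17), (262, 65))
print(cf_merge(c1, c2))                             # (10, (52, 47), (316, 255))
```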
Merging two clusters
Cluster {Xi}:          Cluster {Xj}:
i = 1, 2, …, N1        j = N1+1, N1+2, …, N1+N2




                  Cluster Xl = {Xi} + {Xj}:
                  l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2
66                                            Clustering Analysis
CF Tree (branching factor B = 7, leaf capacity L = 6)

  Root:          [CF1|child1] [CF2|child2] [CF3|child3] … [CF6|child6]
  Non-leaf node: [CF1|child1] [CF2|child2] [CF3|child3] … [CF5|child5]
  Leaf nodes:    prev ← CF1 CF2 … CF6 → next        prev ← CF1 CF2 … CF4 → next

 67                                                        Clustering Analysis
Properties of CF-Tree
         Each non-leaf node has at most B entries
         Each leaf node has at most L CF entries which each satisfy threshold
          T
         Node size is determined by dimensionality of data space and input
          parameter P (page size)



 Branching factor B and threshold T



     68                                                            Clustering Analysis
BIRCH Algorithm (CF-Tree Insertion)




              Recurse down from root, find the appropriate leaf
              Follow the "closest"-CF path, w.r.t. D0 / … / D4
              Modify the leaf
              If the closest-CF leaf cannot absorb, make a new CF entry.
              If there is no room for new leaf, split the parent node
              Traverse back & up, updating CFs on the path or splitting nodes
 69                                            Clustering Analysis
Improve Clusters




70                 Clustering Analysis
BIRCH Algorithm (Overall steps)




71                                Clustering Analysis
Details of Each Step
    Phase 1: Load data into memory
        Build an initial in-memory CF-tree with the data (one scan)
        Subsequent phases become fast, accurate, less order sensitive
    Phase 2: Condense data
        Rebuild the CF-tree with a larger T
        Condensing is optional
    Phase 3: Global clustering
        Use existing clustering algorithm on CF entries
        Helps fix problem where natural clusters span nodes
    Phase 4: Cluster refining
        Do additional passes over the dataset & reassign data points to the
         closest centroid from phase 3
        Refining is optional


    72                                                           Clustering Analysis
Summary of BIRCH
    BIRCH works with very large data sets
    Explicitly bounded by computational resources.
    The computation complexity is O(n), where n is the number of
     objects to be clustered.
    Runs with specified amount of memory (P)
    Superior to CLARANS and k-MEANS
    Quality, speed, stability and scalability




    73                                                   Clustering Analysis
CURE (Clustering Using REpresentatives)
    CURE was proposed by Guha, Rastogi & Shim, 1998
    It stops the creation of a cluster hierarchy if a level consists of k
     clusters
    Each cluster has c representatives (instead of one)
        Choose c well scattered points in the cluster
        Shrink them towards the mean of the cluster by a fraction α
        The representatives capture the physical shape and geometry of the
         cluster
        It can handle arbitrarily shaped clusters and avoids the single-link effect.
    Merge the closest two clusters
        Distance of two clusters: the distance between the two closest
         representatives

    74                                                                Clustering Analysis
CURE Algorithm




75               Clustering Analysis
CURE Algorithm




76               Clustering Analysis
CURE Algorithm (Another form)




77                              Clustering Analysis
CURE Algorithm (for Large Databases)




78                                     Clustering Analysis
Data Partitioning and Clustering
     (Figure: a sample of s = 50 points in the x–y plane is divided into p = 2
      partitions of s/p = 25 points each; each partition is partially clustered
      into s/pq = 5 pre-clusters.)

    79                                    Clustering Analysis
Cure: Shrinking Representative Points
    Shrink the multiple representative points towards the gravity center by
     a fraction of α.
    Multiple representatives capture the shape of the cluster

     (Figure: representative points before and after shrinking towards the
      cluster's center of gravity.)

    80                                                           Clustering Analysis
Clustering Categorical Data: ROCK
   ROCK: RObust Clustering using linKs,
    by S. Guha, R. Rastogi, K. Shim (ICDE’99).
       Use links to measure similarity/proximity
       Not distance-based with categorical attributes
       Computational complexity: O(n² + n·m_m·m_a + n² log n), where m_a and
        m_m are the average and maximum numbers of neighbors
   Basic ideas (Jaccard coefficient):
    Similarity function and neighbors:
                          $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
    Let T1 = {1,2,3}, T2 = {3,4,5}:
                          $Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
  81                                                Clustering Analysis
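A one-line check of the Jaccard similarity used above (illustrative Python):

```python
def jaccard(t1, t2):
    """Jaccard similarity |T1 ∩ T2| / |T1 ∪ T2| between two sets."""
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({1, 2, 3}, {3, 4, 5}))   # 1/5 = 0.2
```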
ROCK: An Example
    Links: the number of common neighbors for the two points,
     computed using the Jaccard coefficient
    Use the similarity values to determine neighbors
        (pt1,pt4) = 0, (pt1,pt2) = 0, (pt1,pt3) = 0
        (pt2,pt3) = 0.6, (pt2,pt4) = 0.2
        (pt3,pt4) = 0.2
    Use 0.2 as threshold for neighbors
        Pt2 and Pt3 have 3 common neighbors
        Pt3 and Pt4 have 3 common neighbors
        Pt2 and Pt4 have 3 common neighbors
    Resulting clusters (1), (2,3,4) which makes more sense

    82                                                        Clustering Analysis
ROCK: Property & Algorithm
   Links: The number of common neighbours for the
    two points.
       {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
       {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
       link({1,2,3}, {1,2,4}) = 3  (the two points have 3 common neighbours)
   Algorithm
       Draw random sample
       Cluster with links (maybe agglomerative hierarchical)
       Label data in disk
83                                                       Clustering Analysis
CHAMELEON
        CHAMELEON: hierarchical clustering using dynamic modeling, by G.
         Karypis, E.H. Han and V. Kumar’99
        Measures the similarity based on a dynamic model
           Two clusters are merged only if the interconnectivity and
             closeness (proximity) between two clusters are high relative to
             the internal interconnectivity of the clusters and closeness of
             items within the clusters
        A two phase algorithm
          1. Use a graph partitioning algorithm: cluster objects into a large
             number of relatively small sub-clusters
          2. Use an agglomerative hierarchical clustering algorithm: find the
             genuine clusters by repeatedly combining these sub-clusters

    84                                                             Clustering Analysis
Graph-based clustering
    Sparsification techniques keep the connections to the most
     similar (nearest) neighbors of a point while breaking the
     connections to less similar points.
    The nearest neighbors of a point tend to belong to the same
     class as the point itself.
    This reduces the impact of noise and outliers and sharpens
     the distinction between clusters.




    85                                                Clustering Analysis
Overall Framework of CHAMELEON
           Data Set → Construct Sparse Graph → Partition the Graph →
           Merge Partitions → Final Clusters

      86                                                         Clustering Analysis
Density-Based Clustering Methods
    Clustering based on density (local cluster criterion), such as
     density-connected points
    Major features:
        Discover clusters of arbitrary shape
        Handle noise
        One scan
        Need density parameters as termination condition
    Several interesting studies:
        DBSCAN: Ester, et al. (KDD’96)
        OPTICS: Ankerst, et al (SIGMOD’99).
        DENCLUE: Hinneburg & D. Keim (KDD’98)
        CLIQUE: Agrawal, et al. (SIGMOD’98)

    87                                                        Clustering Analysis
Density-Based Clustering (Background)
    Two parameters:
        Eps: Maximum radius of the neighbourhood
        MinPts: Minimum number of points in an Eps-neighbourhood of that point
    N_Eps(p) = {q belongs to D | dist(p,q) <= Eps}
    Directly density-reachable: A point p is directly density-reachable
     from a point q wrt. Eps, MinPts if
     1) p belongs to N_Eps(q)
     2) core point condition: |N_Eps(q)| >= MinPts
        (Figure: p lies in the Eps-neighbourhood of the core point q;
         MinPts = 5, Eps = 1 cm.)


    88                                                             Clustering Analysis
Density-Based Clustering (Background)
    The neighborhood within a radius Eps of a given object is called the
     Eps-neighborhood of the object.
    If the Eps-neighborhood of an object contains at least a minimum
     number, MinPts, of objects, then the object is called a core object.
    Given a set of objects, D, we say that an object p is directly density-
     reachable from object q if p is within the Eps-neighborhood of q, and q
     is a core object.
    An object p is density-reachable from object q with respect to Eps and
     MinPts in a set of objects D, if there is a chain of objects p1, …, pn,
     where p1 = q and pn = p, such that p(i+1) is directly density-reachable from
     pi with respect to Eps and MinPts, for 1 ≤ i < n, pi ∈ D.
    An object p is density-connected to object q with respect to Eps and
     MinPts if there is an object o ∈ D such that both p and q are
     density-reachable from o with respect to Eps and MinPts.

    89                                                           Clustering Analysis
Density-Based Clustering
     (Figure: a chain of points q, p1, …, p in which each point is directly
      density-reachable from the previous one.)
    Density-reachable:
        A point p is density-reachable from a point q wrt. Eps, MinPts if there is a
         chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-
         reachable from pi
    Density-connected
        A point p is density-connected to a point q wrt. Eps, MinPts if there is a
         point o such that both, p and q are density-reachable from o wrt. Eps and
         MinPts.
          (Figure: p and q are both density-reachable from a common point o.)

    90                                                                   Clustering Analysis
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
    •   Relies on a density-based notion of cluster: a cluster is defined
        as a maximal set of density-connected points
    •   Discovers clusters of arbitrary shape in spatial databases with
        noise
        [Figure: core, border and outlier points of a cluster, with Eps = 1 cm and MinPts = 5]



    91                                                        Clustering Analysis
DBSCAN: The Algorithm
    •   Arbitrarily select a point p.
    •   Retrieve all points density-reachable from p wrt. Eps and MinPts.
    •   If p is a core point, a cluster is formed.
    •   If p is a border point, no points are density-reachable from p, and
        DBSCAN visits the next point of the database.
    •   Continue the process until all of the points have been processed
        (see the code sketch below).




92                                                            Clustering Analysis
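The steps above translate almost directly into code. Below is a compact, self-contained Python sketch of the DBSCAN procedure (my own illustration under the definitions of the previous slides, not the original Ester et al. implementation): unprocessed points are selected arbitrarily, a cluster is grown around each core point by collecting everything density-reachable from it, and points that end up in no cluster are labelled as noise.

from math import dist

def dbscan(D, eps, min_pts):
    """Return a cluster label (0, 1, ...) for every point in D, or -1 for noise/outliers."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(D)

    def region_query(i):                      # naive O(n) Eps-neighbourhood query
        return [j for j, q in enumerate(D) if dist(D[i], q) <= eps]

    cluster_id = -1
    for p in range(len(D)):                   # arbitrarily select the next unprocessed point
        if labels[p] is not UNVISITED:
            continue
        neighbours = region_query(p)
        if len(neighbours) < min_pts:         # not a core point: tentatively mark as noise
            labels[p] = NOISE
            continue
        cluster_id += 1                       # p is a core point, so a new cluster is formed
        labels[p] = cluster_id
        seeds = list(neighbours)
        while seeds:                          # collect all points density-reachable from p
            q = seeds.pop()
            if labels[q] == NOISE:            # a border point that was tentatively marked noise
                labels[q] = cluster_id
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(q)
            if len(q_neighbours) >= min_pts:  # only core points expand the cluster further
                seeds.extend(q_neighbours)
    return labels                             # every point is labelled with a cluster id or as noise

# Toy example: two dense groups and one outlier
D = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
     (8, 8), (8, 9), (9, 8), (9, 9), (8.5, 8.5),
     (5, 15)]
print(dbscan(D, eps=1.5, min_pts=3))          # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1]

The naive region query makes this sketch O(n^2); real implementations rely on spatial index structures, and in practice a library implementation such as scikit-learn's sklearn.cluster.DBSCAN (whose eps and min_samples parameters correspond to Eps and MinPts) is usually the sensible choice.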
