A Framework for Optimum Document Clustering:
Implementing the Cluster Hypothesis

Norbert Fuhr

University of Duisburg-Essen

March 30, 2011

Outline

    1  Introduction
    2  Cluster Metric
    3  Optimum Clustering
    4  Towards Optimum Clustering
    5  Experiments
    6  Conclusion and Outlook

Introduction

Motivation

Ad-hoc retrieval
    heuristic models:
        define a retrieval function
        evaluate to test whether it yields good quality
    Probability Ranking Principle (PRP):
        theoretical foundation for optimum retrieval
        numerous probabilistic models based on the PRP

Document clustering
    classic approach:
        define a similarity function and a fusion principle
        evaluate to test whether they yield good quality
    Optimum Clustering Principle?

Introduction

Cluster Hypothesis

Original formulation:
"closely associated documents tend to be relevant to the same
requests" (van Rijsbergen 1979)

Idea of optimum clustering:
cluster documents in such a way that, for any request, the
relevant documents occur together in one cluster
→ redefine document similarity:
documents are similar if they are relevant to the same queries

Introduction

The Optimum Clustering Framework

[Figure: overview diagram of the Optimum Clustering Framework]

Cluster Metric

Defining a Metric Based on the Cluster Hypothesis

General idea:
    evaluate a clustering w.r.t. a set of queries
    for each query and each cluster, regard the pairs of
    documents co-occurring in the cluster:
        relevant-relevant: good
        relevant-irrelevant: bad
        irrelevant-irrelevant: don't care

Cluster Metric

Pairwise precision

    Q     set of queries
    D     document collection
    R     relevance judgments: R ⊂ Q × D
    C     clustering, C = {C_1, ..., C_n} such that \cup_{i=1}^{n} C_i = D
          and \forall i, j: i \neq j \rightarrow C_i \cap C_j = \emptyset
    c_i   = |C_i| (size of cluster C_i)
    r_ik  = |{d_m \in C_i | (q_k, d_m) \in R}| (number of relevant
          documents in C_i w.r.t. q_k)

Pairwise precision (weighted average over all clusters):

    P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i (c_i - 1)}

Cluster Metric

Pairwise precision – Example

    P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i (c_i - 1)}

    Query set: disjoint classification with two classes a and b;
    three clusters: (aab|bb|aa)

    P_p = \frac{1}{7} \left( 3 \left( \frac{1}{3} + 0 \right) + 2(0 + 1) + 2(1 + 0) \right) = \frac{5}{7}

    A perfect clustering for a disjoint classification would yield
    P_p = 1;
    for arbitrary query sets, values > 1 are possible.

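To make the metric concrete, here is a minimal Python sketch of P_p (the helper name and document ids are illustrative, not from the talk); it reproduces the 5/7 of the example above:

    def pairwise_precision(clusters, relevance):
        """P_p: weighted average over clusters of the fraction of
        within-cluster (ordered) document pairs that are co-relevant.
        clusters:  list of lists of document ids
        relevance: dict mapping query id -> set of relevant doc ids
        """
        n_docs = sum(len(c) for c in clusters)
        total = 0.0
        for cluster in clusters:
            c = len(cluster)
            if c <= 1:                      # singletons contribute nothing
                continue
            for rel in relevance.values():
                r = len(rel.intersection(cluster))
                total += c * r * (r - 1) / (c * (c - 1))
        return total / n_docs

    # the (aab|bb|aa) example: one query per class
    clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6", "d7"]]
    relevance = {"a": {"d1", "d2", "d6", "d7"}, "b": {"d3", "d4", "d5"}}
    print(pairwise_precision(clusters, relevance))   # 0.714... = 5/7
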
Cluster Metric

Pairwise recall

    r_ik  = |{d_m \in C_i | (q_k, d_m) \in R}| (number of relevant
          documents in C_i w.r.t. q_k)
    g_k   = |{d \in D | (q_k, d) \in R}| (number of relevant
          documents for q_k)

Pairwise recall (micro recall):

    R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{q_k \in Q,\, g_k > 1} g_k (g_k - 1)}

Example: (aab|bb|aa)
    2 a pairs (out of 6)
    1 b pair (out of 3)

    R_p = \frac{2 + 1}{6 + 3} = \frac{1}{3}

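A matching sketch for R_p, continuing with the clusters and relevance of the previous sketch (the formula counts ordered pairs, hence r(r-1) and g(g-1); the ratio equals the slide's unordered-pair count):

    def pairwise_recall(clusters, relevance):
        """R_p: co-relevant (ordered) document pairs that end up in
        the same cluster, divided by all co-relevant pairs."""
        numerator = 0
        denominator = 0
        for rel in relevance.values():
            g = len(rel)
            if g > 1:
                denominator += g * (g - 1)
            for cluster in clusters:
                r = len(rel.intersection(cluster))
                numerator += r * (r - 1)
        return numerator / denominator

    print(pairwise_recall(clusters, relevance))   # 0.333... = 1/3
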
Cluster Metric

Perfect clustering

C is a perfect clustering iff there exists no clustering C' such that
    P_p(D, Q, R, C) < P_p(D, Q, R, C') \wedge
    R_p(D, Q, R, C) < R_p(D, Q, R, C')

strong Pareto optimum – more than one perfect clustering possible

Example:
    P_p({d1, d2, d3}, {d4, d5}) =
    P_p({d1, d2}, {d3, d4, d5}) = 1,   R_p = 2/3
    P_p({d1, d2, d3, d4, d5}) = 0.6,   R_p = 1

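The Pareto condition is easy to check mechanically. A sketch reusing pairwise_precision and pairwise_recall from above; the two-query relevance structure below is an assumption chosen so that it reproduces the slide's numbers:

    def perfect_clusterings(clusterings, relevance):
        """Keep the clusterings that no other clustering strictly
        beats in BOTH pairwise precision and pairwise recall."""
        scores = [(pairwise_precision(c, relevance),
                   pairwise_recall(c, relevance)) for c in clusterings]
        return [c for c, (p, r) in zip(clusterings, scores)
                if not any(p < p2 and r < r2 for p2, r2 in scores)]

    rel = {"q1": {"d1", "d2", "d3"}, "q2": {"d3", "d4", "d5"}}
    candidates = [
        [["d1", "d2", "d3"], ["d4", "d5"]],     # P_p = 1,   R_p = 2/3
        [["d1", "d2"], ["d3", "d4", "d5"]],     # P_p = 1,   R_p = 2/3
        [["d1", "d2", "d3", "d4", "d5"]],       # P_p = 0.6, R_p = 1
    ]
    print(len(perfect_clusterings(candidates, rel)))   # 3: all are perfect
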
Cluster Metric

Do perfect clusterings form a hierarchy?

[Figure: P_p/R_p plot showing three perfect clusterings on the Pareto
frontier]

    C   = {{d1, d2, d3, d4}}
    C'  = {{d1, d2}, {d3, d4}}
    C'' = {{d1, d2, d3}, {d4}}

C' and C'' are both perfect, yet neither is a refinement of the other
→ perfect clusterings need not form a hierarchy.

Optimum Clustering

    Usually, the clustering process has no knowledge of the relevance
    judgments
    → switch from external to internal cluster measures:
    replace the relevance judgments by estimates of the probability
    of relevance
    requires a probabilistic retrieval method yielding P(rel|q, d)
    → compute the expected cluster quality

Optimum Clustering

Expected cluster quality

Pairwise precision:

    P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i (c_i - 1)}

Expected precision:

    \pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i (c_i - 1)} \sum_{q_k \in Q} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} P(rel|q_k, d_l) \, P(rel|q_k, d_m)

Optimum Clustering

Expected precision

    \pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i (c_i - 1)} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \sum_{q_k \in Q} P(rel|q_k, d_l) \, P(rel|q_k, d_m)

Here \sum_{q_k \in Q} P(rel|q_k, d_l) P(rel|q_k, d_m) gives the expected
number of queries for which both d_l and d_m are relevant.

Transform a document into a vector of relevance probabilities:

    \tau^T(d_m) = (P(rel|q_1, d_m), P(rel|q_2, d_m), \ldots, P(rel|q_{|Q|}, d_m))

    \pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{1}{c_i - 1} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)

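In code, \tau is just one probability vector per document and \pi an average of dot products. A minimal sketch assuming NumPy; the toy vectors below are made up:

    import numpy as np

    def expected_precision(clusters, tau):
        """pi: tau maps a doc id to its vector of relevance
        probabilities (P(rel|q_1,d), ..., P(rel|q_|Q|,d))."""
        n_docs = sum(len(c) for c in clusters)
        total = 0.0
        for cluster in clusters:
            c = len(cluster)
            if c <= 1:
                continue
            # sum of tau(d_l) . tau(d_m) over ordered pairs, d_l != d_m
            pair_sum = sum(tau[dl] @ tau[dm]
                           for dl in cluster for dm in cluster if dl != dm)
            total += pair_sum / (c - 1)
        return total / n_docs

    rng = np.random.default_rng(0)
    tau = {d: rng.random(5) for d in ["d1", "d2", "d3", "d4"]}  # 5 fictitious queries
    print(expected_precision([["d1", "d2"], ["d3", "d4"]], tau))
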
Optimum Clustering

Expected recall

    R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{q_k \in Q,\, g_k > 1} g_k (g_k - 1)}

Direct estimation requires estimating the denominator → biased
estimates.
But: the denominator is constant for a given query set → ignore it
and compute an estimate of the numerator only:

    \rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)

(The scalar product \tau^T(d_l) \cdot \tau(d_m) gives the expected
number of queries for which both d_l and d_m are relevant.)

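Since \rho drops both normalizations, its sketch is even shorter (continuing with the tau dict from the expected-precision sketch):

    def rho(clusters, tau):
        """Unnormalized expected recall: expected number of
        co-relevant (ordered) pairs kept inside a cluster."""
        return sum(tau[dl] @ tau[dm]
                   for cluster in clusters
                   for dl in cluster for dm in cluster if dl != dm)

    print(rho([["d1", "d2"], ["d3", "d4"]], tau))
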
Optimum Clustering

Optimum clustering

C is an optimum clustering iff there exists no clustering C' such that
    \pi(D, Q, C) < \pi(D, Q, C') \wedge \rho(D, Q, C) < \rho(D, Q, C')

    Pareto optima
    the set of perfect (and optimum) clusterings does not even form a
    cluster hierarchy
    → no hierarchic clustering method will find all optima!

Towards Optimum Clustering

Developing an (optimum) clustering method requires four design
decisions:

    1  Set of queries,
    2  Probabilistic retrieval method,
    3  Document similarity metric, and
    4  Fusion principle.

Towards Optimum Clustering

A simple application

    1  Set of queries: all possible one-term queries
    2  Probabilistic retrieval method: tf·idf
    3  Document similarity metric: \tau^T(d_l) \cdot \tau(d_m)
    4  Fusion principle: group-average clustering, with the
       per-cluster criterion

    \sigma(C) = \frac{1}{c(c - 1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)

→ this yields a standard clustering method

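This "simple application" can be approximated with standard tooling. A sketch assuming scikit-learn and SciPy are available; the four toy documents are illustrative only, and cosine distance on the L2-normalized tf·idf rows stands in for the dot-product similarity:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.cluster.hierarchy import linkage, fcluster

    docs = ["cluster hypothesis retrieval",
            "probabilistic retrieval model",
            "image color texture",
            "image contour texture"]

    # tau(d): one "query" per term; tf*idf weights stand in for P(rel|q_k, d)
    tau = TfidfVectorizer().fit_transform(docs).toarray()

    # group-average (UPGMA) agglomeration on the similarity structure
    tree = linkage(tau, method="average", metric="cosine")
    print(fcluster(tree, t=2, criterion="maxclust"))   # e.g. [1 1 2 2]
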
Towards Optimum Clustering

Query set

Real collections have too few queries → use an artificial query set:
    collection clustering: the set of all possible one-term queries
        probability distribution over the query set: uniform, or
        proportional to document frequency
        document representation: original terms, or transformations
        of the term space
        semantic dimensions: focus on certain aspects only (e.g. for
        images: color, contour, texture)
    result clustering: the set of all query expansions

Towards Optimum Clustering

Probabilistic retrieval method

    Model: in principle, any retrieval model is suitable
    Transformation to probabilities: direct estimation, or
    transforming the retrieval score into a relevance probability
    (see the sketch below)

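One standard way to turn retrieval scores into relevance probabilities is logistic (Platt-style) calibration on judged query-document pairs. A sketch assuming scikit-learn; the scores and judgments below are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    scores = np.array([[12.3], [8.1], [5.0], [2.2], [0.7]])  # e.g. BM25 scores
    labels = np.array([1, 1, 1, 0, 0])                       # relevance judgments

    calib = LogisticRegression().fit(scores, labels)
    print(calib.predict_proba([[6.0]])[0, 1])                # P(rel | score = 6.0)
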
Towards Optimum Clustering

Document similarity metric

Fixed by the framework as \tau^T(d_l) \cdot \tau(d_m).

Towards Optimum Clustering

Fusion principles

The Optimum Clustering Framework (OCF) only gives guidelines for good
fusion principles: consider the metrics \pi and/or \rho during fusion.

Towards Optimum Clustering

Group-average clustering

    \sigma(C) = \frac{1}{c(c - 1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)

    → expected precision as the fusion criterion!
    starts with singleton clusters → minimum recall
    builds larger clusters to increase recall
    in each step, forms the cluster with the highest expected
    precision (which may be lower than that of the current clusters)

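A greedy sketch of this fusion principle, continuing with a tau dict of probability vectors as in the earlier sketches; in each step it merges the pair of clusters whose union has the highest \sigma:

    def sigma(cluster, tau):
        """Expected precision of a single cluster: average pairwise
        dot product over its ordered document pairs."""
        c = len(cluster)
        return sum(tau[dl] @ tau[dm]
                   for dl in cluster for dm in cluster if dl != dm) / (c * (c - 1))

    def group_average(doc_ids, tau, n_clusters):
        clusters = [[d] for d in doc_ids]        # singletons: minimum recall
        while len(clusters) > n_clusters:
            # merge the pair whose union scores highest on sigma
            i, j = max(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda ab: sigma(clusters[ab[0]] + clusters[ab[1]], tau))
            clusters[i] = clusters[i] + clusters.pop(j)
        return clusters
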
Towards Optimum Clustering

Fusion principles – min cut

    starts with a single cluster (maximum recall)
    searches for the cut with minimum loss in recall:

    \rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)

    consider expected precision for breaking ties!

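A sketch of the min-cut split, assuming networkx is installed and, as on the next slide, a cohesive (connected) similarity graph; edge weights are the \tau dot products, so the global minimum cut is the 2-way split that removes the least expected-recall mass:

    import networkx as nx

    def min_cut_split(doc_ids, tau):
        G = nx.Graph()
        for i, dl in enumerate(doc_ids):
            for dm in doc_ids[i + 1:]:
                G.add_edge(dl, dm, weight=float(tau[dl] @ tau[dm]))
        # Stoer-Wagner global minimum cut: the cheapest 2-way split
        cut_value, (part_a, part_b) = nx.stoer_wagner(G)
        return list(part_a), list(part_b)
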
Towards Optimum Clustering

Finding optimum clusterings

Min cut (assuming a cohesive similarity graph):
    starts with the optimum clustering for maximum recall
    min cut finds the split with minimum loss in recall
    consider precision for tie breaking
    → optimum clustering for two clusters
    O(n^3) (vs. O(2^n) for the general case)
    subsequent splits will not necessarily reach optima

Group average:
    in general, multiple fusion steps are needed to reach the first
    optimum
    the greedy strategy does not necessarily find this optimum!

Experiments

Experiments with a Query Set

ADI collection:
    35 queries
    70 documents (relevant to 2.4 queries on average)

Experiments:
    Q35opt  using the actual relevance judgments in \tau(d)
    Q35     BM25 estimates for the 35 queries
    1Tuni   one-term queries, uniform distribution
    1Tdf    one-term queries, weighted by document frequency

Experiments

[Figure: precision-recall curves for Q35opt, Q35, 1Tuni, and 1Tdf;
precision (0 to 2.5) on the y-axis, recall (0 to 1) on the x-axis]

Experiments

Using Keyphrases as Query Set
Compare clustering results based on different query sets:

    1  'bag of words': single words as queries
    2  keyphrases, automatically extracted as head-noun phrases;
       a single query = all keyphrases of one document

Test collections:
    4 test collections assembled from the RCV1 (Reuters) news corpus
    # documents: 600 vs. 6000
    # categories: 6 vs. 12
    frequency distribution of classes: [U]niform vs. [R]andom

Experiments

Using Keyphrases as Query Set – Results

[Figures: Average Precision and (External) F-measure]

Experiments

Evaluation of the Expected F-Measure

Correlation between the expected F-measure (an internal measure) and
the standard F-measure (comparison with a reference classification):
    test collections as before
    regard the quality of 40 different clustering methods for each
    setting
    (find the optimum clustering among these 40 methods)

Experiments

Correlation results

Pearson correlation between the internal measures and the external
F-measure
[Figure: table of correlation values]

Conclusion and Outlook

Summary

The Optimum Clustering Framework:
    makes the Cluster Hypothesis a requirement
    forms a theoretical basis for the development of better
    clustering methods
    is supported by positive experimental evidence

Conclusion and Outlook

Further Research

Theoretical:
    compatibility of existing clustering methods with the OCF
    extension of the OCF to soft clustering
    extension of the OCF to hierarchical clustering

Experimental:
    variation of query sets
    user experiments

Document Classification and ClusteringAnkur Shrivastava
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification Mahmoud Alfarra
 
Text categorization
Text categorizationText categorization
Text categorizationKU Leuven
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin RSelf-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin Rshanelynn
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail MarketingJonathan Sedar
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDataminingTools Inc
 
Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples Edureka!
 

Destacado (11)

Document Classification and Clustering
Document Classification and ClusteringDocument Classification and Clustering
Document Classification and Clustering
 
Text clustering
Text clusteringText clustering
Text clustering
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
 
Court Case Management System
Court Case Management SystemCourt Case Management System
Court Case Management System
 
Text categorization
Text categorizationText categorization
Text categorization
 
E courts project
E courts projectE courts project
E courts project
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin RSelf-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail Marketing
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples
 

Similar a The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.pptImXaib
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.pptRajeshT305412
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...marxliouville
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...Nesma
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewVahid Mirjalili
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.pptIndra Hermawan
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrievalNisha Arankandath
 

Similar a The Optimum Clustering Framework: Implementing the Cluster Hypothesis (20)

Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
 
cluster analysis
cluster analysiscluster analysis
cluster analysis
 
clustering.pptx
clustering.pptxclustering.pptx
clustering.pptx
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
Clustering ppt
Clustering pptClustering ppt
Clustering ppt
 
Inex07
Inex07Inex07
Inex07
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.ppt
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrieval
 
Cluster
ClusterCluster
Cluster
 

Más de yaevents

Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...yaevents
 
Тема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, ЯндексТема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, Яндексyaevents
 
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...yaevents
 
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндексi-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндексyaevents
 
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...yaevents
 
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...yaevents
 
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...yaevents
 
Мониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, ЯндексМониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, Яндексyaevents
 
Истории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, ЯндексИстории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, Яндексyaevents
 
Разработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, ShturmannРазработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, Shturmannyaevents
 
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...yaevents
 
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...yaevents
 
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, ЯндексСканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндексyaevents
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebookyaevents
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
Юнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, GoogleЮнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, Googleyaevents
 
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...yaevents
 
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...yaevents
 
В поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, НигмаВ поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, Нигмаyaevents
 
Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...yaevents
 

Más de yaevents (20)

Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
 
Тема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, ЯндексТема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, Яндекс
 
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
 
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндексi-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
 
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
 
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
 
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
 
Мониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, ЯндексМониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, Яндекс
 
Истории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, ЯндексИстории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, Яндекс
 
Разработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, ShturmannРазработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, Shturmann
 
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
 
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
 
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, ЯндексСканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
Юнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, GoogleЮнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, Google
 
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
 
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
 
В поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, НигмаВ поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, Нигма
 
Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...
 

Último

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Último (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

The Optimum Clustering Framework: Implementing the Cluster Hypothesis

  • 1. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis Norbert Fuhr University of Duisburg-Essen March 30, 2011
  • 2. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 2 Outline 1 Introduction 2 Cluster Metric 3 Optimum clustering 4 Towards Optimum Clustering 5 Experiments 6 Conclusion and Outlook
  • 3. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 3 Introduction 1 Introduction 2 Cluster Metric 3 Optimum clustering 4 Towards Optimum Clustering 5 Experiments 6 Conclusion and Outlook
  • 4. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 4 Introduction Motivation Ad-hoc Retrieval heuristic models: define retrieval function evaluate to test if it yields good quality Probability Ranking Principle (PRP) theoretic foundation for optimum retrieval numerous probabilistic models based on PRP Document clustering classic approach: define similarity function and fusion principle evaluate to test if they yield good quality Optimum Clustering Principle?
  • 5. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 4 Introduction Motivation Ad-hoc Retrieval heuristic models: define retrieval function evaluate to test if it yields good quality Probability Ranking Principle (PRP) theoretic foundation for optimum retrieval numerous probabilistic models based on PRP Document clustering classic approach: define similarity function and fusion principle evaluate to test if they yield good quality Optimum Clustering Principle?
  • 6. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 4 Introduction Motivation Ad-hoc Retrieval heuristic models: define retrieval function evaluate to test if it yields good quality Probability Ranking Principle (PRP) theoretic foundation for optimum retrieval numerous probabilistic models based on PRP Document clustering classic approach: define similarity function and fusion principle evaluate to test if they yield good quality Optimum Clustering Principle?
  • 7. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 5 Introduction Cluster Hypothesis Original Formulation ”closely associated documents tend to be relevant to the same requests” (Rijsbergen 1979) Idea of optimum clustering: Cluster documents in such a way, that for any request, the relevant documents occur together in one cluster redefine document similarity: documents are similar if they are relevant to the same queries
  • 8. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 5 Introduction Cluster Hypothesis Original Formulation ”closely associated documents tend to be relevant to the same requests” (Rijsbergen 1979) Idea of optimum clustering: Cluster documents in such a way, that for any request, the relevant documents occur together in one cluster redefine document similarity: documents are similar if they are relevant to the same queries
  • 9. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 5 Introduction Cluster Hypothesis Original Formulation ”closely associated documents tend to be relevant to the same requests” (Rijsbergen 1979) Idea of optimum clustering: Cluster documents in such a way, that for any request, the relevant documents occur together in one cluster redefine document similarity: documents are similar if they are relevant to the same queries
  • 10. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 6 Introduction The Optimum Clustering Framework
  • 11. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 6 Introduction The Optimum Clustering Framework
  • 12. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 6 Introduction The Optimum Clustering Framework
  • 13. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 6 Introduction The Optimum Clustering Framework
  • 14. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 7 Cluster Metric 1 Introduction 2 Cluster Metric 3 Optimum clustering 4 Towards Optimum Clustering 5 Experiments 6 Conclusion and Outlook
  • 15. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 8 Cluster Metric Defining a Metric based on the Cluster Hypothesis General idea: Evaluate clustering wrt. a set of queries For each query and each cluster, regard pairs of documents co-occurring: relevant-relevant: good relevant-irrelevant: bad irrelevant-irrelevant: don’t care
  • 16. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 17. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 18. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 19. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 20. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 21. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 10 Cluster Metric Pairwise precision – Example 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1 Query set: disjoint classification with two classes a and b, three clusters: (aab|bb|aa) Pp = 1 (3( 1 + 0) + 2(0 + 1) + 2(1 + 0)) = 5 . 7 3 7 Perfect clustering for a disjoint classification would yield Pp = 1 for arbitrary query sets, values > 1 are possible
  • 22. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 10 Cluster Metric Pairwise precision – Example 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1 Query set: disjoint classification with two classes a and b, three clusters: (aab|bb|aa) Pp = 1 (3( 1 + 0) + 2(0 + 1) + 2(1 + 0)) = 5 . 7 3 7 Perfect clustering for a disjoint classification would yield Pp = 1 for arbitrary query sets, values > 1 are possible
  • 23. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 11 Cluster Metric Pairwise recall rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) gk = |{d ∈ D|(qk , d) ∈ R}| (number of relevant documents for qk ) (micro recall) qk ∈Q Ci ∈C rik (rik − 1) Rp (D, Q, R, C) = qk ∈Q gk (gk − 1) gk >1 Example: (aab|bb|aa) 2 a pairs (out of 6) 1 b pair (out of 3) 2+1 1 Rp = 6+3 = 3.
  • 24. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 11 Cluster Metric Pairwise recall rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) gk = |{d ∈ D|(qk , d) ∈ R}| (number of relevant documents for qk ) (micro recall) qk ∈Q Ci ∈C rik (rik − 1) Rp (D, Q, R, C) = qk ∈Q gk (gk − 1) gk >1 Example: (aab|bb|aa) 2 a pairs (out of 6) 1 b pair (out of 3) 2+1 1 Rp = 6+3 = 3.
  • 25. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 12 Cluster Metric Perfect clustering C is a perfect clustering iff there exists no clustering C s.th. Pp (D, Q, R, C) < Pp (D, Q, R, C )∧ Rp (D, Q, R, C) < Rp (D, Q, R, C ) strong Pareto optimum – more than one perfect clustering possible Pp ({d1 , d2 , d3 }, {d4 , d5 }) = Pp ({d1 , d2 }, {d3 , d4 , d5 }) = 1, Example: Rp = 23 Pp ({d1 , d2 , d3 , d4 , d5 }) = 0.6, Rp = 1
  • 26. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 12 Cluster Metric Perfect clustering C is a perfect clustering iff there exists no clustering C s.th. Pp (D, Q, R, C) < Pp (D, Q, R, C )∧ Rp (D, Q, R, C) < Rp (D, Q, R, C ) strong Pareto optimum – more than one perfect clustering possible Pp ({d1 , d2 , d3 }, {d4 , d5 }) = Pp ({d1 , d2 }, {d3 , d4 , d5 }) = 1, Example: Rp = 23 Pp ({d1 , d2 , d3 , d4 , d5 }) = 0.6, Rp = 1
  • 27. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 12 Cluster Metric Perfect clustering C is a perfect clustering iff there exists no clustering C s.th. Pp (D, Q, R, C) < Pp (D, Q, R, C )∧ Rp (D, Q, R, C) < Rp (D, Q, R, C ) strong Pareto optimum – more than one perfect clustering possible Example: Pp ({d1 , d2 , d3 }, {d4 , d5 }) = Pp ({d1 , d2 }, {d3 , d4 , d5 }) = 1, Rp = 23 Pp ({d1 , d2 , d3 , d4 , d5 }) = 0.6, Rp = 1
  • 28. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 13 Cluster Metric Do perfect clusterings form a hierarchy?
  • 29. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 13 Cluster Metric Do perfect clusterings form a hierarchy? Pp 1 C 1 Rp C = {{d1 , d2 , d3 , d4 }}
  • 30. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 13 Cluster Metric Do perfect clusterings form a hierarchy? Pp 1 C’ C 1 Rp C = {{d1 , d2 , d3 , d4 }} C = {{d1 , d2 }, {d3 , d4 }}
  • 31. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 13 Cluster Metric Do perfect clusterings form a hierarchy? Pp 1 C’ C’’ C 1 Rp C = {{d1 , d2 , d3 , d4 }} C = {{d1 , d2 }, {d3 , d4 }} C = {{d1 , d2 , d3 }, {d4 }}
  • 32. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 14 Optimum clustering 1 Introduction 2 Cluster Metric 3 Optimum clustering 4 Towards Optimum Clustering 5 Experiments 6 Conclusion and Outlook
  • 33. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 15 Optimum clustering Optimum Clustering Usually, clustering process has no knowledge about relevance judgments switch from external to internal cluster measures replace relevance judgments by estimates of probability of relevance requires probabilistic retrieval method yielding P(rel|q, d) compute expected cluster quality
  • 34. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 15 Optimum clustering Optimum Clustering Usually, clustering process has no knowledge about relevance judgments switch from external to internal cluster measures replace relevance judgments by estimates of probability of relevance requires probabilistic retrieval method yielding P(rel|q, d) compute expected cluster quality
  • 35. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 15 Optimum clustering Optimum Clustering Usually, clustering process has no knowledge about relevance judgments switch from external to internal cluster measures replace relevance judgments by estimates of probability of relevance requires probabilistic retrieval method yielding P(rel|q, d) compute expected cluster quality
  • 36. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 15 Optimum clustering Optimum Clustering Usually, clustering process has no knowledge about relevance judgments switch from external to internal cluster measures replace relevance judgments by estimates of probability of relevance requires probabilistic retrieval method yielding P(rel|q, d) compute expected cluster quality
  • 37. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 15 Optimum clustering Optimum Clustering Usually, clustering process has no knowledge about relevance judgments switch from external to internal cluster measures replace relevance judgments by estimates of probability of relevance requires probabilistic retrieval method yielding P(rel|q, d) compute expected cluster quality
  • 38. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 16 Optimum clustering Expected cluster quality Pairwise precision: 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1 Expected precision: 1 ci π(D, Q, C) = ! P(rel|qk , dl )P(rel|qk , dm ) |D| ci (ci − 1) Ci ∈C qk ∈Q (dl ,dm )∈Ci ×Ci |Ci |>1 dl =dm
  • 39. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 16 Optimum clustering Expected cluster quality Pairwise precision: 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1 Expected precision: 1 ci π(D, Q, C) = ! P(rel|qk , dl )P(rel|qk , dm ) |D| ci (ci − 1) Ci ∈C qk ∈Q (dl ,dm )∈Ci ×Ci |Ci |>1 dl =dm
  • 40. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 17 Optimum clustering Expected precision 1 ci π(D, Q, C) = P(rel|qk , dl )P(rel|qk , dm ) |D| ci (ci − 1) Ci ∈C (dl ,dm )∈Ci ×Ci qk ∈Q |Ci |>1 dl =dm here qk ∈Q P(rel|qk , dl )P(rel|qk , dm ) gives the expected number of queries for which both dl and dm are relevant Transform a document into a vector of relevance probabilities: τ T (dm ) = (P(rel|q1 , dm ), P(rel|q2 , dm ), . . . , P(rel|q|Q| , dm )). 1 1 π(D, Q, C) = τ T (dl ) · τ (dm ) |D| Ci ∈C ci − 1 (dl ,dm )∈Ci ×Ci |Ci |>1 dl =dm
  • 41. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 17 Optimum clustering Expected precision 1 ci π(D, Q, C) = P(rel|qk , dl )P(rel|qk , dm ) |D| ci (ci − 1) Ci ∈C (dl ,dm )∈Ci ×Ci qk ∈Q |Ci |>1 dl =dm here qk ∈Q P(rel|qk , dl )P(rel|qk , dm ) gives the expected number of queries for which both dl and dm are relevant Transform a document into a vector of relevance probabilities: τ T (dm ) = (P(rel|q1 , dm ), P(rel|q2 , dm ), . . . , P(rel|q|Q| , dm )). 1 1 π(D, Q, C) = τ T (dl ) · τ (dm ) |D| Ci ∈C ci − 1 (dl ,dm )∈Ci ×Ci |Ci |>1 dl =dm
  • 42. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 18 Optimum clustering Expected recall qk ∈Q Ci ∈C rik (rik − 1) Rp (D, Q, R, C) = qk ∈Q gk (gk − 1) gk >1 Direct estimation requires estimation of denominator → biased estimates But: denominator is constant for a given query set → ignore compute an estimate for the numerator only: ρ(D, Q, C) = τ T (dl ) · τ (dm ) Ci ∈C (dl ,dm )∈Ci ×Ci dl =dm (Scalar product τ T (dl ) · τ (dm ) gives the expected number of queries for which both dl and dm are relevant)
  • 43. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 18 Optimum clustering Expected recall qk ∈Q Ci ∈C rik (rik − 1) Rp (D, Q, R, C) = qk ∈Q gk (gk − 1) gk >1 Direct estimation requires estimation of denominator → biased estimates But: denominator is constant for a given query set → ignore compute an estimate for the numerator only: ρ(D, Q, C) = τ T (dl ) · τ (dm ) Ci ∈C (dl ,dm )∈Ci ×Ci dl =dm (Scalar product τ T (dl ) · τ (dm ) gives the expected number of queries for which both dl and dm are relevant)
  • 44. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 18 Optimum clustering Expected recall qk ∈Q Ci ∈C rik (rik − 1) Rp (D, Q, R, C) = qk ∈Q gk (gk − 1) gk >1 Direct estimation requires estimation of denominator → biased estimates But: denominator is constant for a given query set → ignore compute an estimate for the numerator only: ρ(D, Q, C) = τ T (dl ) · τ (dm ) Ci ∈C (dl ,dm )∈Ci ×Ci dl =dm (Scalar product τ T (dl ) · τ (dm ) gives the expected number of queries for which both dl and dm are relevant)
  • 45. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 18 Optimum clustering Expected recall qk ∈Q Ci ∈C rik (rik − 1) Rp (D, Q, R, C) = qk ∈Q gk (gk − 1) gk >1 Direct estimation requires estimation of denominator → biased estimates But: denominator is constant for a given query set → ignore compute an estimate for the numerator only: ρ(D, Q, C) = τ T (dl ) · τ (dm ) Ci ∈C (dl ,dm )∈Ci ×Ci dl =dm (Scalar product τ T (dl ) · τ (dm ) gives the expected number of queries for which both dl and dm are relevant)
  • 46. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 19 Optimum clustering Optimum clustering C is an optimum clustering iff there exists no clustering C s.th. π(D, Q, C) < π(D, Q, C ) ∧ ρ(D, Q, C) < ρ(D, Q, C ) Pareto optima Set of perfect (and optimum) clusterings not even forms a cluster hierarchy no hierarchic clustering method will find all optima!
  • 47. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 19 Optimum clustering Optimum clustering C is an optimum clustering iff there exists no clustering C s.th. π(D, Q, C) < π(D, Q, C ) ∧ ρ(D, Q, C) < ρ(D, Q, C ) Pareto optima Set of perfect (and optimum) clusterings not even forms a cluster hierarchy no hierarchic clustering method will find all optima!
  • 48. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 19 Optimum clustering Optimum clustering C is an optimum clustering iff there exists no clustering C s.th. π(D, Q, C) < π(D, Q, C ) ∧ ρ(D, Q, C) < ρ(D, Q, C ) Pareto optima Set of perfect (and optimum) clusterings not even forms a cluster hierarchy no hierarchic clustering method will find all optima!
  • 49. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 19 Optimum clustering Optimum clustering C is an optimum clustering iff there exists no clustering C s.th. π(D, Q, C) < π(D, Q, C ) ∧ ρ(D, Q, C) < ρ(D, Q, C ) Pareto optima Set of perfect (and optimum) clusterings not even forms a cluster hierarchy no hierarchic clustering method will find all optima!
  • 50. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 20 Towards Optimum Clustering 1 Introduction 2 Cluster Metric 3 Optimum clustering 4 Towards Optimum Clustering 5 Experiments 6 Conclusion and Outlook
  • 51. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 21 Towards Optimum Clustering Towards Optimum Clustering Development of an (optimum) clustering method 1 Set of queries, 2 Probabilistic retrieval method, 3 Document similarity metric, and 4 Fusion principle.
  • 52. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 53. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 54. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 55. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 56. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 57. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 58. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 22 Towards Optimum Clustering A simple application 1 Set of queries: all possible one-term queries 2 Probabilistic retrieval method: tf ∗ idf 3 Document similarity metric: τ T (dl ) · τ (dm ) 4 Fusion principle: group average clustering 1 π(D, Q, C) = τ T (dl ) · τ (dm ) c(c − 1) (dl ,dm )∈C×C dl =dm standard clustering method
  • 59. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 23 Towards Optimum Clustering Query set Too few queries in real collections → artificial query set collection clustering: set of all possible one-term queries Probability distribution over the query set: uniform / proportional to doc. freq. Document representation: original terms / transformations of the term space Semantic dimensions: focus on certain aspects only (e.g. images: color, contour, texture) result clustering: set of all query expansions
• 60. Towards Optimum Clustering: Probabilistic retrieval method. In principle, any retrieval model is suitable; the required relevance probabilities are obtained either by direct estimation or by transforming the retrieval score into such a probability. (A sketch of one such transformation follows below.)
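One common way to turn raw scores into pseudo-probabilities is a logistic calibration; this is an assumption for illustration, not the transformation prescribed by the framework, and the parameters a, b would ideally be fitted on held-out relevance judgments:

    import numpy as np

    def scores_to_probabilities(scores, a=1.0, b=0.0):
        # Map retrieval scores (e.g. BM25) into (0, 1) via a logistic function.
        return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores) + b)))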
• 61. Towards Optimum Clustering: Document similarity metric. Fixed as $\tau^T(d_l) \cdot \tau(d_m)$.
• 62. Towards Optimum Clustering: Fusion principles. The OCF only gives guidelines for good fusion principles: consider the metrics $\pi$ and/or $\rho$ during fusion.
• 63. Towards Optimum Clustering: Group-average clustering:
$$\sigma(C) = \frac{1}{c(c-1)} \sum_{\substack{(d_l, d_m) \in C \times C \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
i.e., expected precision as the criterion. It starts with singleton clusters (minimum recall) and builds larger clusters for increasing recall; in each step it forms the cluster with the highest precision (which may be lower than that of the current clusters). (A sketch follows below.)
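A greedy agglomerative sketch of this fusion principle, reusing the tau-representations from above; as the later slide on finding optimum clusterings notes, this greedy strategy does not necessarily find the optimum:

    import numpy as np
    from itertools import combinations

    def group_average_clustering(taus, n_clusters):
        # taus: list of tau-vectors; merge until n_clusters remain.
        clusters = [[i] for i in range(len(taus))]  # singletons: minimum recall

        def sigma(members):
            C = np.array([taus[i] for i in members])
            S = C @ C.T
            c = len(members)
            return (S.sum() - np.trace(S)) / (c * (c - 1))

        while len(clusters) > n_clusters:
            # merge the pair whose union has the highest expected precision
            i, j = max(combinations(range(len(clusters)), 2),
                       key=lambda ij: sigma(clusters[ij[0]] + clusters[ij[1]]))
            clusters[i].extend(clusters[j])
            del clusters[j]
        return clusters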
• 67. Towards Optimum Clustering: Fusion principles - min cut. Starts with a single cluster (maximum recall) and searches for the cut with the minimum loss in recall,
$$\rho(D, Q, \mathcal{C}) = \sum_{C_i \in \mathcal{C}} \; \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
considering expected precision for breaking ties. (A sketch follows below.)
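A sketch of one split step using the Stoer-Wagner global minimum cut (O(n^3), matching the bound on the next slide); it assumes a connected ("cohesive") similarity graph with positive edge weights, and the function name is illustrative:

    import networkx as nx

    def min_cut_split(taus, doc_ids):
        # Build the similarity graph with edge weights tau^T(d_l) * tau(d_m).
        G = nx.Graph()
        for i, l in enumerate(doc_ids):
            for m in doc_ids[i + 1:]:
                w = float(taus[l] @ taus[m])
                if w > 0:
                    G.add_edge(l, m, weight=w)
        # The minimum-weight cut removes the least within-cluster similarity
        # mass, i.e. it loses the least recall.
        cut_value, (part_a, part_b) = nx.stoer_wagner(G)
        return list(part_a), list(part_b)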
• 70. Towards Optimum Clustering: Finding optimum clusterings. Min cut (assuming a cohesive similarity graph): it starts with the optimum clustering for maximum recall; the min cut finds the split with the minimum loss in recall (with precision considered for tie breaking), yielding the optimum clustering for two clusters in O(n^3) (vs. O(2^n) for the general case); subsequent splits, however, will not necessarily reach optima. Group average: in general, multiple fusion steps are needed to reach the first optimum, and the greedy strategy does not necessarily find this optimum!
• 79. Experiments: Outline. 1 Introduction; 2 Cluster Metric; 3 Optimum clustering; 4 Towards Optimum Clustering; 5 Experiments; 6 Conclusion and Outlook.
• 80. Experiments: Experiments with a Query Set. ADI collection: 35 queries, 70 documents (relevant to 2.4 queries on average). Runs: Q35opt uses the actual relevance judgments in $\tau(d)$; Q35 uses BM25 estimates for the 35 queries; 1Tuni uses one-term queries with a uniform distribution; 1Tdf uses one-term queries weighted according to document frequency.
• 81. Experiments: [Figure: recall-precision curves comparing Q35opt, Q35, 1Tuni, and 1Tdf; precision plotted against recall from 0 to 1.]
• 82. Experiments: Using Keyphrases as Query Set. Compare clustering results based on different query sets: (1) 'bag-of-words': single words as queries; (2) keyphrases automatically extracted as head-noun phrases, with a single query = all keyphrases of a document. Test collections: 4 test collections assembled from the RCV1 (Reuters) news corpus, varying the number of documents (600 vs. 6000), the number of categories (6 vs. 12), and the frequency distribution of classes ([U]niform vs. [R]andom).
• 84. Experiments: Using Keyphrases as Query Set - Results. [Charts: average precision and (external) F-measure for the four RCV1 collections.]
• 85. Experiments: Evaluation of the Expected F-Measure. Correlation between the expected F-measure (an internal measure) and the standard F-measure (comparison with a reference classification); test collections as before; the quality of 40 different clustering methods is regarded for each setting (the task being to find the optimum clustering among these 40 methods).
• 86. Experiments: Correlation results. [Table: Pearson correlation between internal measures and the external F-measure.] (A sketch of this evaluation follows below.)
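A sketch of the evaluation, assuming the expected F-measure combines the expected precision and recall metrics as their harmonic mean (mirroring the standard F-measure; the slides do not spell out the combination):

    from scipy.stats import pearsonr

    def expected_f_measure(pi, rho):
        # Internal F: harmonic mean of expected precision and recall (assumption).
        return 2 * pi * rho / (pi + rho) if (pi + rho) > 0 else 0.0

    def internal_external_correlation(expected_f, external_f):
        # Pearson correlation across clusterings (40 methods per setting).
        r, _p = pearsonr(expected_f, external_f)
        return r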
• 87. Conclusion and Outlook: Outline. 1 Introduction; 2 Cluster Metric; 3 Optimum clustering; 4 Towards Optimum Clustering; 5 Experiments; 6 Conclusion and Outlook.
• 88. Conclusion and Outlook: Summary. The Optimum Clustering Framework makes the Cluster Hypothesis a requirement, forms a theoretical basis for the development of better clustering methods, and yields positive experimental evidence.
• 89. Conclusion and Outlook: Further research. Theoretical compatibility of existing clustering methods with the OCF; extension of the OCF to soft clustering; extension of the OCF to hierarchical clustering; experimental variation of query sets; user experiments.