The Optimum Clustering Framework: Implementing the Cluster Hypothesis
1. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis
Norbert Fuhr
University of Duisburg-Essen
March 30, 2011
2. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 2
Outline
1 Introduction
2 Cluster Metric
3 Optimum clustering
4 Towards Optimum Clustering
5 Experiments
6 Conclusion and Outlook
4. Introduction
Motivation
Ad-hoc Retrieval
heuristic models:
define retrieval function
evaluate to test if it yields good quality
Probability Ranking Principle (PRP)
theoretic foundation for optimum retrieval
numerous probabilistic models based on PRP
Document clustering
classic approach:
define similarity function and fusion principle
evaluate to test if they yield good quality
Optimum Clustering Principle?
7. Introduction
Cluster Hypothesis
Original Formulation
“closely associated documents tend to be relevant to the same
requests” (van Rijsbergen 1979)
Idea of optimum clustering:
Cluster documents in such a way that, for any request, the
relevant documents occur together in one cluster
redefine document similarity:
documents are similar if they are relevant to the same queries
10. Introduction
The Optimum Clustering Framework
15. Cluster Metric
Defining a Metric based on the Cluster Hypothesis
General idea:
Evaluate clustering wrt. a set of queries
For each query and each cluster, regard pairs of
documents co-occurring:
relevant-relevant: good
relevant-irrelevant: bad
irrelevant-irrelevant: don’t care
16. Cluster Metric
Pairwise precision
Q set of queries
D document collection
R relevance judgments: R ⊂ Q × D
C clustering, C = {C_1, ..., C_n} s.th. \bigcup_{i=1}^{n} C_i = D and ∀i, j: i ≠ j → C_i ∩ C_j = ∅
c_i = |C_i| (size of cluster C_i)
r_{ik} = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i w.r.t. q_k)
Pairwise precision (weighted average over all clusters):
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}
21. Cluster Metric
Pairwise precision – Example
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}
Query set: disjoint classification with two classes a and b,
three clusters: (aab|bb|aa)
P_p = \frac{1}{7}\left(3(\frac{1}{3} + 0) + 2(0 + 1) + 2(1 + 0)\right) = \frac{5}{7}
A perfect clustering for a disjoint classification would yield P_p = 1
For arbitrary query sets, values > 1 are possible
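As a sanity check, the metric can be computed directly from the slide's toy clustering. The following is a minimal Python sketch (the function name and the class-letter encoding of documents are illustrative, not from the talk); exact fractions avoid floating-point noise:

```python
from fractions import Fraction

def pairwise_precision(clusters, classes):
    """P_p: for each cluster with c_i > 1 and each class (query),
    weight r_ik(r_ik - 1) / (c_i(c_i - 1)) by c_i, then divide by |D|."""
    n = sum(len(c) for c in clusters)
    total = Fraction(0)
    for cluster in clusters:
        ci = len(cluster)
        if ci < 2:
            continue  # clusters of size 1 contribute nothing
        for label in classes:
            rik = sum(1 for d in cluster if d == label)
            total += ci * Fraction(rik * (rik - 1), ci * (ci - 1))
    return total / n

# the slide's example: clusters (aab|bb|aa) over classes a and b
print(pairwise_precision([list("aab"), list("bb"), list("aa")], "ab"))  # 5/7
```

The result matches the hand computation above: 1/7 · (1 + 2 + 2) = 5/7.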
23. Cluster Metric
Pairwise recall
r_{ik} = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i w.r.t. q_k)
g_k = |{d ∈ D | (q_k, d) ∈ R}| (number of relevant documents for q_k)
Pairwise recall (micro recall):
R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik}-1)}{\sum_{q_k \in Q,\, g_k > 1} g_k(g_k-1)}
Example: (aab|bb|aa)
2 a-pairs (out of 6)
1 b-pair (out of 3)
R_p = \frac{2+1}{6+3} = \frac{1}{3}
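The recall side of the example can be checked the same way (a minimal sketch with illustrative names; documents are again encoded by their class letter):

```python
from fractions import Fraction

def pairwise_recall(clusters, classes):
    """Micro pairwise recall R_p: same-class pairs kept within a cluster,
    divided by all same-class pairs in the whole collection."""
    docs = [d for c in clusters for d in c]
    num = 0
    for label in classes:
        for c in clusters:
            r = sum(1 for d in c if d == label)   # relevant docs in this cluster
            num += r * (r - 1)
    # denominator: all same-class (ordered) pairs, classes with g_k > 1 only
    den = sum(g * (g - 1) for g in (docs.count(label) for label in classes) if g > 1)
    return Fraction(num, den)

print(pairwise_recall([list("aab"), list("bb"), list("aa")], "ab"))  # 1/3
```

Ordered pairs are counted in both numerator and denominator, so the ratio equals the unordered-pair count of the slide: (2+1)/(6+3) = 1/3.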
25. Cluster Metric
Perfect clustering
C is a perfect clustering iff there exists no clustering C′ s.th.
P_p(D, Q, R, C) < P_p(D, Q, R, C′) ∧ R_p(D, Q, R, C) < R_p(D, Q, R, C′)
strong Pareto optimum – more than one perfect clustering possible
Example:
P_p({d_1, d_2, d_3}, {d_4, d_5}) = P_p({d_1, d_2}, {d_3, d_4, d_5}) = 1, R_p = 2/3
P_p({d_1, d_2, d_3, d_4, d_5}) = 0.6, R_p = 1
28. Cluster Metric
Do perfect clusterings form a hierarchy?
[Figure: P_p–R_p plot showing three perfect clusterings C, C′, C″]
C = {{d_1, d_2, d_3, d_4}}
C′ = {{d_1, d_2}, {d_3, d_4}}
C″ = {{d_1, d_2, d_3}, {d_4}}
33. Optimum clustering
Optimum Clustering
Usually, the clustering process has no knowledge of the relevance judgments
→ switch from external to internal cluster measures
replace relevance judgments by estimates of the probability of relevance
requires a probabilistic retrieval method yielding P(rel|q, d)
compute the expected cluster quality
38. Optimum clustering
Expected cluster quality
Pairwise precision:
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}
Expected precision:
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i(c_i-1)} \sum_{q_k \in Q} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} P(rel|q_k, d_l)\, P(rel|q_k, d_m)
40. Optimum clustering
Expected precision
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i(c_i-1)} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \sum_{q_k \in Q} P(rel|q_k, d_l)\, P(rel|q_k, d_m)
Here \sum_{q_k \in Q} P(rel|q_k, d_l) P(rel|q_k, d_m) gives the expected number of queries for which both d_l and d_m are relevant.
Transform a document into a vector of relevance probabilities:
\tau^T(d_m) = (P(rel|q_1, d_m), P(rel|q_2, d_m), \ldots, P(rel|q_{|Q|}, d_m))
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{1}{c_i - 1} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
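The vector formulation invites a direct implementation via dot products. A sketch, assuming documents are keyed by id and τ is given as a dict of relevance-probability vectors (all names and the toy probabilities are hypothetical):

```python
import numpy as np

def expected_precision(clusters, tau):
    """pi(D,Q,C) = 1/|D| * sum over clusters with |C_i| > 1 of
    1/(c_i - 1) * sum over ordered pairs d_l != d_m of tau(d_l).tau(d_m)."""
    n = sum(len(c) for c in clusters)
    pi = 0.0
    for cluster in clusters:
        ci = len(cluster)
        if ci < 2:
            continue
        T = np.array([tau[d] for d in cluster])
        G = T @ T.T                                # pairwise dot products
        pi += (G.sum() - np.trace(G)) / (ci - 1)   # drop the d_l = d_m diagonal
    return pi / n

# hypothetical relevance-probability vectors over two queries
tau = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
print(expected_precision([[0, 1], [2, 3]], tau))  # 1.0
```

With each cluster containing exactly the documents relevant to one query, the expected precision reaches its maximum for this query set.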
42. Optimum clustering
Expected recall
R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik}-1)}{\sum_{q_k \in Q,\, g_k > 1} g_k(g_k-1)}
Direct estimation would require estimating the denominator → biased estimates
But: the denominator is constant for a given query set → it can be ignored
Compute an estimate for the numerator only:
\rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
(The scalar product \tau^T(d_l) \cdot \tau(d_m) gives the expected number of queries for which both d_l and d_m are relevant.)
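ρ can be sketched the same way. The toy example below (hypothetical names and probabilities) shows the intended behavior: keeping co-relevant documents together maximizes the unnormalized expected recall, while splitting them drops it to zero:

```python
import numpy as np

def expected_recall_numerator(clusters, tau):
    """rho(D,Q,C): sum of tau(d_l).tau(d_m) over all ordered
    within-cluster pairs with d_l != d_m (denominator ignored)."""
    rho = 0.0
    for cluster in clusters:
        T = np.array([tau[d] for d in cluster])
        G = T @ T.T
        rho += G.sum() - np.trace(G)   # exclude the d_l = d_m diagonal
    return rho

tau = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
print(expected_recall_numerator([[0, 1], [2, 3]], tau))  # 4.0
print(expected_recall_numerator([[0, 2], [1, 3]], tau))  # 0.0
```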
46. Optimum clustering
Optimum clustering
C is an optimum clustering iff there exists no clustering C′ s.th.
\pi(D, Q, C) < \pi(D, Q, C′) ∧ \rho(D, Q, C) < \rho(D, Q, C′)
Pareto optima
The set of perfect (and optimum) clusterings does not even form a cluster hierarchy
→ no hierarchic clustering method will find all optima!
51. Towards Optimum Clustering
Development of an (optimum) clustering method requires:
1 Set of queries,
2 Probabilistic retrieval method,
3 Document similarity metric, and
4 Fusion principle.
52. Towards Optimum Clustering
A simple application
1 Set of queries: all possible one-term queries
2 Probabilistic retrieval method: tf ∗ idf
3 Document similarity metric: \tau^T(d_l) \cdot \tau(d_m)
4 Fusion principle: group average clustering
\pi(D, Q, C) = \frac{1}{c(c-1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
→ a standard clustering method
59. Towards Optimum Clustering
Query set
Too few queries in real collections → artificial query set
collection clustering: set of all possible one-term queries
Probability distribution over the query set: uniform /
proportional to doc. freq.
Document representation: original terms / transformations
of the term space
Semantic dimensions: focus on certain aspects only (e.g.
images: color, contour, texture)
result clustering: set of all query expansions
60. Towards Optimum Clustering
Probabilistic retrieval method
Model: in principle, any retrieval model is suitable
Transformation to probabilities: direct estimation / transforming the retrieval score into a probability
61. Towards Optimum Clustering
Document similarity metric
fixed as \tau^T(d_l) \cdot \tau(d_m)
62. Towards Optimum Clustering
Fusion principles
OCF only gives guidelines for good fusion principles:
consider metrics π and/or ρ during fusion
63. Towards Optimum Clustering
Group average clustering:
\sigma(C) = \frac{1}{c(c-1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
expected precision as criterion!
starts with singleton clusters → minimum recall
builds larger clusters for increasing recall
forms the cluster with the highest precision
(which may be lower than that of the current clusters)
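The group-average fusion step described above can be sketched as a small greedy agglomeration over the τ vectors (an illustrative sketch with hypothetical names, not the evaluated implementation; σ is recomputed naively at each step):

```python
import numpy as np

def sigma(cluster, tau):
    """Group-average similarity: mean tau-dot-product over ordered pairs."""
    T = np.array([tau[d] for d in cluster])
    G = T @ T.T
    c = len(cluster)
    return (G.sum() - np.trace(G)) / (c * (c - 1))

def group_average_clustering(tau, n_clusters):
    """Start from singletons; repeatedly merge the pair of clusters
    whose union has the highest group-average similarity sigma."""
    clusters = [[d] for d in tau]
    while len(clusters) > n_clusters:
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: sigma(clusters[ab[0]] + clusters[ab[1]], tau))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

tau = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
print(sorted(sorted(c) for c in group_average_clustering(tau, 2)))  # [[0, 1], [2, 3]]
```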
67. Towards Optimum Clustering
Fusion principles – min cut
starts with a single cluster (maximum recall)
searches for the cut with minimum loss in recall
\rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
consider expected precision for breaking ties!
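For very small collections, the min-cut split can be illustrated by brute force over all bipartitions (a hypothetical sketch for intuition only; real min-cut algorithms reach the O(n^3) bound mentioned later without enumerating the 2^n bipartitions):

```python
from itertools import combinations
import numpy as np

def min_cut_split(tau):
    """Find the bipartition losing the least expected recall, i.e. the one
    minimizing the cross-cluster sum of tau(d_l).tau(d_m) (the 'cut')."""
    docs = list(tau)
    T = np.array([tau[d] for d in docs])
    G = T @ T.T
    best, best_loss = None, float("inf")
    for k in range(1, len(docs) // 2 + 1):
        for side in combinations(range(len(docs)), k):
            other = [i for i in range(len(docs)) if i not in side]
            loss = G[np.ix_(list(side), other)].sum()   # recall lost by this cut
            if loss < best_loss:
                best, best_loss = ([docs[i] for i in side],
                                   [docs[i] for i in other]), loss
    return best

tau = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
print(min_cut_split(tau))  # ([0, 1], [2, 3])
```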
70. Towards Optimum Clustering
Finding optimum clusterings
Min cut
(assuming a cohesive similarity graph)
starts with the optimum clustering for maximum recall
min cut finds the split with minimum loss in recall
consider precision for tie-breaking
→ optimum clustering for two clusters
O(n^3) (vs. O(2^n) for the general case)
subsequent splits will not necessarily reach optima
Group average
in general, multiple fusion steps are needed to reach the first optimum
the greedy strategy does not necessarily find this optimum!
80. Experiments
Experiments with a Query Set
ADI collection:
35 queries
70 documents (relevant to 2.4 queries on avg.)
Experiments:
Q35opt using the actual relevance in τ (d)
Q35 BM25 estimates for the 35 queries
1Tuni 1-term queries, uniform distribution
1Tdf 1-term queries, according to document frequency
82. Experiments
Using Keyphrases as Query Set
Compare clustering results based on different query sets:
1 ‘bag-of-words’: single words as queries
2 keyphrases automatically extracted as head-noun phrases; single query = all keyphrases of a document
Test collections:
4 test collections assembled from the RCV1 (Reuters) news corpus
# documents: 600 vs. 6000
# categories: 6 vs. 12
Frequency distribution of classes: [U]niform vs. [R]andom
84. Experiments
Using Keyphrases as Query Set – Results
[Figure: Average Precision and (External) F-measure for the different query sets]
85. Experiments
Evaluation of the Expected F-Measure
Correlation between the expected F-measure (internal measure) and the standard F-measure (comparison with a reference classification)
test collections as before
regard the quality of 40 different clustering methods for each setting
(find the optimum clustering among these 40 methods)
86. Experiments
Correlation results
Pearson correlation between internal measures and the external F-measure
88. Conclusion and Outlook
Summary
The Optimum Clustering Framework
makes the Cluster Hypothesis a requirement
forms a theoretical basis for the development of better clustering methods
yields positive experimental evidence
89. Conclusion and Outlook
Further Research
theoretical:
compatibility of existing clustering methods with the OCF
extension of the OCF to soft clustering
extension of the OCF to hierarchical clustering
experimental:
variation of query sets
user experiments