The document describes the k-means++ seeding algorithm for initializing k-means clustering. It presents the k-means++ algorithm, provides an implementation in MLDemos, and evaluates it on test and real datasets. The results show k-means++ yields a significant reduction in clustering error compared to random initialization, providing better separation of clusters. However, the document also notes there are many seeding techniques and some may work better than k-means++ for certain datasets.
K-Means++ Seeding Algorithm Implementation
1. K-means++ Seeding Algorithm: Implementation in MLDemos
Renaud Richardet
Brain Mind Institute
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch
2. K-means
• K-means: widely used clustering technique
• Initialization: blind random selection among the input data
• Drawback: very sensitive to the choice of initial cluster centers (seeds)
• A local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering
3. K-means++
• A seeding technique for k-means, from Arthur and Vassilvitskii [2007]
• Idea: spread the k initial cluster centers away from each other
• The expected clustering cost is O(log k)-competitive with the optimal clustering
• Substantial convergence-time speedups (empirical)
4. Algorithm
Notation:
• c ∈ C: cluster center
• x ∈ X: data point
• D(x): distance between x and the nearest cluster center c that has already been chosen
The seeding proceeds as follows (Arthur and Vassilvitskii [2007]):
• Choose the first center uniformly at random from X.
• Choose the next center to be x ∈ X with probability D(x)² / Σ D(x′)², the sum taken over all x′ ∈ X.
• Repeat the previous step until k centers have been chosen, then run standard k-means with these seeds.
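To make the sampling step concrete, the following is a minimal, self-contained Java sketch of the seeding procedure (an illustration under the definitions above, not the actual MLDemos or Commons Math code; all identifiers are invented for this example):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KppSeeding {

    // Squared Euclidean distance between two points.
    static double distSq(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // k-means++ seeding: returns k initial centers chosen from data.
    static List<double[]> seed(double[][] data, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(data[rng.nextInt(data.length)]); // step 1: uniform pick

        // minDist[i] = D(x_i)^2, squared distance to the nearest chosen center.
        double[] minDist = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            minDist[i] = distSq(data[i], centers.get(0));
        }

        while (centers.size() < k) {
            // Step 2: sample the next center with probability D(x)^2 / sum D(x')^2.
            double distSqSum = 0.0;
            for (double d : minDist) {
                distSqSum += d;
            }
            double r = rng.nextDouble() * distSqSum; // the "random index" of the log
            int next = 0;
            double acc = minDist[0];
            while (acc < r && next < data.length - 1) {
                next++;
                acc += minDist[next];
            }
            double[] c = data[next];
            centers.add(c);

            // Step 3: refresh each point's distance to its nearest chosen center.
            for (int i = 0; i < data.length; i++) {
                minDist[i] = Math.min(minDist[i], distSq(data[i], c));
            }
        }
        return centers;
    }
}

Note that a point already chosen as a center has D(x) = 0, so it contributes nothing to the running total and can never be drawn again.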
5. Implementation
• Based on Apache Commons Math’s KMeansPlusPlusClusterer and Arthur’s [2007] implementation
• Implemented directly in MLDemos’ core
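For reference, the Commons Math class the implementation is based on can also be used on its own, roughly as follows (a usage sketch assuming a recent commons-math3 release, where the class lives in org.apache.commons.math3.ml.clustering; the sample points are arbitrary):

import java.util.Arrays;
import java.util.List;
import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

public class KppDemo {
    public static void main(String[] args) {
        // A few 2-D points (values arbitrary, for illustration only).
        List<DoublePoint> points = Arrays.asList(
                new DoublePoint(new double[] { -2.0,  2.0 }),
                new DoublePoint(new double[] { -1.0, -1.0 }),
                new DoublePoint(new double[] {  2.0,  1.0 }),
                new DoublePoint(new double[] {  2.0, -2.0 }));
        // k = 2 clusters, at most 100 iterations of Lloyd's algorithm.
        KMeansPlusPlusClusterer<DoublePoint> clusterer =
                new KMeansPlusPlusClusterer<>(2, 100);
        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);
        for (CentroidCluster<DoublePoint> cluster : clusters) {
            System.out.println(cluster.getCenter() + " <- " + cluster.getPoints());
        }
    }
}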
8. Sample Output
1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
1: initial minDist for 0 [-1.0;-1.0] = 10.0
1: initial minDist for 1 [ 2.0; 1.0] = 17.0
1: initial minDist for 2 [ 1.0;-1.0] = 18.0
1: initial minDist for 3 [-1.0;-2.0] = 17.0
1: initial minDist for 5 [ 2.0; 2.0] = 16.0
1: initial minDist for 6 [ 2.0;-2.0] = 32.0
1: initial minDist for 7 [-1.0; 2.0] = 1.0
1: initial minDist for 8 [-2.0;-2.0] = 16.0
1: initial minDist for 9 [ 1.0; 1.0] = 10.0
1: initial minDist for 10 [ 2.0;-1.0] = 25.0
1: initial minDist for 11 [-2.0;-1.0] = 9.0
[…]
2: picking cluster center 1 --------------
3: distSqSum=3345.0
3: random index 1532.706909
4: new cluster point: x=6 [2.0;-2.0]
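In the log, step 3 performs the weighted draw: distSqSum is presumably the sum of the current minDist values over all points (only the first twelve points are shown above; the elided ones account for the rest of the 3345.0), and “random index” is a uniform draw r in [0, distSqSum). The point at which the running total of minDist values first exceeds r becomes the new center, which realizes the D(x)² weighting.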
9. Sample Output (2)
4: updating minDist for 0 [-1.0;-1.0] = 10.0
4: updating minDist for 1 [ 2.0; 1.0] = 9.0
4: updating minDist for 2 [ 1.0;-1.0] = 2.0
4: updating minDist for 3 [-1.0;-2.0] = 9.0
4: updating minDist for 5 [ 2.0; 2.0] = 16.0
4: updating minDist for 7 [-1.0; 2.0] = 25.0
4: updating minDist for 8 [-2.0;-2.0] = 16.0
4: updating minDist for 9 [ 1.0; 1.0] = 10.0
4: updating minDist for 10 [ 2.0;-1.0] = 1.0
4: updating minDist for 11 [-2.0;-1.0] = 17.0
[…]
2: picking cluster center 2 -----------------
3: distSqSum=961.0
3: random index 103.404701
4: new cluster point: x=1 [2.0;1.0]
4: updating minDist for 0 [-1.0;-1.0] = 13.0
[…]
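Each “updating minDist” line appears to print the squared distance from a point to the most recently added center; the stored minDist keeps the minimum over all chosen centers. For example, point 1 at [ 2.0; 1.0] lies at squared distance (2−2)² + (1−(−2))² = 9 from the new center [2.0;−2.0], down from its initial value of 17, and the last line checks out as (−1−2)² + (−1−1)² = 13 against the third center [2.0; 1.0].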
10. Evaluation on Test Dataset
• 200 clustering runs, each with and without k-means++ initialization
• Measured the RSS (residual sum of squares, i.e., intra-class variance)
• K-means: reached the optimal clustering 115 times (57.5%)
• K-means++: reached the optimal clustering 182 times (91%)
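RSS, the quantity being compared, is the k-means objective itself: the sum of squared distances from each point to the center of the cluster it is assigned to. A minimal sketch of the computation (identifiers invented for illustration):

public class Rss {
    // Residual sum of squares: total squared distance from each point
    // to the center of the cluster it is assigned to.
    static double rss(double[][] data, int[] assignment, double[][] centers) {
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            double[] center = centers[assignment[i]];
            for (int j = 0; j < data[i].length; j++) {
                double diff = data[i][j] - center[j];
                sum += diff * diff;
            }
        }
        return sum;
    }
}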
11. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
12. Evaluation on Real Dataset
• UCI’s Water Treatment Plant data set: daily sensor measurements from an urban waste water treatment plant (n=396, d=38)
• Ran 2 × 500 clustering runs (500 for k-means, 500 for k-means++) with k=13, and recorded the RSS
• The difference is highly significant (P < 0.0001)
13. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500)
14. Alternative Seeding Algorithms
• There has been extensive research into seeding techniques for k-means.
• Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
• Maitra [2011] evaluated 11 techniques (including k-means++) but was unable to provide recommendations when evaluating nine standard real-world datasets.
• On simulated datasets, Maitra recommends Milligan’s [1980] or Mirkin’s [2005] seeding technique, and Bradley’s [1998] when the dataset is very large.
15. Conclusions and Future Work
• Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS.
• A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
16. References
• Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035 (2007).
• Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable k-means++”. Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
• Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
• Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
• Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
• Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
• Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).