K-means++ Seeding Algorithm: Implementation in MLDemos

Renaud Richardet
Brain Mind Institute
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch
K-means

•  K-means: a widely used clustering technique
•  Initialization: blind random selection among the input data points
•  Drawback: very sensitive to the choice of initial cluster centers (seeds)
•  A local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering (the objective is written out below)
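
For reference, the objective that k-means minimizes (the potential φ in Arthur and Vassilvitskii's [2007] notation) is the total squared distance from each point to its nearest center:

$$\varphi \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2$$

"Arbitrarily bad" means that the ratio between φ at a local optimum and φ at the global optimum is unbounded.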
K-means++

•  A seeding technique for k-means from Arthur and Vassilvitskii [2007]
•  Idea: spread the k initial cluster centers away from each other
•  O(log k)-competitive with the optimal clustering
•  Substantial convergence-time speedups (empirical)
Algorithm

c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest cluster center that has already been chosen
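The deck describes the procedure only through these definitions and the log output that follows; below is a minimal, illustrative Python sketch of k-means++ seeding under those definitions. It is not the MLDemos C++ or Apache Commons Math code, and the function names are assumptions.

import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeanspp_seeds(X, k, rng=None):
    """Choose k initial centers from the data points X by k-means++ seeding.

    Mirrors the numbered steps in the sample output below: the first
    center is drawn uniformly at random; then each round computes
    distSqSum, draws a "random index" in [0, distSqSum), selects the
    corresponding point as the new center, and updates each minDist.
    """
    rng = rng or random.Random()
    centers = [rng.choice(X)]                       # first center, uniform
    min_dist = [sq_dist(x, centers[0]) for x in X]  # initial minDist = D(x)^2

    while len(centers) < k:
        dist_sq_sum = sum(min_dist)                 # distSqSum
        r = rng.uniform(0.0, dist_sq_sum)           # random index
        cumulative = 0.0
        for i, d in enumerate(min_dist):            # D(x)^2-weighted pick
            cumulative += d
            if cumulative >= r:
                centers.append(X[i])                # new cluster point
                break
        new_center = centers[-1]
        min_dist = [min(d, sq_dist(x, new_center))  # updating minDist
                    for d, x in zip(min_dist, X)]
    return centers

For example, kmeanspp_seeds(points, k=4) on the 4-squares test dataset below would typically return one point from each square, since the D(x)²-weighted draw favors points far from the centers chosen so far.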
Implementation

•  Based on Apache Commons Math’s KMeansPlusPlusClusterer and Arthur’s [2007] implementation
•  Implemented directly in MLDemos’ core
Implementation Test Dataset: 4 squares (n=16)

Expected: 4 well-separated clusters
Sample Output

  1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
  1: initial minDist for 0  [-1.0;-1.0] = 10.0
  1: initial minDist for 1  [ 2.0; 1.0] = 17.0
  1: initial minDist for 2  [ 1.0;-1.0] = 18.0
  1: initial minDist for 3  [-1.0;-2.0] = 17.0
  1: initial minDist for 5  [ 2.0; 2.0] = 16.0
  1: initial minDist for 6  [ 2.0;-2.0] = 32.0
  1: initial minDist for 7  [-1.0; 2.0] =  1.0
  1: initial minDist for 8  [-2.0;-2.0] = 16.0
  1: initial minDist for 9  [ 1.0; 1.0] = 10.0
  1: initial minDist for 10 [ 2.0;-1.0] = 25.0
  1: initial minDist for 11 [-2.0;-1.0] =  9.0
  […]
  2: picking cluster center 1 --------------
  3:   distSqSum=3345.0
  3:   random index 1532.706909
  4:   new cluster point: x=6 [2.0;-2.0]
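The distances in the log are squared Euclidean distances. For instance, with the first center at x=4 [-2.0; 2.0], the initial minDist for point 7 [-1.0; 2.0] works out as

$$D(x_7)^2 = (-1.0 - (-2.0))^2 + (2.0 - 2.0)^2 = 1.0,$$

matching the log line above.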
Sample Output (2)

  4:   updating minDist for 0  [-1.0;-1.0] = 10.0
  4:   updating minDist for 1  [ 2.0; 1.0] =  9.0
  4:   updating minDist for 2  [ 1.0;-1.0] =  2.0
  4:   updating minDist for 3  [-1.0;-2.0] =  9.0
  4:   updating minDist for 5  [ 2.0; 2.0] = 16.0
  4:   updating minDist for 7  [-1.0; 2.0] = 25.0
  4:   updating minDist for 8  [-2.0;-2.0] = 16.0
  4:   updating minDist for 9  [ 1.0; 1.0] = 10.0
  4:   updating minDist for 10 [ 2.0;-1.0] =  1.0
  4:   updating minDist for 11 [-2.0;-1.0] = 17.0
  […]
  2: picking cluster center 2 -----------------
  3:   distSqSum=961.0
  3:   random index 103.404701
  4:   new cluster point: x=1 [2.0;1.0]
  4:   updating minDist for 0  [-1.0;-1.0] = 13.0
  […]
Evaluation on Test Dataset

•  200 clustering runs, each with and without k-means++ initialization
•  Measured the RSS (intra-class variance; see the sketch below)
•  K-means: optimal clustering 115 times (57.5%)
•  K-means++: optimal clustering 182 times (91%)
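
As a concrete statement of the measure, here is a minimal Python sketch of the RSS; the function and argument names are illustrative assumptions, not the MLDemos code.

def sq_dist(a, b):
    # Squared Euclidean distance (same helper as in the seeding sketch).
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def rss(X, centers, assignment):
    """Residual sum of squares (intra-class variance): the total squared
    distance from each point to the center of its assigned cluster."""
    return sum(sq_dist(x, centers[assignment[i]]) for i, x in enumerate(X))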
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
Evaluation on Real Dataset

•  UCI’s Water Treatment Plant data set: daily sensor measurements from an urban waste-water treatment plant (n=396, d=38)
•  Sampled 500 clustering runs each for k-means and k-means++ with k=13, and recorded the RSS
•  Difference highly significant (P < 0.0001); one way to run such a test is sketched below
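
The deck does not name the statistical test behind this P value. As one plausible way to compare the two samples of 500 RSS values, the sketch below uses a two-sided Mann-Whitney U test from SciPy; this choice is an assumption, not necessarily the author's method.

from scipy.stats import mannwhitneyu

def compare_rss(rss_kmeans, rss_kmeanspp):
    """Two-sided test of whether the two RSS distributions differ.

    rss_kmeans, rss_kmeanspp: lists of recorded RSS values, one per run
    (hypothetical names; substitute your own recorded values).
    """
    stat, p_value = mannwhitneyu(rss_kmeans, rss_kmeanspp,
                                 alternative="two-sided")
    return p_value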
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500)
Alternative Seeding Algorithms

•  There is extensive research into seeding techniques for k-means
•  Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use
•  Maitra [2011] evaluated 11 techniques (including k-means++) but was unable to provide recommendations when evaluating nine standard real-world datasets
•  On simulated datasets, Maitra recommends Milligan’s [1980] or Mirkin’s [2005] seeding technique, and Bradley’s [1998] when the dataset is very large
Conclusions and Future Work

•  Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS
•  A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements
References

•  Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035 (2007).
•  Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable K-Means++”. Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
•  Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
•  Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
•  Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
•  Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
•  Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).
