K-means++ Seeding Algorithm: Implementation in MLDemos

Renaud Richardet
Brain Mind Institute
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch
K-means

•  K-means: a widely used clustering technique
•  Initialization: blind random selection among the input data points
•  Drawback: very sensitive to the choice of initial cluster centers (seeds)
•  A local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering (the objective is written out below)
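
For reference, the objective that k-means minimizes (the potential φ in Arthur and Vassilvitskii's [2007] notation) is the total squared distance from each point to its nearest center:

$$\varphi \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2$$

"Arbitrarily bad" means that the ratio between φ at a local optimum and φ at the global optimum is unbounded.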
K-means++

•  A seeding technique for k-means from Arthur and Vassilvitskii [2007]
•  Idea: spread the k initial cluster centers away from each other
•  O(log k)-competitive with the optimal clustering
•  Substantial convergence-time speedups (empirical)
Algorithm

c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest cluster center that has already been chosen
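The deck describes the procedure only through these definitions and the log output that follows; below is a minimal, illustrative Python sketch of k-means++ seeding under those definitions. It is not the MLDemos C++ or Apache Commons Math code, and the function names are assumptions.

import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeanspp_seeds(X, k, rng=None):
    """Choose k initial centers from the data points X by k-means++ seeding.

    Mirrors the numbered steps in the sample output below: the first
    center is drawn uniformly at random; then each round computes
    distSqSum, draws a "random index" in [0, distSqSum), selects the
    corresponding point as the new center, and updates each minDist.
    """
    rng = rng or random.Random()
    centers = [rng.choice(X)]                       # first center, uniform
    min_dist = [sq_dist(x, centers[0]) for x in X]  # initial minDist = D(x)^2

    while len(centers) < k:
        dist_sq_sum = sum(min_dist)                 # distSqSum
        r = rng.uniform(0.0, dist_sq_sum)           # random index
        cumulative = 0.0
        for i, d in enumerate(min_dist):            # D(x)^2-weighted pick
            cumulative += d
            if cumulative >= r:
                centers.append(X[i])                # new cluster point
                break
        new_center = centers[-1]
        min_dist = [min(d, sq_dist(x, new_center))  # updating minDist
                    for d, x in zip(min_dist, X)]
    return centers

For example, kmeanspp_seeds(points, k=4) on the 4-squares test dataset below would typically return one point from each square, since the D(x)²-weighted draw favors points far from the centers chosen so far.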
Implementation

•  Based on Apache Commons Math’s KMeansPlusPlusClusterer and Arthur’s [2007] implementation
•  Implemented directly in MLDemos’ core
Implementation Test Dataset: 4 squares (n=16)

Expected: 4 well-separated clusters
Sample Output

  1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
  1: initial minDist for 0  [-1.0;-1.0] = 10.0
  1: initial minDist for 1  [ 2.0; 1.0] = 17.0
  1: initial minDist for 2  [ 1.0;-1.0] = 18.0
  1: initial minDist for 3  [-1.0;-2.0] = 17.0
  1: initial minDist for 5  [ 2.0; 2.0] = 16.0
  1: initial minDist for 6  [ 2.0;-2.0] = 32.0
  1: initial minDist for 7  [-1.0; 2.0] =  1.0
  1: initial minDist for 8  [-2.0;-2.0] = 16.0
  1: initial minDist for 9  [ 1.0; 1.0] = 10.0
  1: initial minDist for 10 [ 2.0;-1.0] = 25.0
  1: initial minDist for 11 [-2.0;-1.0] =  9.0
  […]
  2: picking cluster center 1 --------------
  3:   distSqSum=3345.0
  3:   random index 1532.706909
  4:   new cluster point: x=6 [2.0;-2.0]
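The distances in the log are squared Euclidean distances. For instance, with the first center at x=4 [-2.0; 2.0], the initial minDist for point 7 [-1.0; 2.0] works out as

$$D(x_7)^2 = (-1.0 - (-2.0))^2 + (2.0 - 2.0)^2 = 1.0,$$

matching the log line above.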
Sample Output (2)

  4:   updating minDist for 0  [-1.0;-1.0] = 10.0
  4:   updating minDist for 1  [ 2.0; 1.0] =  9.0
  4:   updating minDist for 2  [ 1.0;-1.0] =  2.0
  4:   updating minDist for 3  [-1.0;-2.0] =  9.0
  4:   updating minDist for 5  [ 2.0; 2.0] = 16.0
  4:   updating minDist for 7  [-1.0; 2.0] = 25.0
  4:   updating minDist for 8  [-2.0;-2.0] = 16.0
  4:   updating minDist for 9  [ 1.0; 1.0] = 10.0
  4:   updating minDist for 10 [ 2.0;-1.0] =  1.0
  4:   updating minDist for 11 [-2.0;-1.0] = 17.0
  […]
  2: picking cluster center 2 -----------------
  3:   distSqSum=961.0
  3:   random index 103.404701
  4:   new cluster point: x=1 [2.0;1.0]
  4:   updating minDist for 0  [-1.0;-1.0] = 13.0
  […]
Evaluation on Test Dataset

•  200 clustering runs, each with and without k-means++ initialization
•  Measured the RSS (intra-class variance; see the sketch below)
•  K-means: optimal clustering 115 times (57.5%)
•  K-means++: optimal clustering 182 times (91%)
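
As a concrete statement of the measure, here is a minimal Python sketch of the RSS; the function and argument names are illustrative assumptions, not the MLDemos code.

def sq_dist(a, b):
    # Squared Euclidean distance (same helper as in the seeding sketch).
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def rss(X, centers, assignment):
    """Residual sum of squares (intra-class variance): the total squared
    distance from each point to the center of its assigned cluster."""
    return sum(sq_dist(x, centers[assignment[i]]) for i, x in enumerate(X))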
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
Evaluation on Real Dataset

•  UCI’s Water Treatment Plant data set: daily sensor measurements from an urban waste-water treatment plant (n=396, d=38)
•  Sampled 500 clustering runs each for k-means and k-means++ with k=13, and recorded the RSS
•  Difference highly significant (P < 0.0001); one way to run such a test is sketched below
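
The deck does not name the statistical test behind this P value. As one plausible way to compare the two samples of 500 RSS values, the sketch below uses a two-sided Mann-Whitney U test from SciPy; this choice is an assumption, not necessarily the author's method.

from scipy.stats import mannwhitneyu

def compare_rss(rss_kmeans, rss_kmeanspp):
    """Two-sided test of whether the two RSS distributions differ.

    rss_kmeans, rss_kmeanspp: lists of recorded RSS values, one per run
    (hypothetical names; substitute your own recorded values).
    """
    stat, p_value = mannwhitneyu(rss_kmeans, rss_kmeanspp,
                                 alternative="two-sided")
    return p_value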
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500)
Alternative Seeding Algorithms

•  There is extensive research into seeding techniques for k-means
•  Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use
•  Maitra [2011] evaluated 11 techniques (including k-means++) but was unable to provide recommendations when evaluating nine standard real-world datasets
•  On simulated datasets, Maitra recommends Milligan’s [1980] or Mirkin’s [2005] seeding technique, and Bradley’s [1998] when the dataset is very large
Conclusions and Future Work

•  Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS
•  A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements
References

•  Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035 (2007).
•  Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable K-Means++”. Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
•  Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
•  Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
•  Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
•  Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
•  Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).
