K-means++ Seeding Algorithm,
Implementation in MLDemos

Renaud Richardet
Brain Mind Institute
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch

K-means
•  K-means: widely used clustering technique
•  Initialization: blind random selection from the input data
•  Drawback: very sensitive to the choice of initial cluster
   centers (seeds)
•  A local optimum can be arbitrarily bad w.r.t. the objective
   function, compared to the globally optimal clustering
K-means++
•  A seeding technique for k-means
   from Arthur and Vassilvitskii [2007]
•  Idea: spread the k initial cluster centers away from
   each other
•  O(log k)-competitive with the optimal clustering (in expectation)
•  Substantial convergence time speedups (empirical)
Algorithm
[Figure: algorithm steps, not recoverable from the extraction. Legend:]
c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest c_k that has already been chosen
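For reference, the seeding rule published by Arthur and Vassilvitskii [2007], to which this legend belongs, can be written as below. This is a restatement from the paper, not a recovery of the slide's figure; k denotes the number of centers to choose.

```latex
% k-means++ seeding (Arthur & Vassilvitskii, 2007), restated
\begin{enumerate}
  \item Choose the first center $c_1$ uniformly at random from $X$.
  \item Choose the next center $c_i = x' \in X$ with probability
        \[ P(x') = \frac{D(x')^2}{\sum_{x \in X} D(x)^2}. \]
  \item Repeat step 2 until $k$ centers have been chosen.
  \item Continue with the standard k-means iterations from these centers.
\end{enumerate}
```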
  	
  
	
  
Implementation
•  Based on Apache Commons Math’s
   KMeansPlusPlusClusterer and
   Arthur’s [2007] implementation
•  Implemented directly in MLDemos’ core
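The seeding idea can be sketched in a few lines of Python. This is a minimal illustration of D² weighting as described by Arthur and Vassilvitskii, not the MLDemos or Apache Commons Math code; the function names `sq_dist` and `kmeanspp_seeds` and the toy points are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (sequences of floats)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Return the indices of k seed points chosen with D^2 weighting."""
    rng = rng or random.Random()
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # First center: uniformly at random.
    chosen = [rng.randrange(n)]
    # min_sq[i] holds D(x_i)^2, the squared distance to the nearest chosen center.
    min_sq = [sq_dist(p, points[chosen[0]]) for p in points]

    while len(chosen) < k:
        total = sum(min_sq)
        if total == 0.0:
            # Every point coincides with a chosen center: pick any unchosen index.
            nxt = rng.choice([i for i in range(n) if i not in chosen])
        else:
            # Draw r in [0, total) and walk the cumulative sum of the weights.
            r = rng.random() * total
            acc = 0.0
            nxt = None
            for i, w in enumerate(min_sq):
                acc += w
                if w > 0.0 and acc > r:
                    nxt = i
                    break
            if nxt is None:
                # Guard against floating-point round-off at the end of the walk.
                nxt = max(range(n), key=lambda i: min_sq[i])
        chosen.append(nxt)
        # Update D(x)^2 with the distance to the newly chosen center.
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_sq[i]:
                min_sq[i] = d
    return chosen


# Hypothetical toy data: three well-separated pairs of points.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (0.0, 10.0), (1.0, 10.0)]
print(kmeanspp_seeds(toy, 3, random.Random(0)))
```

Because already chosen points have weight zero, the weighted draw never selects the same point twice; far-away points are favored but not guaranteed.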
Implementation Test Dataset: 4 squares (n=16)
Expected: 4 nice clusters
Sample Output
  1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
  1: initial minDist for 0 [-1.0;-1.0] = 10.0
  1: initial minDist for 1 [ 2.0; 1.0] = 17.0
  1: initial minDist for 2 [ 1.0;-1.0] = 18.0
  1: initial minDist for 3 [-1.0;-2.0] = 17.0
  1: initial minDist for 5 [ 2.0; 2.0] = 16.0
  1: initial minDist for 6 [ 2.0;-2.0] = 32.0
  1: initial minDist for 7 [-1.0; 2.0] =  1.0
  1: initial minDist for 8 [-2.0;-2.0] = 16.0
  1: initial minDist for 9 [ 1.0; 1.0] = 10.0
  1: initial minDist for 10[ 2.0;-1.0] = 25.0
  1: initial minDist for 11[-2.0;-1.0] =  9.0
     […]
  2: picking cluster center 1 --------------
  3:   distSqSum=3345.0
  3:   random index 1532.706909
  4:   new cluster point: x=6 [2.0;-2.0]
Sample Output (2)
  4:   updating minDist for 0 [-1.0;-1.0] = 10.0
  4:   updating minDist for 1 [ 2.0; 1.0] =  9.0
  4:   updating minDist for 2 [ 1.0;-1.0] =  2.0
  4:   updating minDist for 3 [-1.0;-2.0] =  9.0
  4:   updating minDist for 5 [ 2.0; 2.0] = 16.0
  4:   updating minDist for 7 [-1.0; 2.0] = 25.0
  4:   updating minDist for 8 [-2.0;-2.0] = 16.0
  4:   updating minDist for 9 [ 1.0; 1.0] = 10.0
  4:   updating minDist for 10[ 2.0;-1.0] =  1.0
  4:   updating minDist for 11[-2.0;-1.0] = 17.0
       […]
  2: picking cluster center 2 -----------------
  3:   distSqSum=961.0
  3:   random index 103.404701
  4:   new cluster point: x=1 [2.0;1.0]
  4:   updating minDist for 0 [-1.0;-1.0] = 13.0
  […]
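The printed numbers can be re-derived by hand, which helps when reading the log. The sketch below uses only the twelve points visible in the excerpt (indices 0 to 11, with 4 and 6 inferred from the "x=4" and "x=6" lines) and does not reconstruct the elided lines. It is an observation about the printed values, not a statement about MLDemos internals: the logged minDist values coincide with squared Euclidean distances to the newest center, and each distSqSum coincides with the sum of the squared running minima. The helper names `sq_dist` and `pick` are assumptions for the example.

```python
# Points exactly as printed in the sample output.
points = {
    0: (-1.0, -1.0), 1: (2.0, 1.0), 2: (1.0, -1.0), 3: (-1.0, -2.0),
    4: (-2.0, 2.0), 5: (2.0, 2.0), 6: (2.0, -2.0), 7: (-1.0, 2.0),
    8: (-2.0, -2.0), 9: (1.0, 1.0), 10: (2.0, -1.0), 11: (-2.0, -1.0),
}


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def pick(weights, r):
    """Walk the cumulative sum in index order; return the first index reaching r."""
    acc = 0.0
    for i in sorted(weights):
        acc += weights[i]
        if acc >= r:
            return i
    return None


# Round 1: first center is x=4; "initial minDist" matches these values.
initial = {i: sq_dist(p, points[4]) for i, p in points.items()}
w1 = {i: v ** 2 for i, v in initial.items()}
print(initial[0], initial[6], sum(w1.values()))   # 10.0 32.0 3345.0
print(pick(w1, 1532.706909))                      # 6

# Round 2: second center is x=6; "updating minDist" matches to_second,
# while distSqSum matches the running minimum over both centers.
to_second = {i: sq_dist(p, points[6]) for i, p in points.items()}
running = {i: min(initial[i], to_second[i]) for i in points}
w2 = {i: v ** 2 for i, v in running.items()}
print(to_second[7], running[7], sum(w2.values()))  # 25.0 1.0 961.0
print(pick(w2, 103.404701))                        # 1

# Round 3: the single visible line, distance from point 0 to x=1.
print(sq_dist(points[0], points[1]))               # 13.0
```

Under this reading, the "random index" is a draw in [0, distSqSum) and the new center is the first point whose cumulative weight reaches it, which agrees with both picks shown (x=6, then x=1).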
  
Evaluation on Test Dataset
•  200 clustering runs, each with and without k-means++
   initialization
•  Measure RSS (residual sum of squares; intra-class variance)
•  K-means:
   optimal clustering 115 times (57.5%)
•  K-means++:
   optimal clustering 182 times (91%)
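As a reminder of what is being measured, the RSS of a clustering can be computed as the sum of squared distances from each point to its assigned center. The sketch below is an illustration under that common definition; the function name `rss` and the toy data are assumptions, and the last line only re-derives the two percentages quoted above.

```python
def rss(points, centers, labels):
    """Sum of squared distances from each point to the center it is assigned to."""
    total = 0.0
    for p, lab in zip(points, labels):
        c = centers[lab]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total


# Hypothetical toy clustering: two pairs of points, each pair 2 units apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(rss(points, centers, labels))  # 4.0

# Success rates reported on the slide: 115 and 182 optimal runs out of 200.
print(115 / 200, 182 / 200)          # 0.575 0.91
```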
[Figure] Comparison of the frequency distribution of
RSS values between k-means and k-means++
on the evaluation dataset (n=200)
Evaluation on Real Dataset
•  UCI’s Water Treatment Plant data set:
   daily measures of sensors in an urban waste water
   treatment plant (n=396, d=38)
•  Ran two samples of 500 clustering runs, one for k-means
   and one for k-means++, with k=13, and recorded RSS
•  Difference highly significant (P < 0.0001)
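The slides do not say which significance test produced this P value. One distribution-free way to compare two samples of RSS values is a permutation test on the difference of means, sketched below. The function name `permutation_p_value` and the sample values are made up for illustration and are not the recorded results.

```python
import random


def permutation_p_value(a, b, n_perm=10000, rng=None):
    """Two-sided permutation test on the absolute difference of sample means."""
    rng = rng or random.Random(0)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Placeholder RSS samples (NOT the values from the experiment).
rss_random_init = [100.0 + i for i in range(20)]
rss_kmeanspp = [50.0 + i for i in range(20)]
print(permutation_p_value(rss_random_init, rss_kmeanspp, n_perm=2000))
```

With n_perm permutations the smallest reportable value is 1 / (n_perm + 1), so resolving a threshold such as 0.0001 needs more than 10,000 permutations.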
[Figure] Comparison of the frequency distribution of
RSS values between k-means and k-means++
on the UCI real world dataset (n=500)
Alternative Seeding Algorithms
•  Extensive research into seeding techniques for
   k-means.
•  Steinley [2007] evaluated 12 different techniques
   (omitting k-means++) and recommends multiple
   random starting points for general use.
•  Maitra [2011] evaluated 11 techniques (including
   k-means++) but was unable to provide recommendations
   when evaluating nine standard real-world datasets.
•  Maitra also analyzed simulated datasets and
   recommends using Milligan’s [1980] or Mirkin’s
   [2005] seeding technique, and Bradley’s [1998]
   when the dataset is very large.
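The "multiple random starting points" recommendation attributed to Steinley can be illustrated as follows: run plain k-means from several random seedings and keep the run with the lowest RSS. This is a minimal sketch, not any of the cited authors' procedures; `lloyd`, `best_of_restarts`, and the toy data are assumptions for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Plain k-means (Lloyd iterations) from given centers; returns (centers, rss)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    rss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, rss


def best_of_restarts(points, k, restarts=10, rng=None):
    """Run k-means from `restarts` random seedings and keep the lowest-RSS result."""
    rng = rng or random.Random()
    best = None
    for _ in range(restarts):
        start = rng.sample(points, k)  # blind random seeding from the data
        centers, rss = lloyd(points, start)
        if best is None or rss < best[1]:
            best = (centers, rss)
    return best


# Hypothetical toy data: two well-separated triangles.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, rss = best_of_restarts(data, 2, restarts=20, rng=random.Random(1))
print(sorted(centers), rss)
```

Restarts trade extra computation for a lower chance of reporting a poor local optimum, whereas k-means++ tries to make each single start better.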
Conclusions and Future Work
•  Using a synthetic test dataset and a real-world
   dataset, we showed that our implementation of
   the k-means++ seeding procedure in the
   MLDemos software package yields a significant
   reduction of the RSS.
•  A short literature survey revealed that many
   seeding procedures exist for k-means, and that
   some alternatives to k-means++ might yield
   even larger improvements.
References
•    Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful
     seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on
     Discrete algorithms, 1027–1035 (2007).
•    Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable
     K-Means++”. Unpublished working paper available at
     http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
•    Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means
     clustering”. Proc. 15th International Conf. on Machine Learning, 91–99
     (1998).
•    Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of
     different methods for initializing the K-means clustering algorithm”.
     Unpublished working paper available at
     http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
•    Milligan, G. W.: “The validation of four ultrametric clustering algorithms”.
     Pattern Recognition, vol. 12, 41–50 (1980).
•    Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman
     and Hall (2005).
•    Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical
     evaluation of several techniques”. Journal of Classification 24, 99–121
     (2007).

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

K-Means++ Seeding Algorithm Implementation

  • 1. K-means++ Seeding Algorithm, Implementation in MLDemos. Renaud Richardet, Brain Mind Institute, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, renaud.richardet@epfl.ch
  • 2. K-means
      •  K-means: a widely used clustering technique
      •  Initialization: blind random selection from the input data
      •  Drawback: very sensitive to the choice of initial cluster centers (seeds)
      •  A local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering
  • 3. K-means++
      •  A seeding technique for k-means from Arthur and Vassilvitskii [2007]
      •  Idea: spread the k initial cluster centers away from each other
      •  O(log k)-competitive with the optimal clustering
      •  Substantial speedups in convergence time (empirical)
  • 4. Algorithm
      c ∈ C: cluster center
      x ∈ X: data point
      D(x): distance between x and the nearest c_k that has already been chosen
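The seeding procedure that these definitions belong to can be sketched as follows. In Arthur and Vassilvitskii [2007], the first center is drawn uniformly at random from X, and each further center is drawn from X with probability proportional to D(x)². This is a minimal Python sketch under that description, not the MLDemos or Apache Commons Math code; the names `squared_distance` and `kmeanspp_seeds` are hypothetical, and the 16-point layout in the usage lines is an assumed toy dataset.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k seeds from `points` (hypothetical helper, not the MLDemos code)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = rng or random.Random()

    # First center: uniform at random from the data.
    seeds = [rng.choice(points)]
    # min_d2[i] holds D(x_i)^2, the squared distance to the nearest chosen seed.
    min_d2 = [squared_distance(x, seeds[0]) for x in points]

    while len(seeds) < k:
        if sum(min_d2) == 0:
            # Every point coincides with a seed; fall back to a uniform draw.
            index = rng.randrange(len(points))
        else:
            # Draw an index with probability proportional to D(x)^2.
            index = rng.choices(range(len(points)), weights=min_d2, k=1)[0]
        seeds.append(points[index])
        # Only the new seed can reduce a point's nearest-seed distance.
        min_d2 = [min(d, squared_distance(x, points[index]))
                  for d, x in zip(min_d2, points)]
    return seeds


# Assumed toy data: 16 points with coordinates in {-2, -1, 1, 2}.
points = [(x, y) for x in (-2, -1, 1, 2) for y in (-2, -1, 1, 2)]
print(kmeanspp_seeds(points, 4, random.Random(0)))
```

Because an already chosen point has D(x)² = 0, it cannot be drawn again while other points remain, so the seeds are distinct whenever the data points are distinct.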
  • 5. Implementation
      •  Based on Apache Commons Math’s KMeansPlusPlusClusterer and Arthur’s [2007] implementation
      •  Implemented directly in MLDemos’ core
  • 6. Implementation Test Dataset: 4 squares (n=16)
  • 7. Expected: 4 nice clusters
  • 8. Sample Output
      1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
      1: initial minDist for 0 [-1.0;-1.0] = 10.0
      1: initial minDist for 1 [ 2.0; 1.0] = 17.0
      1: initial minDist for 2 [ 1.0;-1.0] = 18.0
      1: initial minDist for 3 [-1.0;-2.0] = 17.0
      1: initial minDist for 5 [ 2.0; 2.0] = 16.0
      1: initial minDist for 6 [ 2.0;-2.0] = 32.0
      1: initial minDist for 7 [-1.0; 2.0] =  1.0
      1: initial minDist for 8 [-2.0;-2.0] = 16.0
      1: initial minDist for 9 [ 1.0; 1.0] = 10.0
      1: initial minDist for 10[ 2.0;-1.0] = 25.0
      1: initial minDist for 11[-2.0;-1.0] =  9.0
      […]
      2: picking cluster center 1 --------------
      3:   distSqSum=3345.0
      3:   random index 1532.706909
      4:   new cluster point: x=6 [2.0;-2.0]
  • 9. Sample Output (2)
      4:   updating minDist for 0 [-1.0;-1.0] = 10.0
      4:   updating minDist for 1 [ 2.0; 1.0] =  9.0
      4:   updating minDist for 2 [ 1.0;-1.0] =  2.0
      4:   updating minDist for 3 [-1.0;-2.0] =  9.0
      4:   updating minDist for 5 [ 2.0; 2.0] = 16.0
      4:   updating minDist for 7 [-1.0; 2.0] = 25.0
      4:   updating minDist for 8 [-2.0;-2.0] = 16.0
      4:   updating minDist for 9 [ 1.0; 1.0] = 10.0
      4:   updating minDist for 10[ 2.0;-1.0] =  1.0
      4:   updating minDist for 11[-2.0;-1.0] = 17.0
      […]
      2: picking cluster center 2 -----------------
      3:   distSqSum=961.0
      3:   random index 103.404701
      4:   new cluster point: x=1 [2.0;1.0]
      4:   updating minDist for 0 [-1.0;-1.0] = 13.0
      […]
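The first pick in the sample output can be retraced with a cumulative-sum walk. The sketch below is an illustration only: it assumes the sampler visits the points in index order and weights each one by the square of its logged minDist value, which is inferred from the numbers and not stated in the slides. Only the eleven values visible in the log are used; the elided lines are not reconstructed. The helper name `pick_index` is hypothetical.

```python
def pick_index(weighted_points, threshold):
    """Return the index of the first point whose running weight total
    reaches `threshold` (roulette-wheel selection)."""
    cumulative = 0.0
    for index, weight in weighted_points:
        cumulative += weight
        if cumulative >= threshold:
            return index
    raise ValueError("threshold exceeds the total weight")


# (point index, logged initial minDist) pairs copied from the sample output;
# point 4 is the first center and therefore has no entry.
logged = [(0, 10.0), (1, 17.0), (2, 18.0), (3, 17.0), (5, 16.0), (6, 32.0),
          (7, 1.0), (8, 16.0), (9, 10.0), (10, 25.0), (11, 9.0)]

# Assumed weighting: the square of each logged value.
weights = [(index, value * value) for index, value in logged]

print(sum(weight for _, weight in weights))  # 3345.0
print(pick_index(weights, 1532.706909))      # 6
```

Under these assumptions the running total passes the logged random value 1532.706909 at point 6, which agrees with the logged `new cluster point: x=6`, and the visible weights sum to the logged `distSqSum=3345.0`.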
  • 10. Evaluation on Test Dataset
      •  200 clustering runs, each with and without k-means++ initialization
      •  Measure RSS (intra-class variance)
      •  K-means: optimal clustering 115 times (57.5%)
      •  K-means++: optimal clustering 182 times (91%)
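The RSS used as the quality measure here can be computed as the sum of squared distances from each point to its nearest cluster center. The sketch below is a minimal illustration; the function name `rss`, the 16-point layout, and both sets of centers are assumptions chosen for the example, not the dataset or results from the slides.

```python
def rss(points, centers):
    """Residual sum of squares: each point contributes its squared
    Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Assumed toy data: four 2x2 squares, one per quadrant (16 points).
points = [(x, y) for x in (-2, -1, 1, 2) for y in (-2, -1, 1, 2)]

# One center in the middle of each square.
good_centers = [(1.5, 1.5), (-1.5, 1.5), (-1.5, -1.5), (1.5, -1.5)]
# Two centers share one square, so the lower-right square has none of its own.
poor_centers = [(1.5, 1.5), (-1.5, 1.5), (-1.5, -1.5), (-1.0, -1.0)]

print(rss(points, good_centers))  # 8.0
print(rss(points, poor_centers) > rss(points, good_centers))  # True
```

A run counts as reaching the optimal clustering when its RSS equals the lowest achievable value, which is why comparing RSS across many runs separates good seedings from poor ones.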
  • 11. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
  • 12. Evaluation on Real Dataset
      •  UCI’s Water Treatment Plant data set: daily sensor measurements in an urban waste water treatment plant (n=396, d=38)
      •  Ran 500 clustering runs each for k-means and k-means++ with k=13, and recorded the RSS
      •  Difference highly significant (P < 0.0001)
  • 13. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500)
  • 14. Alternative Seeding Algorithms
      •  Extensive research exists on seeding techniques for k-means.
      •  Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
      •  Maitra [2011] evaluated 11 techniques (including k-means++) and was unable to provide recommendations when evaluating nine standard real-world datasets.
      •  On simulated datasets, Maitra recommends Milligan’s [1980] or Mirkin’s [2005] seeding technique, and Bradley’s [1998] when the dataset is very large.
  • 15. Conclusions and Future Work
      •  Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS.
      •  A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
  • 16. References
      •  Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035 (2007).
      •  Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable K-Means++”. Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
      •  Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
      •  Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
      •  Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
      •  Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
      •  Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).