SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
Data Analysis Course
Cluster Analysis
Venkat Reddy
Contents
• What is the need of Segmentation
• Introduction to Segmentation & Cluster analysis
• Applications of Cluster Analysis
• Types of Clusters
• K-Means clustering
DataAnalysisCourse
VenkatReddy
2
What is the need of segmentation?
Problem:
• 10,000 Customers - we know their age, city name, income,
employment status, designation
• You have to sell 100 Blackberry phones(each costs $1000) to
the people in this group. You have maximum of 7 days
• If you start giving demos to each individual, 10,000 demos will
take more than one year. How will you sell maximum number
of phones by giving minimum number of demos?
DataAnalysisCourse
VenkatReddy
3
What is the need of segmentation?
Solution
• Divide the whole population into two groups employed / unemployed
• Further divide the employed population into two groups high/low salary
• Further divide that group into high /low designation
DataAnalysisCourse
VenkatReddy
4
10000
customers
Unemployed
3000
Employed
7000
Low salary
5000
High Salary
2000
Low
Designation
1800
High
Designation
200
Segmentation and Cluster Analysis
• Cluster is a group of similar objects (cases, points, observations,
examples, members, customers, patients, locations, etc)
• Finding the groups of cases/observations/ objects in the
population such that the objects are
• Homogeneous within the group (high intra-class similarity)
• Heterogeneous between the groups(low inter-class similarity )
DataAnalysisCourse
VenkatReddy
5
Inter-cluster
distances are
maximized
Intra-cluster distances are
minimized
DataAnalysisCourse
VenkatReddy
Applications of Cluster Analysis
• Market Segmentation: Grouping people (with the willingness,
purchasing power, and the authority to buy) according to their
similarity in several dimensions related to a product under
consideration.
• Sales Segmentation: Clustering can tell you what types of customers
buy what products
• Credit Risk: Segmentation of customers based on their credit history
• Operations: High performer segmentation & promotions based on
person’s performance
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost.
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Geographical: Identification of areas of similar land use in an earth
observation database.
DataAnalysisCourse
VenkatReddy
6
Types of Clusters
DataAnalysisCourse
VenkatReddy
7
• Partitional clustering or non-hierarchical : A division
of objects into non-overlapping subsets (clusters) such
that each object is in exactly one cluster
• The non-hierarchical methods divide a dataset of N
objects into M clusters.
• K-means clustering, a non-hierarchical technique, is
the most commonly used one in business analytics
• Hierarchical clustering: A set of nested clusters
organized as a hierarchical tree
• The hierarchical methods produce a set of nested
clusters in which each pair of objects or clusters is
progressively nested in a larger cluster until only one
cluster remains
• CHAID tree is most widely used in business analytics
Cluster Analysis -Example
DataAnalysisCourse
VenkatReddy
8
Maths Science Gk Apt
Student-1 94 82 87 89
Student-2 46 67 33 72
Student-3 98 97 93 100
Student-4 14 5 7 24
Student-5 86 97 95 95
Student-6 34 32 75 66
Student-7 69 44 59 55
Student-8 85 90 96 89
Student-9 24 26 15 22
Maths Science Gk Apt
Student-1 94 82 87 89
Student-2 46 67 33 72
Student-3 98 97 93 100
Student-4 14 5 7 24
Student-5 86 97 95 95
Student-6 34 32 75 66
Student-7 69 44 59 55
Student-8 85 90 96 89
Student-9 24 26 15 22
Maths Science Gk Apt
Student-4 14 5 7 24
Student-9 24 26 15 22
Student-6 34 32 75 66
Student-2 46 67 33 72
Student-7 69 44 59 55
Student-8 85 90 96 89
Student-5 86 97 95 95
Student-1 94 82 87 89
Student-3 98 97 93 100
4,9,6
2,7
8,5,1,3
Building Clusters
1. Select a distance measure
2. Select a clustering algorithm
3. Define the distance between two clusters
4. Determine the number of clusters
5. Validate the analysis
DataAnalysisCourse
VenkatReddy
9
• The aim is to build clusters i.e divide the whole population into group of similar
objects
• What is similarity/dis-similarity?
• How do you define distance between two clusters
Dissimilarity & Similarity
DataAnalysisCourse
VenkatReddy
10
Weight
Cust1 68
Cust2 72
Cust3 100
Weight Age
Cust1 68 25
Cust2 72 70
Cust3 100 28
Weight Age Income
Cust1 68 25 60,000
Cust2 72 70 9,000
Cust3 100 28 62,000
Which two customers are similar?
Which two customers are similar now?
Which two customers are similar in
this case?
Quantify dissimilarity-Distancemeasures
• To measure similarity between two observations a
distance measure is needed. With a single variable,
similarity is straightforward
• Example: income – two individuals are similar if their income
level is similar and the level of dissimilarity increases as the
income gap increases
• Multiple variables require an aggregate distance
measure
• Many characteristics (e.g. income, age, consumption habits,
family composition, owning a car, education level, job…), it
becomes more difficult to define similarity with a single value
• The most known measure of distance is the Euclidean
distance, which is the concept we use in everyday life for
spatial coordinates.
DataAnalysisCourse
VenkatReddy
11
Examples of distances
DataAnalysisCourse
VenkatReddy
12
 
2
1
n
ij ki kj
k
D x x

 
1
n
ij ki kj
k
D x x

 
Euclidean distance
City-block (Manhattan) distance
A
B
A
B
Dij distance between cases i and j xkj - value of variable xk for case j
Other distance measures: Chebychev, Minkowski, Mahalanobis,
maximum distance, cosine similarity, simple correlation between
observations etc.,


















npx...nfx...n1x
...............
ipx...ifx...i1x
...............
1px...1fx...11x
















0...)2,()1,(
:::
)2,3()
...ndnd
0dd(3,1
0d(2,1)
0
Data matrix Dissimilarity matrix
Calculating the distance
DataAnalysisCourse
VenkatReddy
13
Weight
Cust1 68
Cust2 72
Cust3 100
• Cust1 vs Cust2 :- (68-72)= 4
• Cust2 vs Cust3 :- (72-100) = 28
• Cust3 vs Cust1 :- (100-68) =32
Weight Age
Cust1 68 25
Cust2 72 70
Cust3 100 28
• Cust1 vs Cust2 :- sqrt((68-72)^2 + (25-70)^2)=44.9
• Cust2 vs Cust3 :- 50.54
• Cust3 vs Cust1 :- 32.14
Demo: Calculation of distance
proc distance data=cust_data out=Dist method=Euclid nostd;
var interval(Credit_score Expenses);
run;
proc print data=Dist;
run;
DataAnalysisCourse
VenkatReddy
14
Lab: Distance Calculation
proc distance data=cust_data out=Count_Dist method=Euclid
nostd;
var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate);
run;
proc print data=Count_Dist;
run;
DataAnalysisCourse
VenkatReddy
15
Clustering algorithms
• k-means clustering algorithm
• Fuzzy c-means clustering algorithm
• Hierarchical clustering algorithm
• Gaussian(EM) clustering algorithm
• Quality Threshold (QT) clustering algorithm
• MST based clustering algorithm
• Density based clustering algorithm
• kernel k-means clustering algorithm
DataAnalysisCourse
VenkatReddy
16
K -Means Clustering – Algorithm
1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is provided
1. First k elements
2. Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the
nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Or simply
Initialize k cluster centers
Do
Assignment step: Assign each data point to its closest cluster center
Re-estimation step: Re-compute cluster centers
While (there are still changes in the cluster centers)
DataAnalysisCourse
VenkatReddy
17
K-Means clustering
DataAnalysisCourse
VenkatReddy
18
Overall population
K-Means clustering
DataAnalysisCourse
VenkatReddy
19
Fix the number of clusters
K-Means clustering
DataAnalysisCourse
VenkatReddy
20
Calculate the distance of
each case from all clusters
K-Means clustering
DataAnalysisCourse
VenkatReddy
21
Assign each case to nearest
cluster
K-Means clustering
DataAnalysisCourse
VenkatReddy
22
Re calculate the cluster
centers
K-Means clustering
DataAnalysisCourse
VenkatReddy
23
K-Means clustering
DataAnalysisCourse
VenkatReddy
24
K-Means clustering
DataAnalysisCourse
VenkatReddy
25
K-Means clustering
DataAnalysisCourse
VenkatReddy
26
K-Means clustering
DataAnalysisCourse
VenkatReddy
27
K-Means clustering
DataAnalysisCourse
VenkatReddy
28
K-Means clustering
DataAnalysisCourse
VenkatReddy
29
Reassign after changing the
cluster centers
K-Means clustering
DataAnalysisCourse
VenkatReddy
30
K-Means clustering
DataAnalysisCourse
VenkatReddy
31
Continue till there is no
significant change between
two iterations
K Means clustering in action
DataAnalysisCourse
VenkatReddy
32
• Dividing the data into 10 clusters using K-Means
Distance metric will
decide cluster for
these points
K-Means Clustering SAS Demo
proc fastclus data= sup_market radius=0 replace=full
maxclusters =5 maxiter =20 distance out=clustr_out;
id cust_id;
Var age family_size income spend visit_Other_shops;
run;
DataAnalysisCourse
VenkatReddy
33
• A Supermarket wanted to send some promotional coupons to 100
families
• The idea is to identify 100 customers with medium income and low
recent spends
Lab: K- Means Clustering
• Download contact center agents data
• The performance data contains
• Average handling time
• Average number of calls
• CSAT
• Resolution score
• Identify top 10 agents for promotion based on below criteria
• High C_SAT
• High Resolution
• Low Average handling time
• High number of calls
DataAnalysisCourse
VenkatReddy
34
SAS Code Options
• The RADIUS= option establishes the minimum distance criterion for
selecting new seeds. No observation is considered as a new seed unless its
minimum distance to previous seeds exceeds the value given by the
RADIUS= option. The default value is 0.
• The MAXCLUSTERS= option specifies the maximum number of clusters
allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
• The REPLACE= option specifies how seed replacement is performed.
• FULL :requests default seed replacement.
• PART :requests seed replacement only when the distance between the
observation and the closest seed is greater than the minimum distance between
seeds.
• NONE : suppresses seed replacement.
• RANDOM :Selects a simple pseudo-random sample of complete observations as
initial cluster seeds.
DataAnalysisCourse
VenkatReddy
35
SAS Code & Options
• The MAXITER= option specifies the maximum number of iterations for re
computing cluster seeds. When the value of the MAXITER= option is greater
than 0, each observation is assigned to the nearest seed, and the seeds are
recomputed as the means of the clusters.
• The LIST option lists all observations, giving the value of the ID variable (if
any), the number of the cluster to which the observation is assigned, and
the distance between the observation and the final cluster seed.
• The DISTANCE option computes distances between the cluster means.
• The ID variable, which can be character or numeric, identifies observations
on the output when you specify the LIST option.
• The VAR statement lists the numeric variables to be used in the cluster
analysis. If you omit the VAR statement, all numeric variables not listed in
other statements are used.
DataAnalysisCourse
VenkatReddy
36
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) =
dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) =
dist(Mi, Mj) Medoid: a chosen, centrally located object in the cluster
DataAnalysisCourse
VenkatReddy
37
X X
SAS output interpretation
• RMSSTD - Pooled standard deviation of all the variables forming the
cluster.(Variance within a cluster) Since the objective of cluster analysis is to
form homogeneous groups, the
• RMSSTD of a cluster should be as small as possible
• SPRSQ -Semipartial R-squared is a measure of the homogeneity of merged
clusters, so SPRSQ is the loss of homogeneity due to combining two groups
or clusters to form a new group or cluster. (error incurred by combining two
groups)
• Thus, the SPRSQ value should be small to imply that we are merging two
homogeneous groups
DataAnalysisCourse
VenkatReddy
38
SAS output interpretation
• RSQ (R-squared) measures the extent to which groups or clusters
are different from each other. (Variance between the clusters)
• So, when you have just one cluster RSQ value is, intuitively, zero).
Thus, the RSQ value should be high.
• Centroid Distance is simply the Euclidian distance between the
centroid of the two clusters that are to be joined or merged.
• So, Centroid Distance is a measure of the homogeneity of merged
clusters and the value should be small.
DataAnalysisCourse
VenkatReddy
39
Distance Calculation on
standardized data
DataAnalysisCourse
VenkatReddy
40
Weight Income
Cust1 68 60,000
Cust2 72 9,000
Cust3 100 62,000
Average 80 43667
Stdev 14 24527
Weight Income
Cust1 -0.84 0.67
Cust2 -0.56 -1.41
Cust3 1.40 0.75

Más contenido relacionado

La actualidad más candente

Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning ClusteringRupak Roy
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Modelspetitegeek
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Decision Trees
Decision TreesDecision Trees
Decision TreesStudent
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and ldaSuresh Pokharel
 
Spectral clustering
Spectral clusteringSpectral clustering
Spectral clusteringSOYEON KIM
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?Kazuki Yoshida
 
Introduction to unsupervised learning: outlier detection
Introduction to unsupervised learning: outlier detectionIntroduction to unsupervised learning: outlier detection
Introduction to unsupervised learning: outlier detectionJoseph Itopa Abubakar
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)Cory Cook
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptxNTUConcepts1
 

La actualidad más candente (20)

Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Clustering
ClusteringClustering
Clustering
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning Clustering
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and lda
 
Clustering
ClusteringClustering
Clustering
 
Spectral clustering
Spectral clusteringSpectral clustering
Spectral clustering
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?
 
Resampling methods
Resampling methodsResampling methods
Resampling methods
 
Introduction to unsupervised learning: outlier detection
Introduction to unsupervised learning: outlier detectionIntroduction to unsupervised learning: outlier detection
Introduction to unsupervised learning: outlier detection
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
 

Destacado

Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniquesVenkata Reddy Konasani
 
Individual movements and geographical data mining. Clustering algorithms for ...
Individual movements and geographical data mining. Clustering algorithms for ...Individual movements and geographical data mining. Clustering algorithms for ...
Individual movements and geographical data mining. Clustering algorithms for ...Beniamino Murgante
 
Homotopic Frechet Distance Between Curves
Homotopic Frechet Distance Between CurvesHomotopic Frechet Distance Between Curves
Homotopic Frechet Distance Between Curvesshripadthite
 
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' TrajectoriesSpatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' TrajectoriesCentre of Geographic Sciences (COGS)
 
Trajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus AlgorithmTrajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus AlgorithmIván Sanchez Vera
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced featuresVenkata Reddy Konasani
 

Destacado (9)

Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Individual movements and geographical data mining. Clustering algorithms for ...
Individual movements and geographical data mining. Clustering algorithms for ...Individual movements and geographical data mining. Clustering algorithms for ...
Individual movements and geographical data mining. Clustering algorithms for ...
 
Homotopic Frechet Distance Between Curves
Homotopic Frechet Distance Between CurvesHomotopic Frechet Distance Between Curves
Homotopic Frechet Distance Between Curves
 
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' TrajectoriesSpatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
 
Trajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus AlgorithmTrajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus Algorithm
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 

Similar a Cluster Analysis for Dummies

26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)Sri Prasanna
 
CS8091_BDA_Unit_II_Clustering
CS8091_BDA_Unit_II_ClusteringCS8091_BDA_Unit_II_Clustering
CS8091_BDA_Unit_II_ClusteringPalani Kumar
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmIJERA Editor
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniquestalktoharry
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 

Similar a Cluster Analysis for Dummies (20)

26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Vi sem
Vi semVi sem
Vi sem
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Knn 160904075605-converted
Knn 160904075605-convertedKnn 160904075605-converted
Knn 160904075605-converted
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)
 
CS8091_BDA_Unit_II_Clustering
CS8091_BDA_Unit_II_ClusteringCS8091_BDA_Unit_II_Clustering
CS8091_BDA_Unit_II_Clustering
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
DM_clustering.ppt
DM_clustering.pptDM_clustering.ppt
DM_clustering.ppt
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 

Más de Venkata Reddy Konasani

Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Venkata Reddy Konasani
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Venkata Reddy Konasani
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1Venkata Reddy Konasani
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval Venkata Reddy Konasani
 

Más de Venkata Reddy Konasani (20)

Transformers 101
Transformers 101 Transformers 101
Transformers 101
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
ARIMA
ARIMA ARIMA
ARIMA
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1
 
Big data Introduction by Mohan
Big data Introduction by MohanBig data Introduction by Mohan
Big data Introduction by Mohan
 
Data Analyst - Interview Guide
Data Analyst - Interview GuideData Analyst - Interview Guide
Data Analyst - Interview Guide
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval
 

Último

Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxannathomasp01
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 

Último (20)

Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 

Cluster Analysis for Dummies

  • 1. Data Analysis Course Cluster Analysis Venkat Reddy
  • 2. Contents • What is the need of Segmentation • Introduction to Segmentation & Cluster analysis • Applications of Cluster Analysis • Types of Clusters • K-Means clustering DataAnalysisCourse VenkatReddy 2
  • 3. What is the need of segmentation? Problem: • 10,000 Customers - we know their age, city name, income, employment status, designation • You have to sell 100 Blackberry phones(each costs $1000) to the people in this group. You have maximum of 7 days • If you start giving demos to each individual, 10,000 demos will take more than one year. How will you sell maximum number of phones by giving minimum number of demos? DataAnalysisCourse VenkatReddy 3
  • 4. What is the need of segmentation? Solution • Divide the whole population into two groups employed / unemployed • Further divide the employed population into two groups high/low salary • Further divide that group into high /low designation DataAnalysisCourse VenkatReddy 4 10000 customers Unemployed 3000 Employed 7000 Low salary 5000 High Salary 2000 Low Designation 1800 High Designation 200
  • 5. Segmentation and Cluster Analysis • Cluster is a group of similar objects (cases, points, observations, examples, members, customers, patients, locations, etc) • Finding the groups of cases/observations/ objects in the population such that the objects are • Homogeneous within the group (high intra-class similarity) • Heterogeneous between the groups(low inter-class similarity ) DataAnalysisCourse VenkatReddy 5 Inter-cluster distances are maximized Intra-cluster distances are minimized DataAnalysisCourse VenkatReddy
  • 6. Applications of Cluster Analysis • Market Segmentation: Grouping people (with the willingness, purchasing power, and the authority to buy) according to their similarity in several dimensions related to a product under consideration. • Sales Segmentation: Clustering can tell you what types of customers buy what products • Credit Risk: Segmentation of customers based on their credit history • Operations: High performer segmentation & promotions based on person’s performance • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost. • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Geographical: Identification of areas of similar land use in an earth observation database. DataAnalysisCourse VenkatReddy 6
  • 7. Types of Clusters DataAnalysisCourse VenkatReddy 7 • Partitional clustering or non-hierarchical : A division of objects into non-overlapping subsets (clusters) such that each object is in exactly one cluster • The non-hierarchical methods divide a dataset of N objects into M clusters. • K-means clustering, a non-hierarchical technique, is the most commonly used one in business analytics • Hierarchical clustering: A set of nested clusters organized as a hierarchical tree • The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains • CHAID tree is most widely used in business analytics
  • 8. Cluster Analysis -Example DataAnalysisCourse VenkatReddy 8 Maths Science Gk Apt Student-1 94 82 87 89 Student-2 46 67 33 72 Student-3 98 97 93 100 Student-4 14 5 7 24 Student-5 86 97 95 95 Student-6 34 32 75 66 Student-7 69 44 59 55 Student-8 85 90 96 89 Student-9 24 26 15 22 Maths Science Gk Apt Student-1 94 82 87 89 Student-2 46 67 33 72 Student-3 98 97 93 100 Student-4 14 5 7 24 Student-5 86 97 95 95 Student-6 34 32 75 66 Student-7 69 44 59 55 Student-8 85 90 96 89 Student-9 24 26 15 22 Maths Science Gk Apt Student-4 14 5 7 24 Student-9 24 26 15 22 Student-6 34 32 75 66 Student-2 46 67 33 72 Student-7 69 44 59 55 Student-8 85 90 96 89 Student-5 86 97 95 95 Student-1 94 82 87 89 Student-3 98 97 93 100 4,9,6 2,7 8,5,1,3
  • 9. Building Clusters 1. Select a distance measure 2. Select a clustering algorithm 3. Define the distance between two clusters 4. Determine the number of clusters 5. Validate the analysis DataAnalysisCourse VenkatReddy 9 • The aim is to build clusters i.e divide the whole population into group of similar objects • What is similarity/dis-similarity? • How do you define distance between two clusters
  • 10. Dissimilarity & Similarity DataAnalysisCourse VenkatReddy 10 Weight Cust1 68 Cust2 72 Cust3 100 Weight Age Cust1 68 25 Cust2 72 70 Cust3 100 28 Weight Age Income Cust1 68 25 60,000 Cust2 72 70 9,000 Cust3 100 28 62,000 Which two customers are similar? Which two customers are similar now? Which two customers are similar in this case?
  • 11. Quantify dissimilarity-Distancemeasures • To measure similarity between two observations a distance measure is needed. With a single variable, similarity is straightforward • Example: income – two individuals are similar if their income level is similar and the level of dissimilarity increases as the income gap increases • Multiple variables require an aggregate distance measure • Many characteristics (e.g. income, age, consumption habits, family composition, owning a car, education level, job…), it becomes more difficult to define similarity with a single value • The most known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates. DataAnalysisCourse VenkatReddy 11
  • 12. Examples of distances DataAnalysisCourse VenkatReddy 12   2 1 n ij ki kj k D x x    1 n ij ki kj k D x x    Euclidean distance City-block (Manhattan) distance A B A B Dij distance between cases i and j xkj - value of variable xk for case j Other distance measures: Chebychev, Minkowski, Mahalanobis, maximum distance, cosine similarity, simple correlation between observations etc.,                   npx...nfx...n1x ............... ipx...ifx...i1x ............... 1px...1fx...11x                 0...)2,()1,( ::: )2,3() ...ndnd 0dd(3,1 0d(2,1) 0 Data matrix Dissimilarity matrix
  • 13. Calculating the distance DataAnalysisCourse VenkatReddy 13 Weight Cust1 68 Cust2 72 Cust3 100 • Cust1 vs Cust2 :- (68-72)= 4 • Cust2 vs Cust3 :- (72-100) = 28 • Cust3 vs Cust1 :- (100-68) =32 Weight Age Cust1 68 25 Cust2 72 70 Cust3 100 28 • Cust1 vs Cust2 :- sqrt((68-72)^2 + (25-70)^2)=44.9 • Cust2 vs Cust3 :- 50.54 • Cust3 vs Cust1 :- 32.14
  • 14. Demo: Calculation of distance proc distance data=cust_data out=Dist method=Euclid nostd; var interval(Credit_score Expenses); run; proc print data=Dist; run; DataAnalysisCourse VenkatReddy 14
  • 15. Lab: Distance Calculation proc distance data=cust_data out=Count_Dist method=Euclid nostd; var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate); run; proc print data=Count_Dist; run; DataAnalysisCourse VenkatReddy 15
  • 16. Clustering algorithms • k-means clustering algorithm • Fuzzy c-means clustering algorithm • Hierarchical clustering algorithm • Gaussian(EM) clustering algorithm • Quality Threshold (QT) clustering algorithm • MST based clustering algorithm • Density based clustering algorithm • kernel k-means clustering algorithm DataAnalysisCourse VenkatReddy 16
  • 17. K -Means Clustering – Algorithm 1. The number k of clusters is fixed 2. An initial set of k “seeds” (aggregation centres) is provided 1. First k elements 2. Other seeds (randomly selected or explicitly defined) 3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed 4. New seeds are computed 5. Go back to step 3 until no reclassification is necessary Or simply Initialize k cluster centers Do Assignment step: Assign each data point to its closest cluster center Re-estimation step: Re-compute cluster centers While (there are still changes in the cluster centers) DataAnalysisCourse VenkatReddy 17
  • 31. K-Means clustering DataAnalysisCourse VenkatReddy 31 Continue till there is no significant change between two iterations
  • 32. K Means clustering in action DataAnalysisCourse VenkatReddy 32 • Dividing the data into 10 clusters using K-Means Distance metric will decide cluster for these points
  • 33. K-Means Clustering SAS Demo proc fastclus data= sup_market radius=0 replace=full maxclusters =5 maxiter =20 distance out=clustr_out; id cust_id; Var age family_size income spend visit_Other_shops; run; DataAnalysisCourse VenkatReddy 33 • A Supermarket wanted to send some promotional coupons to 100 families • The idea is to identify 100 customers with medium income and low recent spends
  • 34. Lab: K- Means Clustering • Download contact center agents data • The performance data contains • Average handling time • Average number of calls • CSAT • Resolution score • Identify top 10 agents for promotion based on below criteria • High C_SAT • High Resolution • Low Average handling time • High number of calls DataAnalysisCourse VenkatReddy 34
  • 35. SAS Code Options • The RADIUS= option establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0. • The MAXCLUSTERS= option specifies the maximum number of clusters allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed. • The REPLACE= option specifies how seed replacement is performed. • FULL :requests default seed replacement. • PART :requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds. • NONE : suppresses seed replacement. • RANDOM :Selects a simple pseudo-random sample of complete observations as initial cluster seeds. DataAnalysisCourse VenkatReddy 35
  • 36. SAS Code & Options • The MAXITER= option specifies the maximum number of iterations for re computing cluster seeds. When the value of the MAXITER= option is greater than 0, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters. • The LIST option lists all observations, giving the value of the ID variable (if any), the number of the cluster to which the observation is assigned, and the distance between the observation and the final cluster seed. • The DISTANCE option computes distances between the cluster means. • The ID variable, which can be character or numeric, identifies observations on the output when you specify the LIST option. • The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used. DataAnalysisCourse VenkatReddy 36
  • 37. Distance between Clusters • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq) • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq) • Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq) • Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj) • Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj) Medoid: a chosen, centrally located object in the cluster DataAnalysisCourse VenkatReddy 37 X X
  • 38. SAS output interpretation • RMSSTD - Pooled standard deviation of all the variables forming the cluster.(Variance within a cluster) Since the objective of cluster analysis is to form homogeneous groups, the • RMSSTD of a cluster should be as small as possible • SPRSQ -Semipartial R-squared is a measure of the homogeneity of merged clusters, so SPRSQ is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster. (error incurred by combining two groups) • Thus, the SPRSQ value should be small to imply that we are merging two homogeneous groups DataAnalysisCourse VenkatReddy 38
  • 39. SAS output interpretation • RSQ (R-squared) measures the extent to which groups or clusters are different from each other. (Variance between the clusters) • So, when you have just one cluster RSQ value is, intuitively, zero). Thus, the RSQ value should be high. • Centroid Distance is simply the Euclidian distance between the centroid of the two clusters that are to be joined or merged. • So, Centroid Distance is a measure of the homogeneity of merged clusters and the value should be small. DataAnalysisCourse VenkatReddy 39
  • 40. Distance Calculation on standardized data DataAnalysisCourse VenkatReddy 40 Weight Income Cust1 68 60,000 Cust2 72 9,000 Cust3 100 62,000 Average 80 43667 Stdev 14 24527 Weight Income Cust1 -0.84 0.67 Cust2 -0.56 -1.41 Cust3 1.40 0.75