Prof. Pier Luca Lanzi
Clustering: Introduction
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Readings
•  Mining of Massive Datasets (Chapter 7, Section 3.5)
Clustering algorithms group a collection of data points
into “clusters” according to some distance measure
Data points in the same cluster should have
a small distance from one another
Data points in different clusters should be at
a large distance from one another.
Clustering finds “natural” grouping/structure in unlabeled data
(Unsupervised Learning)
What is Cluster Analysis?
•  A cluster is a collection of data objects
§ Similar to one another within the same cluster
§ Dissimilar to the objects in other clusters
•  Cluster analysis
§ Given a set of data points, try to understand their structure
§ Finds similarities between data according to the characteristics
found in the data
§ Groups similar data objects into clusters
§ It is unsupervised learning since there are no predefined classes
•  Typical applications
§ Stand-alone tool to get insight into data
§ Preprocessing step for other algorithms
Clustering Methods
•  Hierarchical vs. point assignment
•  Numeric and/or symbolic data
•  Deterministic vs. probabilistic
•  Exclusive vs. overlapping
•  Hierarchical vs. flat
•  Top-down vs. bottom-up
Clustering Applications
•  Marketing
§ Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing
programs
•  Land use
§ Identification of areas of similar land use in an earth observation
database
•  Insurance
§ Identifying groups of motor insurance policy holders with a high
average claim cost
•  City planning
§ Identifying groups of houses according to their house type, value,
and geographical location
•  Earthquake studies
§ Observed earthquake epicenters should be clustered along
continental faults
What Is Good Clustering?
•  A good clustering consists of high quality clusters with
§ High intra-class similarity
§ Low inter-class similarity
•  The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
•  The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns
•  Evaluation
§ Various measures of intra-/inter-cluster similarity
§ Manual inspection
§ Benchmarking on existing labels
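One simple way to look at intra- versus inter-cluster similarity is to compare average pairwise distances. The sketch below is only an illustration: the function name `mean_intra_inter`, the toy points, and the choice of Euclidean distance are assumptions, not something prescribed by the slides.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one pair inside a cluster and one pair across clusters.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # Same label: the pair lies within one cluster; otherwise across two.
        (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A small intra-cluster mean together with a large inter-cluster mean corresponds to high intra-class and low inter-class similarity.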
Measure the Quality of Clustering
•  Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, typically a metric d(i, j)
•  There is a separate “quality” function that measures the “goodness” of
a cluster
•  The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
•  Weights should be associated with different variables based on
applications and data semantics
•  It is hard to define “similar enough” or “good enough” as the answer is
typically highly subjective
Data Structures
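Data Matrix (n objects, p variables)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Example data:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
…         …     …         …      …

Dis/Similarity Matrix (n × n)

\[
\begin{bmatrix}
0      \\
d(2,1) & 0      \\
d(3,1) & d(3,2) & 0      \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]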
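The relation between the two structures can be sketched in a few lines: given a data matrix with one row per object, the dissimilarity matrix stores d(i, j) for every pair, and only the lower triangle is needed because d is symmetric. The function name `dissimilarity_matrix`, the toy data, and the use of Euclidean distance are assumptions made for illustration.

```python
from math import dist


def dissimilarity_matrix(data):
    """Lower-triangular n x n matrix of pairwise Euclidean distances.

    Row i holds d(i, 0), ..., d(i, i); the diagonal entry d(i, i) is 0.
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical numeric data matrix with n = 3 objects and p = 2 variables.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```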
Type of Data in Clustering Analysis
•  Interval-scaled variables
•  Binary variables
•  Nominal, ordinal, and ratio variables
•  Variables of mixed types
Distance Measures
•  Given a space and a set of points in this space, a distance
measure d(x,y) maps two points x and y to a real number
and satisfies four axioms
•  d(x,y) ≥ 0
•  d(x,y) = 0 if and only if x = y
•  d(x,y) = d(y,x)
•  d(x,y) ≤ d(x,z) + d(z,y)
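The axioms above can be checked numerically on a finite sample of points. Such a check can only refute a candidate distance, never prove that it is a metric. The helper name `satisfies_axioms`, the tolerance, and the sample points are assumptions made for this sketch; squared Euclidean distance is included as a well-known function that breaks the triangle inequality.

```python
from itertools import product
from math import dist


def satisfies_axioms(d, points, tol=1e-12):
    """Check the distance axioms on every pair and triple of a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                       # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):      # d(x, y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:      # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Squared Euclidean distance, which is not a metric."""
    return dist(x, y) ** 2


sample = [(0, 0), (1, 0), (0, 1), (2, 3)]
print(satisfies_axioms(dist, sample))
print(satisfies_axioms(squared_euclidean, sample))
```

On this sample the squared distance fails because going from (0, 0) to (2, 3) directly costs 13, while the detour through (1, 0) costs 1 + 10 = 11.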
Euclidean Distances
•  Lr-norm
•  Euclidean distance (r=2)
•  Manhattan distance (r=1)
•  L∞-norm

The most familiar distance measure is the one we normally think of as “distance.” An n-dimensional Euclidean space is one where points are vectors of n real numbers. The conventional distance measure in this space, which we shall refer to as the L2-norm, is defined:

\[
d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]

That is, we square the distance in each dimension, sum the squares, and take the positive square root.

It is easy to verify the first three requirements for a distance measure are satisfied. The Euclidean distance between two points cannot be negative, because the positive square root is intended. Since all squares of real numbers are nonnegative, any i such that x_i ≠ y_i forces the distance to be strictly positive. On the other hand, if x_i = y_i for all i, then the distance is clearly 0. Symmetry follows because (x_i - y_i)^2 = (y_i - x_i)^2. The triangle inequality requires a good deal of algebra to verify, but it is well understood to hold for Euclidean distance.

There are other distance measures that have been used for Euclidean spaces. For any constant r, we can define the Lr-norm to be the distance measure d defined by:

\[
d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}
\]

The case r = 2 is the usual L2-norm just mentioned. Another common distance measure is the L1-norm, or Manhattan distance. There, the distance between two points is the sum of the magnitudes of the differences in each dimension. The name “Manhattan distance” refers to travelling along a grid of city streets.
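The Lr-norm family can be sketched as a single function. The name `lr_distance` and the example points are assumptions; the L∞ case uses the standard limit of the Lr-norm, which is the largest per-dimension difference.

```python
def lr_distance(x, y, r):
    """L_r distance between two equal-length vectors.

    r = 1 gives Manhattan distance, r = 2 gives Euclidean distance, and
    r = float("inf") gives the L-infinity norm (largest coordinate gap).
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


x, y = (0, 0), (3, 4)
manhattan = lr_distance(x, y, 1)
euclidean = lr_distance(x, y, 2)
chebyshev = lr_distance(x, y, float("inf"))
print(manhattan, euclidean, chebyshev)
```

For these two points the three values are 7, 5, and 4: the distance shrinks as r grows, because larger r gives more weight to the single largest difference.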
Jaccard Distance
•  Jaccard distance is defined as d(x,y) = 1 - SIM(x,y), where SIM is
the Jaccard similarity, SIM(x,y) = |x ∩ y| / |x ∪ y|
•  SIM can also be interpreted as the percentage of identical
attributes
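A minimal sketch of the Jaccard distance on sets, assuming the standard definition of Jaccard similarity as the size of the intersection divided by the size of the union. The function name and the convention for two empty sets are choices made here, not taken from the slides.

```python
def jaccard_distance(x, y):
    """Jaccard distance 1 - |x ∩ y| / |x ∪ y| between two collections."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(x & y) / len(x | y)


# Hypothetical example: 2 shared items out of 6 distinct items overall.
d = jaccard_distance({1, 2, 3, 4}, {3, 4, 5, 6})
print(d)
```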
Cosine Distance
•  The cosine distance between x, y is the angle that the vectors to
those points make
•  This angle will be in the range 0 to 180 degrees, regardless of
how many dimensions the space has.
•  Example: given x = (1,2,-1) and y = (2,1,1), the angle between the
two vectors is 60 degrees, since x·y = 3 and |x| = |y| = √6 give cos θ = 3/6 = 0.5
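The example can be reproduced with a short function that computes the angle from the dot product and the vector lengths. The function name is an assumption; the clamp guards against floating-point values that fall just outside [-1, 1].

```python
from math import acos, degrees, sqrt


def cosine_distance_degrees(x, y):
    """Angle in degrees between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    # Clamp so rounding error cannot push the cosine outside acos's domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return degrees(acos(cos_theta))


angle = cosine_distance_degrees((1, 2, -1), (2, 1, 1))
print(round(angle, 6))
```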
Edit Distance
•  Used when the data points are strings
•  The distance between a string x=x1x2…xn and y=y1y2…ym is the smallest
number of insertions and deletions of single characters that will transform x
into y
•  Alternatively, the edit distance d(x, y) can be computed from the longest common
subsequence (LCS) of x and y as

d(x,y) = |x| + |y| - 2|LCS|
•  Example: the edit distance between x=abcde and y=acfdeg is 3 (delete b,
insert f, insert g); the LCS is acde, so d(x,y) = 5 + 6 - 2·4 = 3, which is consistent with the previous result
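The LCS formulation translates directly into a dynamic program. This is a sketch of the insert/delete-only edit distance described above (no substitutions); the function names are assumptions.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)   # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(x, y):
    """Insertions plus deletions needed to turn x into y: |x| + |y| - 2|LCS|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(lcs_length("abcde", "acfdeg"), edit_distance("abcde", "acfdeg"))
```

Every character outside the LCS must be either deleted from x or inserted from y, which is why the two lengths are added and the LCS is subtracted twice.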
Hamming Distance
•  Hamming distance between two vectors is the number of
components in which they differ
•  Or equivalently, given the number of variables n, and the number
m of matching components, we define
•  Example: the Hamming distance between the vectors 10101 and
11110 is 3.
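Counting the differing components takes one line. The function name and the length check are choices made for this sketch.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("10101", "11110"))
```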
Ordinal Variables
•  An ordinal variable can be discrete or continuous
•  Order is important, e.g., rank
•  It can be treated as an interval-scaled variable
§ replace x_if with its rank r_if ∈ {1, …, M_f}
§ map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1)
§ compute the dissimilarity using methods for interval-scaled variables
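The rank-based mapping onto [0, 1] can be sketched as follows. It assumes the usual normalization z = (r - 1) / (M - 1), where r is the rank of a value and M is the number of ordered levels; the function name and the example levels are hypothetical.

```python
def normalize_ordinal(values, order):
    """Map ordinal values onto [0, 1] via their rank in `order`.

    `order` lists the levels from lowest to highest and must contain at
    least two levels. Uses z = (rank - 1) / (M - 1), an assumed normalization.
    """
    rank = {level: r for r, level in enumerate(order, start=1)}
    m = len(order)
    return [(rank[v] - 1) / (m - 1) for v in values]


z = normalize_ordinal(["low", "high", "medium"], ["low", "medium", "high"])
print(z)
```

After this step the values can be compared with any interval-scaled distance, for example the absolute difference |z_i - z_j|.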
Requirements of Clustering in Data Mining
•  Scalability
•  Ability to deal with different types of attributes
•  Ability to handle dynamic data
•  Discovery of clusters with arbitrary shape
•  Minimal requirements for domain knowledge to determine input
parameters
•  Ability to deal with noise and outliers
•  Insensitivity to the order of input records
•  Ability to handle high dimensionality
•  Incorporation of user-specified constraints
•  Interpretability and usability
Curse of Dimensionality
In high dimensions, almost all pairs of points
are equally far away from one another.
Almost any two vectors are almost orthogonal.
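The first claim can be illustrated with a small simulation: draw random points in the unit cube and compare the spread of pairwise distances with their mean. This is a sketch under stated assumptions (uniform random points, Euclidean distance, a fixed seed, a hypothetical function name), not a proof, and it does not cover the orthogonality claim.

```python
import random
from itertools import combinations
from math import dist


def relative_spread(dim, n_points=50, seed=0):
    """(max - min) / mean over all pairwise distances of random points."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]
    ds = [dist(p, q) for p, q in combinations(pts, 2)]
    return (max(ds) - min(ds)) / (sum(ds) / len(ds))


low = relative_spread(2)      # distances vary widely in 2 dimensions
high = relative_spread(1000)  # distances concentrate around their mean
print(low, high)
```

A relative spread close to zero means the nearest and the farthest pair are almost equally distant, which makes distance-based clustering harder.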

 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIShubhangi Sonawane
 

Último (20)

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 

DMTM 2015 - 06 Introduction to Clustering

  • 1. Prof. Pier Luca Lanzi. Clustering: Introduction. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 2. Readings
      •  Mining of Massive Datasets (Chapter 7, Section 3.5)
  • 6. Clustering algorithms group a collection of data points into “clusters” according to some distance measure. Data points in the same cluster should have a small distance from one another. Data points in different clusters should be at a large distance from one another.
  • 7. Clustering finds “natural” grouping/structure in unlabeled data (Unsupervised Learning)
  • 8. What is Cluster Analysis?
      •  A cluster is a collection of data objects
          § Similar to one another within the same cluster
          § Dissimilar to the objects in other clusters
      •  Cluster analysis
          § Given a set of data points, try to understand their structure
          § Finds similarities between data according to the characteristics found in the data
          § Groups similar data objects into clusters
          § It is unsupervised learning since there are no predefined classes
      •  Typical applications
          § Stand-alone tool to get insight into data
          § Preprocessing step for other algorithms
  • 9. Clustering Methods
      •  Hierarchical vs. point assignment
      •  Numeric and/or symbolic data
      •  Deterministic vs. probabilistic
      •  Exclusive vs. overlapping
      •  Hierarchical vs. flat
      •  Top-down vs. bottom-up
  • 10. Clustering Applications
      •  Marketing
          § Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
      •  Land use
          § Identification of areas of similar land use in an earth observation database
      •  Insurance
          § Identifying groups of motor insurance policy holders with a high average claim cost
      •  City planning
          § Identifying groups of houses according to their house type, value, and geographical location
      •  Earthquake studies
          § Observed earthquake epicenters should be clustered along continental faults
  • 11. What Is Good Clustering?
      •  A good clustering consists of high-quality clusters with
          § High intra-class similarity
          § Low inter-class similarity
      •  The quality of a clustering result depends on both the similarity measure used by the method and its implementation
      •  The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
      •  Evaluation
          § Various measures of intra/inter-cluster similarity
          § Manual inspection
          § Benchmarking on existing labels
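One simple way to look at intra- versus inter-cluster similarity is to compare average pairwise distances. The sketch below is only an illustration: the function name `mean_intra_inter`, the use of Euclidean distance, and the toy points are assumptions, not something the slides prescribe.

```python
import itertools
import math


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one pair of points shares a label and at least one
    pair does not; otherwise a division by zero occurs.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in itertools.combinations(zip(points, labels), 2):
        # Pairs with the same label contribute to the intra-cluster average.
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Two tight, well-separated groups (hypothetical toy data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A small intra-cluster average together with a large inter-cluster average corresponds to the "high intra-class similarity, low inter-class similarity" criterion.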
  • 12. Measure the Quality of Clustering
      •  Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric, d(i, j)
      •  There is a separate “quality” function that measures the “goodness” of a cluster
      •  The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
      •  Weights should be associated with different variables based on applications and data semantics
      •  It is hard to define “similar enough” or “good enough” as the answer is typically highly subjective
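  • 13. Data Structures
      •  Data Matrix (one row per object, one column per variable)
         \[
         \begin{bmatrix}
         x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
         \vdots &        & \vdots &        & \vdots \\
         x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
         \vdots &        & \vdots &        & \vdots \\
         x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
         \end{bmatrix}
         \]
         Example:
         Outlook   Temp  Humidity  Windy  Play
         Sunny     Hot   High      False  No
         Sunny     Hot   High      True   No
         Overcast  Hot   High      False  Yes
         …         …     …         …      …
      •  Dis/Similarity Matrix
         \[
         \begin{bmatrix}
         0      &        &        &        &   \\
         d(2,1) & 0      &        &        &   \\
         d(3,1) & d(3,2) & 0      &        &   \\
         \vdots & \vdots & \vdots &        &   \\
         d(n,1) & d(n,2) & \cdots & \cdots & 0
         \end{bmatrix}
         \]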
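The relationship between the two structures can be sketched in a few lines: given a numeric data matrix, the lower-triangular dissimilarity matrix holds d(i, j) for every j ≤ i. The function name `dissimilarity_matrix`, the choice of Euclidean distance, and the sample rows are assumptions made for illustration.

```python
import math


def dissimilarity_matrix(data):
    """Lower-triangular dissimilarity matrix of an n x p numeric data matrix.

    Row i holds d(i, 0), ..., d(i, i); the diagonal entry d(i, i) is 0.
    """
    n = len(data)
    return [[math.dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical 3 x 2 data matrix (3 objects, 2 variables).
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because a distance is symmetric, so d(i, j) = d(j, i).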
  • 14. Types of Data in Clustering Analysis
      •  Interval-scaled variables
      •  Binary variables
      •  Nominal, ordinal, and ratio variables
      •  Variables of mixed types
  • 15. Distance Measures
  • 16. Distance Measures
      •  Given a space and a set of points in this space, a distance measure d(x,y) maps two points x and y to a real number and satisfies four axioms
      •  d(x,y) ≥ 0
      •  d(x,y) = 0 if and only if x = y
      •  d(x,y) = d(y,x)
      •  d(x,y) ≤ d(x,z) + d(z,y)
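The axioms can be checked numerically on a finite sample of points. This is only a sanity check, not a proof; the helper name `satisfies_axioms`, the tolerance, and the sample points are assumptions. Squared Euclidean distance is included as an example of a function that fails the triangle inequality.

```python
import itertools
import math


def satisfies_axioms(d, points, tol=1e-12):
    """Check the distance axioms for d on every pair and triple of points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # zero only for identical points
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


points = [(0, 0), (1, 0), (2, 0), (0, 1)]
print(satisfies_axioms(math.dist, points))          # Euclidean distance
print(satisfies_axioms(squared_euclidean, points))  # squared Euclidean
```

For the squared version, going from (0, 0) to (2, 0) costs 4, while the detour through (1, 0) costs 1 + 1, which violates the last axiom.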
  • 17. Euclidean Distances
      •  Lr-norm
      •  Euclidean distance (r=2)
      •  Manhattan distance (r=1)
      •  L∞-norm
      Textbook excerpt: The most familiar distance measure is the one we normally think of as “distance.” An n-dimensional Euclidean space is one where points are vectors of n real numbers. The conventional distance measure in this space, which we shall refer to as the L2-norm, is defined:
      \[ d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
      That is, we square the distance in each dimension, sum the squares, and take the positive square root. It is easy to verify the first three requirements for a distance measure are satisfied. The Euclidean distance between two points cannot be negative, because the positive square root is intended. Since all squares of real numbers are nonnegative, any i such that x_i ≠ y_i forces the distance to be strictly positive. On the other hand, if x_i = y_i for all i, then the distance is clearly 0. Symmetry follows because (x_i − y_i)^2 = (y_i − x_i)^2. The triangle inequality requires a good deal of algebra to verify. However, it is well understood to be a property of …
      There are other distance measures that have been used for Euclidean spaces. For any constant r, we can define the Lr-norm to be the distance measure d defined by:
      \[ d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \Big( \sum_{i=1}^{n} |x_i - y_i|^r \Big)^{1/r} \]
      The case r = 2 is the usual L2-norm just mentioned. Another common distance measure is the L1-norm, or Manhattan distance. There, the distance between two points is the sum of the magnitudes of the differences in each dimension. It is called “Manhattan distance” because it is the distance one would …
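The Lr-norm family can be written as one small function, with the L∞-norm handled separately as the maximum coordinate difference (the limit of the Lr-norm as r grows). The function names `lr_distance` and `linf_distance` and the sample points are illustrative assumptions.

```python
def lr_distance(x, y, r):
    """Lr-norm distance: (sum_i |x_i - y_i|^r)^(1/r), for r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def linf_distance(x, y):
    """L-infinity distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (3, 4)
manhattan = lr_distance(x, y, 1)   # |3| + |4|
euclidean = lr_distance(x, y, 2)   # sqrt(9 + 16)
chebyshev = linf_distance(x, y)    # max(3, 4)
print(manhattan, euclidean, chebyshev)
```

For the same pair of points the distance never increases as r increases, which is why the Manhattan value is the largest of the three here.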
  • 18. Jaccard Distance
      •  Jaccard distance is defined as d(x,y) = 1 – SIM(x,y), where SIM is the Jaccard similarity, SIM(x,y) = |x ∩ y| / |x ∪ y|
      •  SIM can also be interpreted as the percentage of identical attributes
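A minimal sketch of the Jaccard distance on Python sets follows. The function name `jaccard_distance`, the sample sets, and the convention that two empty sets have distance 0 are assumptions made for the example.

```python
def jaccard_distance(x, y):
    """Jaccard distance 1 - |x & y| / |x | y| between two collections of items."""
    x, y = set(x), set(y)
    if not x and not y:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(jaccard_distance(a, b))  # intersection has 2 items, union has 6
```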
  • 19. Cosine Distance
      •  The cosine distance between x and y is the angle between the vectors to those points
      •  This angle will be in the range 0 to 180 degrees, regardless of how many dimensions the space has
      •  Example: given x = (1,2,-1) and y = (2,1,1), the angle between the two vectors is 60 degrees
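The slide's example can be reproduced by computing the angle from the dot product and the vector lengths. The function name `cosine_distance_degrees` is an assumption; the clamping step only guards against floating-point rounding slightly outside [-1, 1].

```python
import math


def cosine_distance_degrees(x, y):
    """Angle in degrees between two non-zero vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = dot / (norm_x * norm_y)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


x = (1, 2, -1)
y = (2, 1, 1)
angle = cosine_distance_degrees(x, y)
print(angle)
```

Here the dot product is 3 and both vectors have length √6, so the cosine is 3/6 = 0.5.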
  • 20. Edit Distance
      •  Used when the data points are strings
      •  The distance between a string x=x1x2…xn and y=y1y2…ym is the smallest number of insertions and deletions of single characters that will transform x into y
      •  Alternatively, the edit distance d(x, y) can be computed from the longest common subsequence (LCS) of x and y as d(x,y) = |x| + |y| - 2|LCS|
      •  Example: the edit distance between x=abcde and y=acfdeg is 3 (delete b, insert f, insert g); the LCS is acde, so |x| + |y| - 2|LCS| = 5 + 6 - 8 = 3, which is consistent with the previous result
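The LCS-based formula can be sketched with the standard dynamic-programming computation of the LCS length. The function names `lcs_length` and `edit_distance` are assumptions; note that this is the insertion/deletion-only edit distance described on the slide, not the Levenshtein distance, which also allows substitutions.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)       # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of x or y
        prev = curr
    return prev[-1]


def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(lcs_length("abcde", "acfdeg"), edit_distance("abcde", "acfdeg"))
```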
  • 21. Hamming Distance
      •  The Hamming distance between two vectors is the number of components in which they differ
      •  Equivalently, given the number of variables n and the number m of matching components, the number of differing components is n - m
      •  Example: the Hamming distance between the vectors 10101 and 11110 is 3
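A short sketch of the Hamming distance, reproducing the example on the slide. The function name `hamming_distance` and the decision to reject inputs of different lengths are assumptions.

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


x, y = "10101", "11110"
differing = hamming_distance(x, y)
matching = sum(a == b for a, b in zip(x, y))
print(differing, len(x) - matching)  # both count the differing components
```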
  • 22. Ordinal Variables
      •  An ordinal variable can be discrete or continuous
      •  Order is important, e.g., rank
      •  It can be treated as interval-scaled
          § replace each value x_if with its rank
          § map the range of each variable onto [0, 1] by replacing the value of the i-th object in the f-th variable by its rescaled rank
          § compute the dissimilarity using methods for interval-scaled variables
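The three steps can be sketched as follows. The slide's own rescaling formula is not reproduced in the transcript, so the code assumes the common linear choice z = (r − 1) / (M − 1), where r is the rank (starting at 1) and M is the number of ordered levels; the function name `normalize_ordinal` and the level names are also hypothetical.

```python
def normalize_ordinal(values, ordered_levels):
    """Map ordinal values onto [0, 1] via their rank.

    Assumption: ranks run from 1 to M and are rescaled as (r - 1) / (M - 1).
    """
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


levels = ["low", "medium", "high"]          # ordered from smallest to largest
z = normalize_ordinal(["low", "high", "medium"], levels)
print(z)
```

Once the values lie in [0, 1], any distance for interval-scaled variables (for example the absolute difference) can be applied to them.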
  • 23. Requirements of Clustering in Data Mining
      •  Scalability
      •  Ability to deal with different types of attributes
      •  Ability to handle dynamic data
      •  Discovery of clusters with arbitrary shape
      •  Minimal requirements for domain knowledge to determine input parameters
      •  Able to deal with noise and outliers
      •  Insensitive to order of input records
      •  High dimensionality
      •  Incorporation of user-specified constraints
      •  Interpretability and usability
  • 24. Curse of Dimensionality
      •  In high dimensions, almost all pairs of points are equally far away from one another
      •  Almost any two vectors are almost orthogonal
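Both effects can be observed in a small simulation with random points. This is only an illustrative experiment: the function names, the uniform sampling, the number of points, and the fixed seed are all assumptions.

```python
import itertools
import math
import random


def relative_spread(dim, n_points=30, seed=0):
    """(max - min) / mean of pairwise Euclidean distances of random points in [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in itertools.combinations(pts, 2)]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))


def mean_abs_cosine(dim, n_points=30, seed=0):
    """Mean |cos(angle)| between random vectors with coordinates in [-1, 1]."""
    rng = random.Random(seed)
    pts = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_points)]

    def cosine(p, q):
        dot = sum(a * b for a, b in zip(p, q))
        return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

    cosines = [abs(cosine(p, q)) for p, q in itertools.combinations(pts, 2)]
    return sum(cosines) / len(cosines)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_spread(dim), 3), round(mean_abs_cosine(dim), 3))
```

As the dimension grows, the relative spread of the distances shrinks (points look equally far apart) and the mean absolute cosine approaches 0 (vectors look nearly orthogonal).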