This document summarizes two state-of-the-art clustering techniques: Support Vector Clustering (SVC) and Bregman Co-clustering. SVC is a two-phase process: (1) cluster description, which determines cluster boundaries via a minimum enclosing ball in feature space, and (2) cluster labeling, which assigns cluster labels by finding the connected components of a graph. Both methods are evaluated for robustness to missing or sparse data, high dimensionality, and noise/outliers. The document also discusses applications and desirable properties of these methods, such as handling nonlinearly separable problems and automatic detection of the number of clusters.
State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle
1. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II
State-of-the-art Clustering Techniques
Support Vector Methods and Minimum Bregman Information Principle
by VINCENZO RUSSO
SUPERVISOR: prof. Anna CORAZZA
CO-SUPERVISOR: prof. Ezio CATANZARITI
2. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Introduction
What is clustering?
Unsupervised learning: it groups a set of objects into subsets called clusters.
Clustering works on non-structured data: the objects are represented as points in a subspace of R^d, where d is the number of point components, also called attributes or features.
[Figure: a 3-cluster structure]
Several application domains: information retrieval, bioinformatics, cheminformatics, image retrieval, astrophysics, market segmentation, etc.
3. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Goals
Two state-of-the-art approaches
Support Vector Clustering (SVC)
Bregman Co-clustering
Goals | Application domain
Robustness w.r.t. Missing-valued Data | Astrophysics
Robustness w.r.t. Sparse Data | Textual documents
Robustness w.r.t. High "dimensionality" | Textual documents
Robustness w.r.t. Noise/Outliers | Synthetic data
Other desirable properties
Nonlinearly separable problems handling
Automatic detection of the number of clusters
Application domain independence
4. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering: the idea
Let X = {x_1, x_2, ..., x_n} be a dataset of n points, with X ⊆ R^d the data space.
A nonlinear transformation φ : X → F maps the input space X to some high-dimensional feature space F, wherein we look for the smallest enclosing sphere of radius R, i.e. the Minimum Enclosing Ball (MEB).
The MEB was originally used for estimating the Vapnik-Chervonenkis (VC) dimension (Vapnik, 1995). Later it was used for estimating the support of a high-dimensional distribution (Schölkopf et al., 1999). Finally, the MEB was used for the Support Vector Domain Description (SVDD), an SVM formulation for one-class classification (Tax, 2001; Tax and Duin, 1999a,b, 2004).
The SVDD is the basic step of the SVC (Ben-Hur et al., 2001) and allows describing the boundaries of clusters: mapping the sphere back to the data space (φ⁻¹ : F → X) splits it into contours which model the cluster boundaries. This first phase is the cluster description.
A second stage, called cluster labeling, determines the membership of points to clusters by finding the connected components of a graph; the name probably descends from the algorithms for finding connected components, which usually assign the "component labels" to the vertices.
An alternative SVM formulation for the same task, called One Class SVM, can be found in Schölkopf et al. (2000b) (see Appendix A).
5. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Phase I: Cluster description
Finding the Minimum Enclosing Ball (MEB): the nonlinear Support Vector Domain Description (SVDD)

\min_{R, a, \xi} \; R^2 + C \sum_{k=1}^{n} \xi_k
\quad \text{subject to} \quad \|\phi(x_k) - a\|^2 \le R^2 + \xi_k, \;\; \xi_k \ge 0, \;\; k = 1, 2, \dots, n

where a is the center of the sphere. Soft constraints are incorporated by adding the slack variables ξ_k; the real constant C (soft margin) provides a way to control outliers. In the majority of cases the mapping function φ is unknown, so the mapping is said to be implicit.
To solve this problem we introduce the Lagrangian

L(R, a, \xi; \beta, \mu) = R^2 + C \sum_{k=1}^{n} \xi_k - \sum_{k=1}^{n} \beta_k \left( R^2 + \xi_k - \|\phi(x_k) - a\|^2 \right) - \sum_{k=1}^{n} \mu_k \xi_k

with Lagrangian multipliers β_k ≥ 0 and µ_k ≥ 0.
Even if φ is implicit, we can perform an inner product in the feature space F by means of a kernel function K. Using nonlinear kernel transformations, we have a chance to transform a nonlinearly separable problem in the data space into a separable one in feature space.
Definition (Squared feature-space distance). Let x be a data point. We define the distance of its image in feature space, φ(x), from the center of the sphere, a, as follows:

d_R^2(x) = \|\phi(x) - a\|^2 = K(x, x) - 2 \sum_{k=1}^{n} \beta_k K(x_k, x) + \sum_{k=1}^{n} \sum_{l=1}^{n} \beta_k \beta_l K(x_k, x_l)

Since the solution vector β is sparse, i.e. only the Lagrangian multipliers associated with the support vectors are non-zero, the sums can be restricted to the support vectors.
Valid Mercer kernels. Several kernel functions are known to satisfy Mercer's conditions; some of them are:
• Linear kernel: K(x, y) = \langle x, y \rangle
• Polynomial kernel: K(x, y) = (\langle x, y \rangle + r)^k, r ≥ 0, where k is the degree
• Gaussian kernel: K(x, y) = e^{-q \|x - y\|^2}, q > 0
• Exponential kernel: K(x, y) = e^{-q \|x - y\|}, q > 0
The parameter q is called kernel width, a general term to indicate the scale at which data is analyzed; its precise mathematical meaning depends on the kernel (in the Gaussian kernel it is tied to the variance).
Complexity: the cluster description is a QP problem with O(n^3) worst-case running time; in practice it can be solved with Sequential Minimal Optimization (SMO)-like methods (Ben-Hur et al.).
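The formulas above translate almost directly into code. The sketch below is a minimal illustration, not the thesis software: it solves the SVDD dual with a generic SciPy optimizer (rather than the SMO-like solvers mentioned above) and evaluates the squared feature-space distance d_R²(x) for a Gaussian kernel. All names here (gaussian_kernel, svdd_fit, squared_radius_distance) are our own.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def svdd_fit(X, q=1.0, C=1.0):
    """Solve the SVDD dual: maximize sum_k b_k K(x_k, x_k) - b^T K b
    subject to sum(b) = 1, 0 <= b_k <= C. Returns the multipliers beta.
    (Generic SLSQP solve, used here only for illustration.)"""
    n = len(X)
    K = gaussian_kernel(X, X, q)
    diag = np.diag(K).copy()
    fun = lambda b: b @ K @ b - b @ diag          # negated dual objective
    jac = lambda b: 2.0 * (K @ b) - diag
    cons = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
    res = minimize(fun, np.full(n, 1.0 / n), jac=jac,
                   bounds=[(0.0, C)] * n, constraints=cons)
    return res.x

def squared_radius_distance(x, X, beta, q):
    """d_R^2(x) = K(x,x) - 2 sum_k beta_k K(x_k,x) + sum_kl beta_k beta_l K(x_k,x_l)."""
    Kx = gaussian_kernel(X, x[None, :], q).ravel()
    K = gaussian_kernel(X, X, q)
    return 1.0 - 2.0 * (beta @ Kx) + beta @ K @ beta   # K(x, x) = 1 for the Gaussian kernel
```

The squared radius R² can then be read off as d_R²(x_k) for any support vector with 0 < β_k < C; points with β_k = C are the bounded support vectors (outliers).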
6. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Phase II: Cluster labeling
Phase I only describes the clusters' boundaries.
Given a pair of data points that belong to different clusters, any path that connects them must exit from the sphere in the feature space; therefore, such a path contains a segment of points y such that d_R(y) > R. This leads to the definition of an adjacency matrix A between all pairs of points whose images lie in or on the sphere in feature space.
Let S_ij be the line segment connecting x_i and x_j. Then, for all i, j = 1, 2, ..., n,

A_{ij} = \begin{cases} 1 & \text{if } \forall y \in S_{ij}, \; d_R(y) \le R \\ 0 & \text{otherwise.} \end{cases}

Clusters are now defined as the connected components of the graph induced by the matrix A: each component is a cluster. Checking the line segment is implemented by sampling a number m of points between the starting point and the ending point; the exactness of the check depends on m.
The BSVs (Bounded Support Vectors) are unclassified by this procedure, since their images lie outside the enclosing sphere. One may decide either to leave them unclassified or to assign them to the cluster that they are closest to; generally, the latter is the most appropriate choice.
The original Phase II is a bottleneck (in the worst case). Alternatives:
Cone Cluster Labeling: best performance/accuracy rate
Gradient Descent
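A direct, deliberately naive rendering of the adjacency test: sample m interior points on each segment, check d_R(y) ≤ R, and take connected components. It reuses squared_radius_distance from the previous sketch; a real implementation would exclude the BSVs from the vertices and avoid recomputing kernels per sample point.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def cluster_labels(X, beta, q, R2, m=10):
    """A_ij = 1 iff every one of m sampled points y on the segment
    [x_i, x_j] satisfies d_R^2(y) <= R^2; clusters are then the
    connected components of the graph induced by A.
    (Naive O(n^2 m) sketch; BSVs are not excluded here.)"""
    n = len(X)
    A = np.eye(n, dtype=bool)
    ts = np.linspace(0.0, 1.0, m + 2)[1:-1]        # interior sample positions
    for i in range(n):
        for j in range(i + 1, n):
            ys = X[i] + ts[:, None] * (X[j] - X[i])
            inside = all(squared_radius_distance(y, X, beta, q) <= R2 for y in ys)
            A[i, j] = A[j, i] = inside
    return connected_components(A, directed=False)[1]  # per-point cluster labels
```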
7. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Pseudo-hierarchical execution
Parameters exploration
The greater the kernel width q, the greater the number
of support vectors (and so of clusters)
C rules the number of outliers and makes it possible to deal with
strongly overlapping clusters
Brute force approach unfeasible
Approaches proposed in literature
Secant-like algorithm for q exploration
No theoretically rooted method for C exploration
Data analysis is performed at different levels of detail
Pseudo-hierarchical: strict hierarchy not guaranteed
when ‘C < 1’, due to the Bounded Support Vectors
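The exploration loop itself is simple to state even though the width selection and stop policies are the delicate parts. A hedged sketch, reusing the helpers from the earlier sketches, with next_q (standing in for the secant-like width update) and stop (standing in for the validity-index stop criterion of the next slides) left as placeholder callables rather than the thesis algorithms:

```python
import numpy as np

def explore_q(X, q0, C, next_q, stop):
    """Pseudo-hierarchical SVC run: repeat cluster description (svdd_fit)
    and cluster labeling (cluster_labels) at increasing kernel widths q,
    collecting one clustering per width, until the stop criterion fires."""
    results, q = [], q0
    while not stop(results):
        beta = svdd_fit(X, q=q, C=C)
        sv = np.flatnonzero((beta > 1e-8) & (beta < C - 1e-8))  # unbounded SVs
        R2 = squared_radius_distance(X[sv[0]], X, beta, q)      # squared sphere radius
        results.append((q, cluster_labels(X, beta, q, R2)))
        q = next_q(q, results)      # placeholder for the secant-like update
    return results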
8. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Proposed improvements
Soft Margin C parameter selection
Heuristics: successfully applied in 90% of cases
Only 10 tests out of 100 needed further tuning; those 10 datasets
had a high percentage of missing values
New robust stop criterion
Based upon relative evaluation criteria (C-index, Dunn index, ad hoc)
Kernel width (q) selection
SVC integration: complexity reduced from O(Qn^3) to O(n_{sv}^2)
Softening strategy heuristics
For all normalized kernels
More kernels
Exponential (K(x, y) = e^{-q \|x - y\|}), Laplace (K(x, y) = e^{-q |x - y|})
9. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Improvements - Stop criterion
Dataset | Detected clusters | Actual clusters | Validity index
Iris | 1 | 3 | 1.00E-06
Iris | 3 | 3 | 0.13
Iris | 4 | 3 | 0.05
Breast | 1 | 2 | 1.00E-05
Breast | 2 | 2 | 0.80
Breast | 4 | 2 | 0.27
The bigger the validity index, the better the clustering found.
The stop criterion halts the process when the index value starts to decrease.
The idea: the SVC outputs quality-increasing clusterings before reaching the optimal clustering; after that, it provides quality-decreasing partitionings.
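Under this idea the criterion reduces to a one-line test on the history of index values. A minimal sketch, assuming the index itself (e.g. C-index or Dunn index per clustering) is computed elsewhere:

```python
def stop_on_decreasing_index(index_history):
    """Stop criterion sketch: halt as soon as the validity index starts
    to decrease; the clustering preceding the drop is kept as the best."""
    return len(index_history) >= 2 and index_history[-1] < index_history[-2]
```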
10. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Improvements - Kernel width selection
Dataset | Algorithm | Accuracy | Macroaveraging | # iter | # potential "q"
Iris | SVC | 88.00% | 87.69% | 2 | 9
Iris | + softening | 94.00% | 93.99% | 1 | 13
Iris | K-means | 85.33% | 85.11% | not applicable
Wine | SVC | 87.07% | 87.55% | 3 | 7
Wine | + softening | 93.26% | 93.91% | 2 | 6
Wine | K-means | 50.00% | 51.78% | not applicable
Syn02 | SVC | 88.80% | 100.00% | 8 | 18
Syn02 | + softening | 88.00% | 100.00% | 4 | 15
Syn02 | K-means | 68.40% | 63.84% | not applicable
Syn03 | SVC | 87.30% | 100.00% | 17 | 36
Syn03 | + softening | 87.30% | 100.00% | 6 | 31
Syn03 | K-means | 39.47% | 39.90% | not applicable
B. Cancer | SVC | 91.85% | 11.00% | 3 | 11
B. Cancer | + softening | 96.71% | 2.82% | 3 | 13
B. Cancer | K-means | 60.23% | 32.00% | not applicable
(For B. Cancer the two percentage columns report Benign accuracy and Contamination, respectively.)
11. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Support Vector Clustering
Improvements - non-Gaussian kernels
Exponential Kernel: improves the cluster separation in several cases
Dataset | Algorithm | Accuracy | Macroaveraging | # iter | # potential "q"
Iris | SVC + softening | 94.00% | 93.99% | 1 | 13
Iris | + Exp Kernel | 97.33% | 97.33% | 1 | 15
Iris | K-means | 85.33% | 85.11% | not applicable
CLA3 | SVC + softening | Failed - only one class out of 3 separated
CLA3 | + Exp Kernel | 94.00% | 93.99% | 1 | 11
CLA3 | K-means | 85.33% | 85.11% | not applicable
Laplace Kernel: improves or enables the cluster separation with normalized data
Dataset | Algorithm | Accuracy | # iter | # potential "q"
Quad | SVC + softening | Failed - no class separated
Quad | + Laplace Kernel | 99.94% | 1 | 17
Quad | K-means | 83.00% | not applicable
SG03 | SVC + softening | 73.15% | 3 | 19
SG03 | + Laplace Kernel | 91.04% | 1 | 16
SG03 | K-means | 50.24% | not applicable
12. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Minimum Bregman Information Principle
Bregman Co-clustering (BCC)
Co-clustering: simultaneous clustering of both rows and columns of a data matrix
Bregman framework: built on the Bregman divergences (Bregman, 1967), which form a large class of loss functions with a number of desirable properties; it generalizes the K-means strategy
Bregman Information (BI) and Minimum Bregman Information (MBI) principle
Meta-algorithm

Definition (Bregman divergence). Let φ be a real-valued strictly convex function of Legendre type defined on the convex set S ≡ dom(φ) ⊆ R^d. The Bregman divergence d_φ : S × ri(S) → [0, ∞) is defined as

d_\varphi(x_1, x_2) = \varphi(x_1) - \varphi(x_2) - \langle x_1 - x_2, \nabla\varphi(x_2) \rangle

where ∇φ is the gradient of φ, ⟨·, ·⟩ is the dot product, and ri(S) is the relative interior of S, i.e. the points of S that are intuitively not on its "edge": ri(C) = {x ∈ C : B(x, r) ∩ aff(C) ⊆ C for some r > 0}, where B(x, r) is the ball of radius r and center x (Boyd and Vandenberghe, 2004, app. A).
Example (Squared Euclidean distance). Squared Euclidean distance is perhaps the simplest and most widely used Bregman divergence. The underlying function φ(x) = ⟨x, x⟩ is strictly convex and differentiable in R^d, and

d_\varphi(x_1, x_2) = \langle x_1, x_1 \rangle - \langle x_2, x_2 \rangle - \langle x_1 - x_2, 2x_2 \rangle = \|x_1 - x_2\|^2

Definition (Bregman Information). Let X be a random variable that takes values in X = {x_i}_{i=1}^n ⊂ S ⊆ R^d following a positive probability measure ν, and let µ = E_ν[X]. The Bregman Information of X in terms of d_φ is defined as

I_\varphi(X) = E_\nu[d_\varphi(X, \mu)] = \sum_{i=1}^{n} \nu_i \, d_\varphi(x_i, \mu)

Proposition. Given a Bregman divergence d_φ : S × ri(S) → [0, ∞), the problem min_{s ∈ ri(S)} E_ν[d_φ(X, s)] has a unique solution given by s* = µ = E_ν[X].
Example (Variance). With the uniform measure ν_i = 1/n, the Bregman Information of X for the squared Euclidean distance is the variance:

I_\varphi(X) = \sum_{i=1}^{n} \nu_i \, d_\varphi(x_i, \mu) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \mu\|^2

Divergence | Bregman Information | MBI algorithm
Euclidean distance | Variance | Least squares (K-means)
Relative Entropy | Mutual Information | Maximum Entropy
Itakura-Saito | unnamed | Linde-Buzo-Gray
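Both definitions are directly computable. A small sketch with our own helper names, instantiated with φ(x) = ⟨x, x⟩ so that the divergence is the squared Euclidean distance and the Bregman Information is the variance:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x1, x2):
    """d_phi(x1, x2) = phi(x1) - phi(x2) - <x1 - x2, grad_phi(x2)>."""
    return phi(x1) - phi(x2) - (x1 - x2) @ grad_phi(x2)

def bregman_information(phi, grad_phi, X, nu=None):
    """I_phi(X) = sum_i nu_i d_phi(x_i, mu), with mu = E_nu[X]."""
    nu = np.full(len(X), 1.0 / len(X)) if nu is None else nu
    mu = nu @ X                                   # expectation under nu
    return sum(w * bregman_divergence(phi, grad_phi, x, mu) for w, x in zip(nu, X))

# phi(x) = <x, x>: the divergence is ||x1 - x2||^2, the information is the variance.
phi, grad_phi = (lambda x: x @ x), (lambda x: 2.0 * x)
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
print(bregman_information(phi, grad_phi, X))      # mean squared distance to mu
```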
13. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Other experiments
Sparse data and missing-valued data
Star/Galaxy data with missing values:
Dataset | SVC | BCC | K-means | # attr. affected | % obj. affected
MV5000 (25D) | 99.02% | 94.00% | 71.08% | 10 | 27.0%
MV10000 (25D) | 96.10% | 95.60% | 75.12% | 10 | 29.0%
AMV5000 (15D) | 91.76% | 79.46% | 74.90% | 6 | 30.0%
AMV10000 (15D) | 90.31% | 83.51% | 68.20% | 6 | 30.0%
Textual document data: sparsity and high "dimensionality":
Dataset | SVC | BCC | K-means
CLASSIC3 (3303D) | 99.80% | 100.00% | 49.80%
SCI3 (9456D) | failed | 89.39% | 39.15%
PORE (13821D) | failed | 82.68% | 45.91%
14. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Other experiments
Outliers
Dataset | SVC | Best BCC | K-means | # objects | # outliers
SynDECA 02 | 100.00% | 94.18% | 68.04% | 1000 | 112
SynDECA 03 | 100.00% | 49.00% | 39.47% | 10000 | 1270
[Figures: the SynDECA 02 and SynDECA 03 datasets]
15. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Conclusions and future work
Conclusions
Support Vector Clustering achieves the goals
Goals | Application domain
Robustness w.r.t. Missing-valued Data | Astrophysics
Robustness w.r.t. Sparse Data | Textual documents
Robustness w.r.t. High "dimensionality" | Textual documents
Robustness w.r.t. Noise/Outliers | Synthetic data
Other properties (verified across the whole experimental stage):
Automatic detection of the number of clusters
Application domain independence
Nonlinearly separable problems handling
Arbitrary-shaped clusters handling
Bregman Co-clustering achieves the same goals, but two problems still hold:
the estimation of the number of clusters
the handling of outliers
16. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Conclusions and future work
Contribution
SVC was made applicable in practice
Complexity reduction for the kernel width selection
Soft margin C parameter estimation
New effective stop criterion
Non-Gaussian kernels
The kernel width selection was shown to be applicable
to all normalized kernels
Exponential and Laplace kernels successfully used
Improved accuracy
Softening strategy for the kernel width selection
17. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II Conclusions and future work
Future work
Minimum Enclosing Bregman Ball (MEBB)
Generalization of the Minimum Enclosing Ball (MEB) problem and of the Bâdoiu-Clarkson (BC) algorithm to Bregman divergences (Nock and Nielsen, 2005). Since a Bregman divergence D_F is usually not symmetric, any c ∈ S and any r ≥ 0 actually define two dual Bregman balls:

B_{c,r} = \{ x \in \mathcal{X} : D_F(c, x) \le r \} \qquad B'_{c,r} = \{ x \in \mathcal{X} : D_F(x, c) \le r \}

Note that D_F(c, x) is always convex in c while D_F(x, c) is not always, so B_{c,r} is the ball of main interest.
[Figure 10.1: Examples of Bregman balls for d = 2 (blue dots are the centers): the two on the left are obtained by means of the Itakura-Saito distance, the middle one is a classic Euclidean (squared L2) ball, and the two on the right employ the Kullback-Leibler divergence (Nock and Nielsen, 2005, fig. 2).]
Core Vector Machines (CVM)
The CVMs reformulate the SVMs as a MEB problem; they make use of the BC algorithm.
MEBB + CVM = Bregman Vector Machines
New implications for vector machines: since the BC algorithm has been generalized to Bregman divergences, the research about vector machines (and therefore about the SVC) could have very interesting implications; we definitely intend to explore this way.
New implications for SVC: we wish to test such machines for the cluster description stage of the SVC (see section 6.12), and to adapt the cluster labeling algorithms to arbitrary Bregman divergences as well.
Improve and extend the SVC software
For the sake of accuracy, and in order to perform more robust comparisons with other clustering algorithms, an improved and extended software for Support Vector Clustering is needed; more stability and reliability is necessary.
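For reference, the Euclidean Bâdoiu-Clarkson iteration that the MEBB generalizes is only a few lines. A sketch of the standard (1 + ε)-approximation in the Euclidean case, not the Bregman variant:

```python
import numpy as np

def badoiu_clarkson_meb(X, eps=0.01):
    """(1 + eps)-approximate Euclidean Minimum Enclosing Ball.
    Start from any point; at step t move the center 1/(t + 1) of the
    way towards the farthest point. O(1/eps^2) iterations suffice."""
    c = X[0].astype(float).copy()
    for t in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
        far = X[np.argmax(((X - c) ** 2).sum(axis=1))]  # farthest point from c
        c += (far - c) / (t + 1)                         # shrinking step towards it
    radius = np.sqrt(((X - c) ** 2).sum(axis=1).max())
    return c, radius
```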
The End