6. Why Dimension Reduction?
• Mankind generates over 1,000 exabytes of
data each year; we are facing a data deluge
• Rapid development of data collection
technology
http://www.ivizsecurity.com
7. Why Dimension Reduction?
• Classical data mining methods were designed
for use in application domains where the data
has low dimension
• Machine Learning/Data mining methods may
not be effective for high-dimensional data…
--Curse of dimensionality: the number of observations needed
to estimate an arbitrary function with a given level of accuracy
grows exponentially with the number of dimensions
8. Solution:
• Dimension reduction aims at mapping the
data to a lower-dimensional space so that
Uninformative variance in the data is removed
A subspace in which the data resides is detected
• Dimensionality reduction facilitates
Noise removal
Classification, prediction
Visualization, e.g. projection of high-dimensional data onto 2D
Communication and storage of high-dimensional data
12. Techniques for Dimension Reduction—
feature extraction
• Feature extraction
--Assumption: The data resides in a low-
dimensional space
--Aim: find the low-dimensional
representation of the data points
--The new features are linear combinations of
the original features
14. Techniques for Dimension Reduction-
Feature selection
• Feature selection:
--Assumption: Among the available features,
only a certain subset is truly relevant (e.g. for
subsequent classification)
--Aim: select the subset of features that is
most informative.
15. Techniques for Dimension Reduction-
Feature selection
• Benefits:
--Improves subsequent mining performance (e.g.
classification accuracy)
--Speeds up subsequent computations
--Improves interpretability of the results.
16. Feature extraction/reduction
• Unsupervised Feature reduction: does not use
labels.
Principal Component Analysis (PCA)
Non-linear PCA
Multidimensional scaling
Manifold learning algorithms
Independent Component Analysis (ICA)
17. Feature extraction/reduction
• Supervised feature reduction: makes use of
labels
Linear Discriminant Analysis (LDA)
Supervised PCA
Canonical Correlation
Partial Least Squares (PLS)
Independent Component Analysis (ICA)
• Semi-supervised: uses labeled and unlabeled
data
18. PCA
• Converts a set of possibly correlated
variables into a set of linearly uncorrelated
ones, the principal components, where
The 1st PC accounts for as much variability in
the data as possible
The 2nd PC accounts for as much variability
in the data as possible under the constraint that
it is orthogonal to the 1st PC.
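A minimal numpy sketch of this idea (my illustration, not from the slides): center the data and take the leading right singular vectors as the principal components.

    import numpy as np

    def pca(X, m):
        # X: n x d data matrix; m: number of principal components to keep.
        Xc = X - X.mean(axis=0)                # center each feature
        # Right singular vectors of the centered data are the principal
        # components, ordered by the variance they explain.
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:m].T                           # d x m projection matrix
        return Xc @ W                          # n x m projected data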
19. LDA
• Task: Classification
• Goal: Find the linear combination of features
that best separates the data classes.
• Assumption: The data points of each class
come from a Gaussian distribution with the
same covariance but different means.
• Estimate the means and covariance and apply the
maximum likelihood principle.
20. LDA
• Linear discriminant function for class k:
delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
o Within-class scatter matrix: S_W = sum_k sum_{x_i in class k} (x_i - mu_k)(x_i - mu_k)^T
o Between-class scatter matrix: S_B = sum_k n_k (mu_k - mu)(mu_k - mu)^T
o The projection matrix W is formed by the leading eigenvectors of S_W^{-1} S_B
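A short numpy/scipy sketch of these quantities (an illustration assuming the standard definitions above, not the presenter's code):

    import numpy as np
    from scipy.linalg import eigh

    def lda_projection(X, y, m):
        # X: n x d data matrix, y: integer class labels, m: projected dimension.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        d = X.shape[1]
        mu = X.mean(axis=0)
        Sw = np.zeros((d, d))                  # within-class scatter
        Sb = np.zeros((d, d))                  # between-class scatter
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            Sw += (Xc - mu_c).T @ (Xc - mu_c)
            diff = (mu_c - mu).reshape(-1, 1)
            Sb += Xc.shape[0] * (diff @ diff.T)
        # Generalized eigenproblem Sb w = lambda * Sw w; keep the top eigenvectors.
        vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))   # small ridge for stability
        order = np.argsort(vals)[::-1]
        return vecs[:, order[:m]]              # d x m projection matrix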
21. LDA: Sample applications
• Classification of patients based on microarray
data
- Regularized linear discriminant analysis and its application in microarrays
Guo et al, 2007
• Marketing:
- Determine the factors which distinguish different types of
customers and products on the basis of surveys or other
forms of collected data
• Character recognition:
- Improve Handwritten Character Recognition Performance.
Ueki et al, 2008
23. Feature Selection
• Feature Selection can be viewed as a “discrete
version” of Feature Reduction:
• Only a subset of the features is kept
• No transformation is applied
• Main advantage compared to feature
reduction:
Ease of Interpretation!
Irrelevant yet possibly costly features need not be collected anymore.
24. Feature Selection: Filter Methods
• The features are selected regardless of the subsequent task at
hand
• Relies on general characteristics of the data
• Not tailored to any learning algorithm
• Fast and easy
• Pipeline: All Features -> Filter -> Subset of Features -> Predictive Model
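As a simple illustration of a filter (my example, not from the slides), one can score each feature by a task-independent criterion such as its variance and keep the highest-scoring ones:

    import numpy as np

    def variance_filter(X, num_keep):
        # Score each feature by its variance -- independent of any learning algorithm.
        scores = X.var(axis=0)
        keep = np.argsort(scores)[::-1][:num_keep]   # indices of the top features
        return X[:, keep], keep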
28. Limitation of traditional graph
embedding dimensionality reduction
• Most state-of-the-art graph embedding
dimensionality reduction methods require an
affinity graph constructed beforehand, which
makes their projection ability largely dependent
on the input graph.
29. Proposed method
• A novel graph embedding method for
unsupervised dimensionality reduction.
30. Graph Embedding Discriminative
Unsupervised Dimension Reduction
• Proposed dimensionality reduction method:
graph construction and dimensionality reduction
are performed together, assigning the adaptive
and optimal neighbors based on the projected
local distances.
32. Traditional Method
• The total distance between all pairs of projected data
points would be
sum_{i,j} ||W^T x_i - W^T x_j||^2 = 2n * tr(W^T X H X^T W)
Here H is the centering matrix, defined as H = I - (1/n) 1 1^T.
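A quick numerical check of this identity (illustrative only; X stores the data points as columns and W is taken to be the identity):

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 3, 5
    X = rng.random((d, n))                     # columns are data points
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    total = sum(np.sum((X[:, i] - X[:, j]) ** 2)
                for i in range(n) for j in range(n))
    print(np.isclose(total, 2 * n * np.trace(X @ H @ X.T)))   # True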
33. Traditional Method
• Traditionally, dimensionality reduction methods
solve a problem of the form
min_W sum_{i,j} ||W^T x_i - W^T x_j||^2 s_ij   (subject to suitable constraints on W)
which requires a graph (the affinities s_ij) to be constructed beforehand.
34. Traditional Method
• Traditional methods separate graph construction
from dimensionality reduction
• Thus the result is highly dependent on the input
graph.
• Traditional pipeline: Graph Construction -> Dimension Reduction
35. Proposed method
• Consider an affinity matrix S, whose entry s_ij
implies the probability of data point x_i in X
connecting with x_j as a neighbor.
• A larger probability should be assigned to a
pair of points with a smaller distance.
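One standard way to write this down (a sketch of the usual adaptive-neighbor formulation; the exact objective on the slide is not reproduced here):

    min_S   sum_{i,j} ||x_i - x_j||^2 s_ij
    s.t.    sum_j s_ij = 1,  s_ij >= 0   for every i

Solved as stated, this puts all of the probability mass on the single nearest neighbor of each point, which is exactly the trivial solution discussed next.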
37. Trivial Solution
• The above problem has a trivial solution: only
the nearest data point of x_i is assigned a
probability of 1,
• while all the others are assigned 0, that is to say,
they are not neighbors of x_i.
38. How to avoid the trivial solution
• To avoid the trivial solution, we add two
constraints on the graph to make its structure
clear.
1. The probability within a cluster should be nonzero,
while the probability between clusters should be zero.
2. The probabilities within a certain cluster should be
equally distributed.
39. How to accomplish it?
• Given S, suppose each node i is assigned
a function value f_i (the i-th row of F); then it can be
verified that
(1/2) sum_{i,j} ||f_i - f_j||^2 s_ij = tr(F^T L_S F)
40. How to accomplish it?
• where L_S = D_S - (S^T + S)/2 is the Laplacian matrix in
graph theory.
• Here we define the diagonal matrix D_S as the
degree matrix; its i-th diagonal element is
sum_j (s_ij + s_ji)/2
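A quick numerical check of this identity (an illustrative sketch using the symmetrized Laplacian defined above):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 6, 2
    S = rng.random((n, n))
    S /= S.sum(axis=1, keepdims=True)          # row-stochastic affinities
    F = rng.random((n, k))                     # a k-dimensional value per node

    A = (S + S.T) / 2                          # symmetrized affinities
    L = np.diag(A.sum(axis=1)) - A             # graph Laplacian L_S
    lhs = 0.5 * sum(S[i, j] * np.sum((F[i] - F[j]) ** 2)
                    for i in range(n) for j in range(n))
    print(np.isclose(lhs, np.trace(F.T @ L @ F)))   # True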
41. Theorem 1
• The Laplacian matrix L_S has an important
property when the probability matrix S is
nonnegative. The property is described as
follows:
Theorem 1: The multiplicity k of the
eigenvalue 0 of the Laplacian matrix L_S is
equal to the number of connected
components in the graph associated with S.
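A tiny numerical illustration of Theorem 1 (my example, not from the slides):

    import numpy as np

    # Block-diagonal affinities: two connected components of two nodes each.
    S = np.array([[0., 1., 0., 0.],
                  [1., 0., 0., 0.],
                  [0., 0., 0., 1.],
                  [0., 0., 1., 0.]])
    L = np.diag(S.sum(axis=1)) - S
    print(np.sum(np.isclose(np.linalg.eigvalsh(L), 0)))   # 2 zero eigenvalues -> 2 components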
42. Theorem 1
• If rank(L_S) = n - k, the graph possesses an ideal structure
which explicitly partitions the data points into
exactly k clusters according to the block-diagonal
structure of S.
• Thus the problem becomes the earlier objective with
the additional constraint rank(L_S) = n - k.
43. Optimization Algorithm for Solving the Problem
• The above problem seems very difficult to
solve, so next we propose an optimization
algorithm for it.
• Suppose sigma_i(L_S) denotes the i-th smallest eigenvalue
of L_S. It is easy to see that sigma_i(L_S) >= 0 since L_S is
positive semi-definite. So, for a large enough lambda,
the rank constraint can be enforced by adding the penalty
term lambda * sum_{i=1}^k sigma_i(L_S) to the objective.
45. Optimization Algorithm for Solving the Problem
• From the above equation, a large enough lambda
would guarantee that the k smallest
eigenvalues of L_S are zero and thus that the rank of
L_S is n - k.
• According to Ky Fan's Theorem (Fan,
1949), we have
sum_{i=1}^k sigma_i(L_S) = min_{F in R^{n x k}, F^T F = I} tr(F^T L_S F)
47. Alternating Optimization Method
• The first step is to fix W and S and solve for F. The
problem then becomes:
min_{F in R^{n x k}, F^T F = I} tr(F^T L_S F)
• The optimal solution for F is formed by the k
eigenvectors corresponding to the k smallest
eigenvalues of L_S.
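The F-update is therefore an ordinary symmetric eigenproblem; a minimal numpy sketch:

    import numpy as np

    def update_F(L_S, k):
        # Columns of F are the eigenvectors of the symmetric Laplacian L_S
        # belonging to its k smallest eigenvalues.
        eigvals, eigvecs = np.linalg.eigh(L_S)   # eigenvalues in ascending order
        return eigvecs[:, :k]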
49. Alternating Optimization Method
• We can use an iterative re-weighted method to
solve for W. The Lagrangian function of the above
problem is
• Taking the derivative w.r.t. W and setting it to zero, we
have:
50. Alternating Optimization Method
• The solution for W in the above problem is formed
by the m eigenvectors corresponding to the m
smallest eigenvalues of the matrix:
54. Summary for Algorithm
• Input: Data matrix X, number of clusters
k, reduced dimension m, a regularization parameter, and a large
enough lambda.
• Output: Projection matrix W and probability
matrix S with exactly k connected
components.
55. Summary for Algorithm
• Initialize S by the optimal solution to the
problem without the rank constraint on L_S
• while not converged do
• 1. Update L_S = D_S - (S^T + S)/2, where D_S is a
diagonal matrix with the i-th diagonal element
sum_j (s_ij + s_ji)/2
56. Summary for Algorithm
• 2. Update F, whose columns are the k
eigenvectors of L_S corresponding to the k
smallest eigenvalues.
• Update W, whose columns are the m
eigenvectors of the matrix in Eq. (15)
corresponding to the m smallest eigenvalues.
Update W iteratively until it converges.
• 3. For each i, update the i-th row of S by
solving problem (18).
• end while
58. Experiments on Synthetic Data
• The synthetic data in this experiment is a
randomly generated two-Gaussian matrix.
• We generate two clusters of data, each drawn
from a Gaussian distribution.
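A sketch of how such data can be generated (the exact means and covariances used in the experiment are not given in the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n_per_cluster = 100
    cluster1 = rng.multivariate_normal([0, 0], np.eye(2), size=n_per_cluster)
    cluster2 = rng.multivariate_normal([5, 5], np.eye(2), size=n_per_cluster)
    X = np.vstack([cluster1, cluster2])        # 200 x 2 data matrix
    y = np.repeat([0, 1], n_per_cluster)       # ground-truth cluster labels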
59. Experiments on Synthetic Data
• Our goal here is to find an effective projection
direction in which the two clusters could be
explicitly set apart.
• We compare our dimensionality reduction
method DUDR with two related methods: PCA
and LPP.
60. (a) Clusters far apart
• When the two clusters are far from each other, all
three methods can easily find a good
projection direction.
61. (b) Clusters relatively close
• However, as the distance between the two
clusters decreases, PCA becomes
ineffective.
62. (c) Clusters fairly close
• As the two clusters draw even closer, LPP also fails
to find a good projection direction, whereas the DUDR
method consistently works well in all cases.
63. Reason for this phenomenon
1. PCA focuses on the global structure, so once
the two clusters approach each other beyond a certain point, PCA is unable to
distinguish one cluster from the other and fails.
2. LPP pays more attention to the local structure, so it still
works when the two clusters are relatively close. But when the
distance becomes even smaller, LPP is no longer capable.
3. Our method, DUDR, places more emphasis on the
discriminative structure and thus retains its projection
ability throughout.
64. Experiments on Real Benchmark
Datasets
• 15 benchmark datasets:
• Five are shape-set data, six are from the
UCI Machine Learning Repository, and the other
four are image data sets.
• Ecoli, Pathbased, Aggregation, Compound, Breast Cancer,
Yeast, R15, Glass, Spiral, Abalone,
Movements, Jaffe, AR ImData, XM2VTS, Coil20
66. Experiments on Projection
• We evaluated our dimensionality reduction
method on the 5 high-dimensional benchmark
data sets: AR ImData, Movements,
Coil20, Jaffe, XM2VTS.
67. Experiments on Projection
• The comparison is based on clustering
experiments: we first learn the
projection matrix separately with these three
methods and then run K-means on the
projected data.
• For each method we repeat K-means 100
times with the same initialization and record
the best clustering result among the 100
runs.
68. Experiments on Projection
• For our method, DUDR, we set the parameter
lambda to be self-tuned as follows: in each iteration we
compute the number of zero eigenvalues of L_S; if
it is larger than k, we divide lambda by 2;
• if it is smaller than k, we multiply lambda by 2;
otherwise we stop the iteration.
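A minimal sketch of this self-tuning rule (the function name and tolerance are mine):

    import numpy as np

    def tune_lambda(lam, L_S, k, tol=1e-10):
        # Count the (numerically) zero eigenvalues of the current Laplacian.
        num_zero = int(np.sum(np.linalg.eigvalsh(L_S) < tol))
        if num_zero > k:
            return lam / 2, False    # too many connected components: shrink lambda
        if num_zero < k:
            return lam * 2, False    # too few connected components: grow lambda
        return lam, True             # exactly k components: stop iterating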
74. Experiments on Projection
• Evidently DUDR outperforms PCA and LPP
under different circumstances, and the
superiority is especially evident when the
number of projected dimensions is small.
75. Experiments on Projection
• The DUDR method is able to project the
original data to a subspace of dimension as
small as k-1, where k is the number of
clusters in the data set. Such a low-dimensional
subspace projected by our method even gains
an advantage over those obtained by PCA and
LPP with higher dimensions.
76. Experiments on Projection
• So with DUDR, we can project the data to a
much lower-dimensional space without
weakening clustering ability, which makes
the dimensionality reduction process more
efficient and effective.
77. Experiments on Clustering
• We evaluated the clustering ability of DUDR
on all 15 benchmark data sets and
compared it with the K-means, Ratio Cut,
Normalized Cut and NMF methods.
78. Experiments on Clustering
• In the clustering experiment, we set the number of clusters
to the ground-truth k of each data set and we set the
projected dimension in DUDR to k-1. As in
the previous subsection, for all the methods that need an
affinity matrix as input, like Ratio Cut, Normalized Cut
and NMF, the graph is constructed with the self-tuned
Gaussian method.
• For all the methods involving K-means, i.e. K-means,
Ratio Cut and Normalized Cut, we run K-means 100
times with the same initialization and record their
average performance, standard deviation and the
performance corresponding to the best K-means objective
function value. As for NMF and DUDR, we run them only once
and record the results.
79. Experiments on Clustering
• The evaluation is based on two widely used
clustering metrics: accuracy and NMI (normalized
mutual information).
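For reference, one common way to compute these metrics (an illustrative sketch: accuracy matches predicted clusters to labels with the Hungarian algorithm, and NMI comes from scikit-learn):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import normalized_mutual_info_score

    def clustering_accuracy(y_true, y_pred):
        # Find the best one-to-one match between predicted clusters and true
        # labels, then count the fraction of correctly assigned points.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        labels_true, labels_pred = np.unique(y_true), np.unique(y_pred)
        cost = np.zeros((len(labels_pred), len(labels_true)))
        for i, p in enumerate(labels_pred):
            for j, t in enumerate(labels_true):
                cost[i, j] = -np.sum((y_pred == p) & (y_true == t))
        rows, cols = linear_sum_assignment(cost)
        return -cost[rows, cols].sum() / len(y_true)

    # NMI is available directly:
    # nmi = normalized_mutual_info_score(y_true, y_pred)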
82. Experiments on Clustering
• The results summarized in Table 2 and Table 3 show
that DUDR outperforms the other related methods on
most of the benchmark data sets. In most cases DUDR
achieves equivalent or even better accuracy and NMI
with less time consumed, since K-means, Ratio Cut and
Normalized Cut need 100 runs while DUDR only
needs a few iterations;
• NMF requires a graph constructed beforehand but
DUDR does not. Besides, the clustering results of DUDR are
stable for a given setting, while the other methods are not
stable and depend heavily on the initialization.