Graph Embedding Discriminative
Unsupervised Dimensionality
Reduction
YUN LIU
What do we want computers to do
with our data?
• Find human faces
• How many people?
• What's the action?
• Action similarity
• …
Machine learning for image
classification
Learning algorithm
car
rabbit
Pixel1
Pixel2
Machine learning for image
classification
Feature
Representation
car
rabbit
Pixel1
Pixel2
Eyes
wheels
Learning
Algorithm
Outline
• Techniques for Dimension Reduction
-- Feature Extraction
-- Feature Selection
• Discriminative Unsupervised Dimensionality
Reduction
-- Algorithm
-- Experiment result
Why Dimension Reduction?
• Mankind generates over 1,000 exabytes of
data each year; we are facing a data deluge
• Rapid development of data collection
technology
http://www.ivizsecurity.com
Why Dimension Reduction?
• Classical data mining methods were designed
for use in application domains where the data
has low dimension
• Machine Learning/Data mining methods may
not be effective for high-dimensional data…
--Curse of dimensionality: the number of observations needed
to estimate an arbitrary function with a given level of accuracy
grows exponentially with the number of dimensions
Solution:
• Dimension reduction aims at mapping the
data to a lower-dimensional space so that
uninformative variance in the data is removed and
a subspace in which the data resides is detected
• Dimensionality reduction facilitates
Noise removal
Classification, prediction
Visualization, e.g. projection of high-dimensional data onto 2D
Communication and storage of high-dimensional data
Application
Customer relationship management
Face recognition
Image retrieval
Text mining
Biological data analysis (e.g. microarray data analysis, protein classification)
Handwritten digit recognition
Intrusion detection
Methods for Dimension Reduction
Dimension
Reduction
Feature
Extraction
Feature
Selection
Outline
• Techniques for Dimension Reduction
-- Feature Extraction
-- Feature Selection
• Discriminative Unsupervised Dimensionality
Reduction
-- Algorithm
-- Experiment result
Techniques for Dimension Reduction—
feature extraction
• Feature extraction
--Assumption: The data resides in a low-
dimensional space
--Aim: find the low-dimensional
representation of the data points
--The new features are linear combinations of
the original features
Techniques for Dimension Reduction—
feature extraction
• Benefits:
• Improves subsequent mining performance
(e.g. classification/prediction accuracy)
• Speeds up subsequent computations
Techniques for Dimension Reduction-
Feature selection
• Feature selection:
--Assumption: Among the available features,
only a certain subset is truly relevant (e.g. for
subsequent classification etc.)
--Aim: Selecting a subset of features that are
most informative.
Techniques for Dimension Reduction-
Feature selection
• Benefits:
--Improves subsequent mining performance (e.g.
classification accuracy)
--Speeds up subsequent computations
--Improves interpretability of the results.
Feature extraction/reduction
• Unsupervised Feature reduction: does not use
labels.
Principal Component Analysis (PCA)
Non-linear PCA
Multidimensional scaling
Manifold learning algorithms
Independent Component Analysis (ICA)
Feature extraction/reduction
• Supervised feature reduction: makes use of
labels
• Semi-supervised: uses labeled and unlabeled
data
Linear Discriminant Analysis (LDA)
Supervised PCA
Canonical Correlation
Partial Least Squares (PLS)
Independent Component Analysis (ICA)
PCA
• Converts a set of possibly correlated
variables into a set of linearly uncorrelated
ones: the principal components,
where
The 1st PC accounts for as much variability in
the data as possible
The 2nd PC accounts for as much variability
in the data as possible, with the constraint that
it should be orthogonal to the 1st PC.
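A minimal numpy sketch of the PCA idea described on this slide (illustrative code of ours, not from the deck):

```python
import numpy as np


def pca(X, m):
    """Project an n x d data matrix X onto its m leading principal components."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # sample covariance matrix (d x d)
    _, eigvecs = np.linalg.eigh(cov)            # eigh returns eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]                 # m mutually orthogonal directions of largest variance
    return Xc @ W                               # low-dimensional representation (n x m)
```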
LDA
• Task: Classification
• Goal: Find a linear combination of features
that best separates the data classes.
• Assumption: The data points of each class
come from a Gaussian distribution with the
same covariance but different means.
• Estimate means and covariances and apply the
maximum likelihood principle.
LDA
• Linear discriminant function for class k
o Within-class scatter matrix
o Between-class scatter matrix
o The projection matrix
(the standard forms are sketched below)
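The formulas on this slide were images; as a hedged sketch, the standard definitions usually denoted here are (μ_k and n_k are the mean and size of class k, μ the overall mean, y_i the label of x_i):

```latex
S_W = \sum_{k=1}^{K} \sum_{i:\,y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top,
\qquad
S_B = \sum_{k=1}^{K} n_k\,(\mu_k - \mu)(\mu_k - \mu)^\top,
\qquad
W^{*} = \arg\max_{W}\ \operatorname{tr}\!\big((W^\top S_W W)^{-1} W^\top S_B W\big).
```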
LDA: Sample applications
• Classification of patients based on microarray
data
- Regularized linear discriminant analysis and its application in microarrays
Guo et al, 2007
• Marketing:
- Determine the factors which distinguish different types of
customers and products on the basis of surveys or other
forms of collected data
• Character recognition:
- Improve Handwritten Character Recognition Performance.
Ueki et al, 2008
Outline
• Techniques for Dimension Reduction
-- Feature Extraction
-- Feature Selection
• Discriminative Unsupervised Dimensionality
Reduction
-- Algorithm
-- Experiment result
Feature Selection
• Feature Selection can be viewed as a “discrete
version” of Feature Reduction:
• Only a subset of the features are kept
• No transformation is applied
• Main advantage compared to feature
reduction:
Ease of Interpretation!
Irrelevant yet possibly costly features need not be collected anymore.
Feature Selection: Filter Methods
• The features are selected regardless of the subsequent task at
hand
• Relies on general characteristics of the data
• Not tailored to any learning algorithm
• Fast and easy
Pipeline: All Features → Filter → Subset of Features → Predictive Model
(a minimal example follows below)
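A minimal illustration of a filter method using scikit-learn's VarianceThreshold, an unsupervised filter that relies only on general characteristics of the data (the toy data and the threshold value are our own assumptions):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # 100 samples, 4 features
X[:, 2] *= 0.001                               # make the 3rd feature nearly constant

selector = VarianceThreshold(threshold=0.01)   # drop features whose variance is below 0.01
X_reduced = selector.fit_transform(X)          # features are kept or dropped, never transformed
print(selector.get_support())                  # boolean mask of selected features: [True, True, False, True]
```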
Feature Selection Algorithm
• Filter algorithms
-- Feature ranking algorithms (e.g. Relief)
-- Subset search algorithms (e.g. Focus)
• Wrapper algorithms
-- Feature ranking algorithms (e.g. SVM)
-- Subset search algorithms
Applications of Feature Selection
Applications
Customer relationship management
Text categorization
Image retrieval
Gene expression microarray data analysis
Intrusion detection
Outline
• Techniques for Dimension Reduction
-- Feature Extraction
-- Feature Selection
• Discriminative Unsupervised Dimensionality
Reduction
-- Algorithm
-- Experiment result
Limitation of traditional graph
embedding dimensionality reduction
Most of the state-of-the-art graph embedding
dimensionality reduction methods require an
affinity graph constructed beforehand, which
makes their projection ability dependent on
the input graph to a large extent.
Proposed method
• A novel graph embedding method for
unsupervised dimensionality reduction.
Graph Embedding Discriminative
Unsupervised Dimension Reduction
Proposed dimensionality
reduction method:
perform dimensionality reduction and graph
construction jointly, assigning adaptive and
optimal neighbors based on the projected
local distances
Notation
• The Frobenius norm of a matrix M
• The ℓp-norm of a vector v, for p > 0
(both standard definitions are sketched below)
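A sketch of these two standard definitions, assuming the usual notation M ∈ R^{n×m} and v ∈ R^{n}:

```latex
\|M\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} m_{ij}^2}
        = \sqrt{\operatorname{tr}(M^\top M)},
\qquad
\|v\|_p = \Big(\sum_{i=1}^{n} |v_i|^p\Big)^{1/p}, \quad p > 0.
```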
Traditional Method
• The total distance between all pairs of projected data
points can be written compactly using the centering
matrix H (a sketch of the standard identity follows below).
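A sketch of the identity usually behind this slide, assuming the data are stacked as columns of X = [x_1, ..., x_n] ∈ R^{d×n} and W ∈ R^{d×m} is the projection matrix:

```latex
\sum_{i,j=1}^{n} \big\|W^\top x_i - W^\top x_j\big\|_2^2
  \;=\; 2n\,\operatorname{tr}\!\big(W^\top X H X^\top W\big),
\qquad
H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top .
```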
Traditional Method
• Traditionally, dimensionality reduction methods
solve a problem of the following form, which requires a
graph constructed beforehand (a typical instance is sketched below):
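A typical instance of such a problem is the LPP-style formulation, given here as an illustrative assumption rather than the exact objective on the slide: with a pre-constructed affinity graph S, its Laplacian L_S, and its degree matrix D_S,

```latex
\min_{W}\ \operatorname{tr}\!\big(W^\top X L_S X^\top W\big)
\quad \text{s.t.}\quad W^\top X D_S X^\top W = I .
```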
Traditional Method
• Traditional methods separate the dimensionality
reduction from the graph construction:
graph construction → dimension reduction
• Thus the result is highly dependent on the input
graph.
Proposed method
• Consider an affinity matrix S, whose entries give
the probability of each data point in X
connecting with its neighbors.
• A larger probability should be assigned to a
pair with a smaller distance.
Proposed Method
• Construct the graph in the projected space by
solving the problem sketched below:
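A plausible reconstruction of this graph-construction step, in the spirit of adaptive-neighbor methods (the exact form on the slide may differ); later slides add constraints and a regularizer to rule out the trivial solution of this problem:

```latex
\min_{S}\ \sum_{i,j=1}^{n} \big\|W^\top x_i - W^\top x_j\big\|_2^2\, s_{ij}
\quad \text{s.t.}\quad \forall i:\; s_i^\top \mathbf{1} = 1,\; s_{ij} \ge 0 .
```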
Trivial Solution
• The above problem has a trivial solution in which only
the nearest data point of each point is assigned a
probability of 1,
• while all others are assigned 0, that is to say,
they are not neighbors of that point.
How to avoid trivial solution
• To avoid the trivial solution, we add two
constraints on the graph to make its structure
clear.
1. The probability within a cluster should be nonzero
while the probability between clusters should be zero.
2. The probability within a certain cluster should be
equally distributed.
How to accomplish it?
• Given S, suppose each node i is assigned
a function value f_i; then the identity
sketched below can be verified.
How to accomplish it?
• Here L_S denotes the Laplacian matrix from
graph theory.
• We also define a diagonal degree
matrix D_S, whose i-th diagonal element is
given below.
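A sketch of the standard identity and definitions referred to above, assuming f = (f_1, ..., f_n)^T collects the node values and using the usual symmetrized convention:

```latex
\sum_{i,j=1}^{n} \tfrac{1}{2}\,(f_i - f_j)^2\, s_{ij} \;=\; f^\top L_S\, f,
\qquad
L_S = D_S - \frac{S + S^\top}{2},
\qquad
(D_S)_{ii} = \sum_{j=1}^{n} \frac{s_{ij} + s_{ji}}{2}.
```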
Theorem 1
• The Laplacian matrix has an important
property when the probability matrix S is
nonnegative. The property is described as
follows:
Theorem 1 The multiplicity k of the
eigenvalue 0 of the Laplacian matrix is
equal to the number of connected
components in the graph associated with S.
Theorem 1
• If the multiplicity of the eigenvalue 0 is exactly k,
the graph possesses an ideal structure which
explicitly partitions the data points into
exactly k clusters according to the block-diagonal
structure.
• Thus the problem becomes the rank-constrained one sketched below:
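A hedged sketch of the resulting joint problem, combining the graph-construction objective with a Frobenius-norm regularizer and the rank constraint implied by Theorem 1 (the symbol α and the exact constraint on W are our assumptions):

```latex
\min_{W,\,S}\ \sum_{i,j=1}^{n} \big\|W^\top x_i - W^\top x_j\big\|_2^2\, s_{ij}
            \;+\; \alpha\,\|S\|_F^2
\quad \text{s.t.}\quad
\forall i:\ s_i^\top \mathbf{1} = 1,\ s_{ij} \ge 0,\quad
W^\top W = I,\quad
\operatorname{rank}(L_S) = n - k .
```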
Optimization Algorithm Solving
Problem
• The above problem seems very difficult to
solve, so next we propose an optimization
algorithm for it.
• Suppose σ_i(L_S) is the i-th smallest eigenvalue
of L_S. It is easy to see that σ_i(L_S) ≥ 0, since L_S is
positive semi-definite. So for a large enough λ…
Optimization Algorithm Solving
Problem
• For a large enough λ, the above problem would be equivalent to a
penalized form in which the rank constraint is replaced by a penalty,
proportional to λ, on the sum of the k smallest eigenvalues of L_S.
Optimization Algorithm Solving
Problem
• From the above equation, a large enough λ
would guarantee that the k smallest
eigenvalues of L_S are zero, and thus the rank of
L_S is n − k.
• According to Ky Fan's theorem (Fan,
1949), we have the identity sketched below:
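The Ky Fan identity referenced here, which holds for any symmetric positive semi-definite L_S (σ_i denotes the i-th smallest eigenvalue):

```latex
\sum_{i=1}^{k} \sigma_i(L_S)
  \;=\; \min_{F \in \mathbb{R}^{n \times k},\; F^\top F = I}
        \operatorname{tr}\!\big(F^\top L_S F\big).
```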
Optimization Algorithm Solving
Problem
• So we turn to solve:
Alternative optimization Method
• The first step is fixing W and S and solving for F.
Then the problem becomes:
• The optimal solution for F is formed by the k
eigenvectors corresponding to the k smallest
eigenvalues of L_S.
Alternative optimization Method
• The second step is fixing S and F and solving for W.
Then the problem becomes:
Alternative optimization Method
• We can use an iteratively re-weighted method to
solve for W. The Lagrangian function of the above
problem is
• Taking the derivative w.r.t. W and setting it to zero, we
have:
Alternative optimization Method
• The solution for W in the above problem is formed
by the m eigenvectors corresponding to the m
smallest eigenvalues of the matrix:
Alternative optimization Method
• The third step is fixing W and F and solving for S. Then
the problem becomes:
Alternative optimization Method
The above problem can be solved separately for
each row s_i of S, where the per-row distances
combine the projected distances and the distances between the rows of F.
Alternative optimization Method
• The above problem can be rewritten in the vector form
sketched below; then we can update each
row s_i accordingly.
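A hedged sketch of the per-row vector form (d_i is an assumed vector collecting the combined distances for row i; α and λ are the parameters introduced earlier, and the exact scaling on the slide may differ):

```latex
\min_{s_i^\top \mathbf{1} = 1,\; s_{ij} \ge 0}\
\Big\| s_i + \frac{d_i}{2\alpha} \Big\|_2^2,
\qquad
d_{ij} = \big\|W^\top x_i - W^\top x_j\big\|_2^2 + \lambda\,\big\|f_i - f_j\big\|_2^2 ,
```

i.e. a Euclidean projection of −d_i/(2α) onto the probability simplex, which admits a closed-form solution.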
Summary for Algorithm
• Input: Data matrix X, number of clusters
k, reduced dimension m, regularization parameter α,
and a large enough λ.
• Output: Projection matrix W and probability
matrix S with exactly k connected
components.
Summary for Algorithm
• Initialize S by the optimal solution to the
problem without the rank constraint on L_S.
• while not converged do
• 1. Update the Laplacian L_S = D_S − (S^T + S)/2, where D_S is a
diagonal matrix with the i-th diagonal element
as defined earlier.
Summary for Algorithm
• 2. Update F, whose columns are the k
eigenvectors of L_S corresponding to the k
smallest eigenvalues.
• Update W, whose columns are the m
eigenvectors of the matrix in Eq. (15)
corresponding to the m smallest eigenvalues.
Update W iteratively until it converges.
• 3. For each i, update the i-th row of S by
solving problem (18).
• end while
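A minimal, self-contained Python sketch of this alternating loop, under the assumptions spelled out in the comments; the W-step is simplified to a single eigen-decomposition rather than the paper's iteratively re-weighted update, and all symbol names are ours:

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {s : s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)


def pairwise_sq_dists(Y):
    """Squared Euclidean distances between the columns of Y."""
    G = Y.T @ Y
    sq = np.diag(G)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0)


def dudr(X, k, m, alpha=1.0, lam=1.0, n_iter=30):
    """Hypothetical DUDR-style sketch. X is d x n, k clusters, m projected dimensions."""
    d, n = X.shape
    # Initialize S from distances in the original space (no rank constraint yet).
    D0 = pairwise_sq_dists(X)
    S = np.vstack([project_simplex(-D0[i] / (2 * alpha)) for i in range(n)])

    for _ in range(n_iter):
        # Symmetrized Laplacian of the current graph.
        P = (S + S.T) / 2
        L = np.diag(P.sum(axis=1)) - P

        # F-step: k eigenvectors of L_S for its k smallest eigenvalues.
        _, vecs = np.linalg.eigh(L)
        F = vecs[:, :k]

        # W-step (simplified): m eigenvectors of X L_S X^T for its m smallest eigenvalues.
        _, wvecs = np.linalg.eigh(X @ L @ X.T)
        W = wvecs[:, :m]

        # S-step: row-wise simplex projection of the combined distances.
        Dy = pairwise_sq_dists(W.T @ X)        # distances in the projected space
        Df = pairwise_sq_dists(F.T)            # distances between the rows of F
        Dtot = Dy + lam * Df
        S = np.vstack([project_simplex(-Dtot[i] / (2 * alpha)) for i in range(n)])

    return W, S
```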
Experimental Results
For simplicity, we denote our clustering method
as DUDR (Discriminative Unsupervised
Dimensionality Reduction).
Experiments on Synthetic Data
• The synthetic data in this experiment is a
randomly generated two-Gaussian data set.
• We generate two clusters of data, each of which
obeys a Gaussian distribution.
Experiments on Synthetic Data
• Our goal here is to find an effective projection
direction in which the two clusters could be
explicitly set apart.
• We compare our dimensionality reduction
method DUDR with two related methods, PCA
and LPP.
(a) Clusters far apart
• When these two clusters are far from each other, all
three methods can easily find a good
projection direction.
(b) Clusters relatively close
• However, as the distance between these two
clusters decreases, PCA becomes
incompetent.
(c) Clusters fairly close
• As the two clusters draw closer, LPP also loses its way
to achieve the projection goal, whereas the DUDR
method consistently works well in all cases.
Reason for this phenomenon
1. PCA is a method focused on the global structure, so when
the two clusters approach each other beyond a certain extent, PCA is unable
to distinguish one cluster from the other and thus fails immediately.
2. As for LPP, it pays more attention to the local structure, and thus
works well when the two clusters are relatively close. But when the
distance becomes even smaller, LPP is not capable anymore.
3. Our method, DUDR, lays more emphasis on the
discriminative structure and is thus able to keep its projection
ability all the time.
Experiments on Real Benchmark
Datasets
• 15 benchmark datasets:
• Five are shape-set data, six are data sets from the
UCI Machine Learning Repository, and the other
four are image data sets:
Ecoli, Pathbased, Aggregation, Compound, Breast Cancer,
Yeast, R15, Glass, Spiral, Abalone,
Movements, Jaffe, AR ImData, XM2VTS, Coil20
Descriptions of these 15 datasets
Experiments on Projection
• We evaluated our dimensionality reduction
method on the 5 benchmark data sets with
high dimensions: AR ImData, Movements,
Coil20, Jaffe, XM2VTS.
Experiments on Projection
• The comparison is based on the clustering
experiments, where we first learn the
projection matrix separately with these three
methods and then run K-means on the
projected data.
• For each method we repeat K-means 100
times with the same initialization and keep a
record of the best clustering among the 100
runs.
Experiments on Projection
• For our method, DUDR, we set the parameter
λ to be self-tuned as follows: in each iteration we
compute the number of zero eigenvalues of L_S; if
it is larger than k, we divide λ by 2;
• if it is smaller than k, we multiply λ by 2;
otherwise we stop the iteration (a sketch follows below).
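A minimal sketch of this self-tuning rule, assuming L is the current symmetrized Laplacian and lam the current value of λ (function and variable names are ours):

```python
import numpy as np


def self_tune_lambda(L, k, lam, tol=1e-10):
    """Adjust lambda based on the number of (numerically) zero eigenvalues of L."""
    eigvals = np.linalg.eigvalsh(L)            # eigenvalues of the Laplacian, ascending
    n_zero = int(np.sum(eigvals < tol))        # zero eigenvalues = connected components
    if n_zero > k:
        return lam / 2, False                  # too many components: weaken the rank penalty
    if n_zero < k:
        return lam * 2, False                  # too few components: strengthen it
    return lam, True                           # exactly k components: stop iterating
```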
Experiments on Projection
[Figure slides: projection results on the five high-dimensional data sets]
Experiments on Projection
• Apparently DUDR outperforms PCA and LPP
under different circumstances, and the
superiority is especially evident when the
number of projected dimensions is small.
Experiments on Projection
• The DUDR method is able to project the
original data to a subspace with as few as
k−1 dimensions, where k is the number of
clusters in the data set. The low-dimensional
subspace projected by our method even gains
an advantage over those obtained by PCA and
LPP with higher dimensions.
Experiments on Projection
• So with DUDR, we can project the data to a
much lower-dimensional space without
weakening the clustering ability, which makes
the dimensionality reduction process more
efficient and effective.
Experiments on Clustering
• We evaluated the clustering ability of DUDR
on all the 15 benchmark data sets and
compared with K-means, Ratio Cut,
Normalized Cut and NMF methods.
Experiments on Clustering
• In the clustering experiment, we set the number of clusters
to the ground-truth k in each data set and we set the
projected dimension in DUDR to k−1. As in
the previous subsection, for all the methods that need an
affinity matrix as input, like Ratio Cut, Normalized Cut
and NMF, the graph is constructed with the self-tuning
Gaussian method.
• For all the methods involving K-means, including K-means,
Ratio Cut and Normalized Cut, we run K-means 100
times with the same initialization and write down their
average performance, standard deviation and the
performance corresponding to the best K-means objective
function value. As for NMF and DUDR, we run only once
and record the results.
Experiments on Clustering
• The evaluation is based on two widely used
clustering metrics: accuracy, NMI (normalized
mutual information).
Experiments on Clustering
[Table slides: clustering accuracy and NMI results (Tables 2 and 3)]
Experiments on Clustering
• The results summarized in Table 2 and Table 3 show
that DUDR outperforms the other related methods on
most of the benchmark data sets. In most cases DUDR
achieves equivalent or even better accuracy and NMI
with less time consumed, since K-means, Ratio Cut and
Normalized Cut need 100 runs but DUDR only
needs a few iterations;
• NMF requires a graph constructed beforehand but
DUDR doesn't. Besides, the clustering results of DUDR are
stable in a given setting, while the other methods are not
stable and depend heavily on the initialization.
Thanks
