Dimensionality Reduction
Group 26
Akash Baguant, Johnathan Mei, Rowan Pritchett, Moshe Steinberg
Supervised by: Dr Badr Missaoui
Abstract
High dimensional data is becoming increasingly prevalent in modern society. In our
project, we will discuss the reasons for reducing the dimensionality of data, as well as
outline and compare some methods used in this domain. Additionally, we explore
clustering, and apply dimensionality reduction to image processing and financial data
analysis.
M2R Group Project
Department of Mathematics
Imperial College London
15 June 2016
Contents
1 Introduction
2 Linear Methods
  2.1 Principal Component Analysis
    2.1.1 Procedure
    2.1.2 PCA for Image Processing
  2.2 Multi-Dimensional Scaling
    2.2.1 Motivating Example
    2.2.2 Classical Multi-Dimensional Scaling
    2.2.3 Procedure
    2.2.4 Application to Dimensionality Reduction
    2.2.5 Extensions to the Classical MDS
  2.3 Limitations of Linear Techniques
3 Non-Linear Methods
  3.1 Locally Linear Embedding
    3.1.1 Procedure
  3.2 Diffusion Maps
    3.2.1 Theory
    3.2.2 Procedure
4 Comparison of PCA and Diffusion Maps
  4.1 Linear Dataset
  4.2 Non-Linear Dataset
5 Clustering and dimensionality reduction applications
  5.1 Clustering
    5.1.1 Theory
    5.1.2 Procedure
  5.2 Use of Diffusion Maps in Clustering
  5.3 Application to Image Processing
6 Application to financial data
7 Conclusion
1 Introduction
An ever-increasing amount of data is being collected in fields as diverse as finance, consumer trend
analysis, election analysis and medicine. Moreover, this data often has a very large number of variables.
Dimensionality reduction is used to make this data more tractable. The goal of dimensionality reduction
is to take data described by a large number of variables and obtain a description of the data in terms of
a much smaller number of variables. [1]
There are two approaches to dimensionality reduction:
• Feature Selection
Data is described in terms of a subset of the original variables. Redundant and irrelevant variables
are discarded.
• Feature Extraction
Data is transformed from a high-dimensional space to a lower-dimensional space. For example,
Principal Component Analysis and Multidimensional Scaling are linear transformations on the
data. Diffusion Maps and Locally Linear Embedding are non-linear transformations.
Some algorithms such as Independent Component Analysis aim to find unobserved variables that
give rise to the data. [2]
The data set expressed in the lower dimension should preserve the properties and structure of the data
set in the original dimension. For example, the dimensionality reduction method might be designed to
preserve the pairwise distances between data points or to preserve the variance of the data set.
In this project, we will only consider Feature Extraction.
There are a few reasons to consider dimensionality reduction:
• Computational Motivation
Lower dimensional data takes up less storage. Processing of the lower dimensional data set
demands less computational power and thus takes less time.
• Data-Visualization and Interpretation
High dimensional data is difficult to visualize and understand geometrically. Reduction to 2-D or
3-D allows the data to be analyzed geometrically, for example one can spot clusters and patterns in
the data.
• Statistical Motivation
– In high dimensions, the so-called ‘curse of dimensionality’ occurs.
The number of data points is very small compared to the ‘volume’ of the space. In such a
situation, algorithms (for instance search algorithms) slow down and might even fail
completely.
– In high dimensions, many surprising and counter-intuitive geometric results are seen. This
makes data analysis in high dimensions harder.
– Dimensionality reduction prevents overfitting and hence makes the models more effective and
applicable to a wider range of problems.
– Dimensionality reduction eliminates collinearity and hence improves the performance of
models.
For the reasons above, dimensionality reduction is also very useful in machine-learning. [3]
This project will outline four dimensionality reduction techniques: Principal Component Analysis
(PCA), Multidimensional Scaling (MDS), Locally Linear Embedding (LLE) and Diffusion Maps. We will
then compare linear and non-linear methods with some examples and apply some of the methods to
practical situations.
Throughout this report, we will assume that we have n data points, each of dimension p. We aim to
express these points in m dimensions, where m < p.
2 Linear Methods
2.1 Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction method which aims to find new
uncorrelated variables which maximize variance in the data. This involves solving the related
eigenvector/value problem of the covariance matrix of the data. PCA returns eigenvectors as the
principal components since maximizing the variance of the projections along the eigenvectors is
equivalent to minimizing its residual sum of squares [4]. The first principal component is the direction
with the greatest variability, and the second principal component has the largest variance among all
directions orthogonal to the first principal component, and so on.
PCA can be thought of as an operation that reveals the internal structure of a linear dataset, by
providing the major features/directions in the data. An orthogonal transformation is performed from a
p-dimensional space to an m-dimensional space, where m < p. The resulting set of m variables are
uncorrelated, and can be expressed as linear combinations of each of the p original, correlated variables.
Also, PCA is a good tool to remove random noise in a data set, as only the eigenvectors which offer the
greatest variability are taken into account, and the projection residuals of the data points are minimized.
(a) 2D Data Set with its principal components (b) Projection onto the first eigenvector
Figure 1: PCA on 2D data set ([5] p.3)
In Figure 1a, the data points are approximately in the shape of an ellipse. Since this is a 2-D dataset,
there will be 2 principal components. It is clear from the images above that the major axis of the ellipse
is the eigenvector that offers the greatest variability, and thus, it is the first principal component. The
second principal component, being orthogonal to the first, will be the minor axis. Taking only the
first principal component and projecting the data points onto it, we can see that a straight line is
obtained (as seen in Figure 1b). Thus, the dimensionality of the dataset is now 1, and it is expressed in
terms of this eigenvector.
2.1.1 Procedure
1. Suppose we have a p dimensional set of data, and n observations. We therefore intend to find the
first m principal components, where m < p. Place the data in an n × p matrix, called X. Then, let
\tilde{y} \in \mathbb{R}^p be the vector that contains the means of the columns of X, i.e.
y_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}
2. Subtract the mean from each column, to obtain a matrix A
A = X - \tilde{h}\tilde{y}^T
where \tilde{h} is an n × 1 vector of ones. This is done to center the data around the origin. The matrix A
should take the form
A = \begin{pmatrix} x_{11} - y_1 & \cdots & x_{1p} - y_p \\ \vdots & & \vdots \\ x_{n1} - y_1 & \cdots & x_{np} - y_p \end{pmatrix}
3. Calculate the covariance matrix of the data, Σ. This is done by taking the inner product of A with
itself. It is then divided by n − 1 to account for the bias of the variance in a random sample.
\Sigma = \frac{1}{n-1} A^T A
4. Find the eigenvalues and eigenvectors of Σ. Form the matrix P of eigenvectors and its associated
diagonal matrix D such that
\Sigma = P D P^T
Since Σ is a symmetric, positive semi-definite p × p matrix, it has p real, non-negative eigenvalues and
p linearly independent (indeed orthonormal) eigenvectors, meaning that such matrices P and D exist.
Rearrange the entries of D and P so that the eigenvalues in D are in decreasing order. It is also
important that the eigenvectors in P are of unit length.
5. Pick the first m columns of P as the principal components of the data. These should be the
eigenvectors which have the largest eigenvalues of Σ. To avoid too much loss of information, m is
appropriately chosen so that
\frac{\sum_{i=1}^{m} \lambda_i}{\sum_{j=1}^{p} \lambda_j} \geq \pi
where λ1, λ2, . . . , λp are the eigenvalues of Σ in order of decreasing size. π is the minimum
proportion of total variance that we would like to account for, and is a subjective threshold to
decide on the number of principal components, m, to be retained. Normally, greater weight is
placed on the components with the greatest variability, but there are circumstances in which the
last few may be of interest, such as in outlier detection or some applications of image analysis [6].
6. The final step of the PCA is to perform the change of coordinates, such that the orthogonal
principal components chosen form the axes of the new dataset. We form the matrix \hat{X} as follows:
\hat{X} = \hat{P}^T A
where the columns of \hat{P} are the first m eigenvectors of P. This gives us the data in terms of the
principal components that we have chosen. [7]
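For concreteness, the six steps above can be written as a short numerical sketch. The following Python/NumPy code is only an illustration of the procedure described in this section; the function name, the toy data and the choice of threshold π = 0.95 are our own, not part of the report.

import numpy as np

def pca(X, pi=0.95):
    # Steps 1-2: center the data by subtracting the column means
    n, p = X.shape
    y = X.mean(axis=0)
    A = X - y
    # Step 3: sample covariance matrix
    Sigma = A.T @ A / (n - 1)
    # Step 4: eigendecomposition of the symmetric matrix Sigma
    eigvals, P = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]          # eigenvalues in decreasing order
    eigvals, P = eigvals[order], P[:, order]
    # Step 5: smallest m whose eigenvalues account for a proportion >= pi of the variance
    m = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), pi) + 1)
    P_hat = P[:, :m]
    # Step 6: change of coordinates; rows of X_hat are the reduced data points
    # (the transpose of P_hat^T A)
    X_hat = A @ P_hat
    return X_hat, P_hat, eigvals

# toy usage: 100 points in 3 dimensions lying close to a 2-dimensional plane
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(100, 3))
X_reduced, components, eigvals = pca(X)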
2.1.2 PCA for Image Processing
Image processing is one of the most important uses of PCA at the moment. Suppose we have 50 images,
each of dimension 200,000. To perform the PCA, each pixel in each image is converted into a number
from 0 to 255, representing the intensity on the greyscale. Then, similar to the algorithm explained
above, images of large dimensions are broken down, and a set of eigenvectors is obtained. When
converted back to images, these vectors are known as eigenfaces or eigenimages. These are just the
features on the images that give the greatest variability in the set of images, for example, the eyes or the
shape of the faces. Once this is done, each individual image is expressed as a weighted sum of these
eigenfaces and reconstructed. [8] This image processing feature is then used for facial recognition and
verification software, commonly used in surveillance as well as biometric security. [9]
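As a rough sketch of the idea (and not of the specific software referred to above), the code below treats each greyscale image as one row of a data matrix, computes eigenimages via the small n × n matrix AAᵀ, and reconstructs every image as the mean image plus a weighted sum of eigenimages. The array images and the choice of 20 components are hypothetical.

import numpy as np

def eigenimages(images, m=20):
    # images: hypothetical array of shape (n, height, width) with greyscale values 0-255
    n = images.shape[0]
    X = images.reshape(n, -1).astype(float)       # one flattened image per row
    mean_image = X.mean(axis=0)
    A = X - mean_image
    # eigenvectors of the n x n matrix A A^T give the eigenimages cheaply when n << p
    eigvals, V = np.linalg.eigh(A @ A.T)
    V = V[:, ::-1][:, :m]                         # keep the m largest eigenvalues
    U = A.T @ V                                   # corresponding p-dimensional eigenimages
    U /= np.linalg.norm(U, axis=0)                # normalise each eigenimage to unit length
    weights = A @ U                               # each image expressed as m weights
    reconstructions = mean_image + weights @ U.T  # weighted sums of the eigenimages
    return U, weights, reconstructions.reshape(images.shape)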
2.2 Multi-Dimensional Scaling
‘Multidimensional scaling (MDS) is a method that represents measurements of similarity (or
dissimilarity) among pairs of objects as distances between points of a low-dimensional multidimensional
space.’ ([10], p.3)
2.2.1 Motivating Example
Figure 2 represents the flying distances between 10 American Cities. We attempt to find 2-Dimensional
coordinates of the cities. (Figure 3)
Figure 2: Flying Mileages between 10 American cities [11]
Figure 3: Classical MDS of flying mileages between 10 American cities [11]
2.2.2 Classical Multi-Dimensional Scaling
Goal: Suppose we know that we have n p-dimensional data points. Given only the distance matrix
between the data points, we determine these data points.
That is, given the distance matrix
D = \begin{pmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & & \vdots \\ d_{n1} & \cdots & d_{nn} \end{pmatrix}
where d_{ij} is the Euclidean distance between x_i and x_j. We then attempt to find the matrix X:
X = \begin{pmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_n \end{pmatrix}
where each \tilde{x}_i is a p-dimensional data point.
2.2.3 Procedure
Write T = XX^T.
d_{ij}^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j = T_{ii} + T_{jj} - 2 T_{ij}
This can be rearranged to obtain
T_{ij} = -\frac{1}{2}\left[ d_{ij}^2 - T_{ii} - T_{jj} \right] = -\frac{1}{2}\left[ d_{ij}^2 - d_{i\cdot}^2 - d_{\cdot j}^2 + d_{\cdot\cdot}^2 \right]
where
d_{i\cdot}^2 = \frac{1}{n}\sum_{j=1}^{n} d_{ij}^2, \qquad d_{\cdot j}^2 = \frac{1}{n}\sum_{i=1}^{n} d_{ij}^2, \qquad d_{\cdot\cdot}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2
This is equivalent to writing
T = -\frac{1}{2} J A J, \quad \text{where } A_{ij} = d_{ij}^2, \quad J = I - \frac{1}{n}\tilde{h}\tilde{h}^T
and \tilde{h} is an n × 1 vector of ones.
T = XX^T
=⇒ T is symmetric and positive semi-definite
=⇒ T is diagonalizable
=⇒ T = U\Lambda U^T
Also, T has non-negative eigenvalues, and rank T = rank(XX^T) = rank X = p. This means that T has
p positive eigenvalues and n − p eigenvalues identically equal to zero.
(Here we are assuming that p < n.)
T = U\Lambda U^T = (U\Lambda^{1/2})\Lambda^{1/2}U^T = (U\Lambda^{1/2})(U\Lambda^{1/2})^T
Then X = U\Lambda^{1/2}. Since there are n − p eigenvalues equal to 0, X = U'\Lambda'^{1/2}, where
U' = \begin{pmatrix} | & & | \\ \tilde{v}_1 & \cdots & \tilde{v}_p \\ | & & | \end{pmatrix}, \qquad \Lambda' = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{pmatrix}
where \tilde{v}_1, \ldots, \tilde{v}_p are the eigenvectors of T corresponding to the non-zero eigenvalues \lambda_1, \ldots, \lambda_p. [2] [12]
2.2.4 Application to Dimensionality Reduction
Given the matrix X, containing n data points in p dimensions,
X = \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \vdots \\ \tilde{x}_n \end{pmatrix} \quad (n \times p)
where \tilde{x}_i are the n data points. We compute the n × n distance matrix:
D = \begin{pmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & & \vdots \\ d_{n1} & \cdots & d_{nn} \end{pmatrix}
with d_{ij} the distance between the ith and jth data points.
From the above procedure, X = U'\Lambda'^{1/2}. We can order the eigenvalues of \Lambda' in decreasing order to
obtain \bar{\Lambda} and rearrange the eigenvectors in U' accordingly to obtain \bar{U}. The matrix \bar{X} = \bar{U}\bar{\Lambda}^{1/2} still correctly
represents the data.
The reduction of the data in X to m dimensions, where m < p, is obtained by picking out the first m
columns of \bar{U} and \bar{\Lambda}^{1/2}.
The classical MDS as given above minimises the following loss function:
\sum_{1 \le i < j \le n} (\delta_{ij} - d_{ij})^2
where \delta_{ij} represents the distance between points i and j in the lower dimensional space, and d_{ij} is as above.
Given a data set, reducing the dimensionality using Classical Multidimensional Scaling actually gives
the same result as applying Principal Component Analysis. [1] [2]
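A compact numerical sketch of the classical MDS reduction described above, assuming only the n × n matrix of pairwise Euclidean distances is available; the helper name and the toy data are ours.

import numpy as np

def classical_mds(D, m=2):
    # D: n x n matrix of pairwise distances; returns n x m coordinates
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix J = I - (1/n) h h^T
    T = -0.5 * J @ (D ** 2) @ J                  # double centring with A_ij = d_ij^2
    eigvals, U = np.linalg.eigh(T)
    order = np.argsort(eigvals)[::-1]            # eigenvalues in decreasing order
    eigvals, U = eigvals[order][:m], U[:, order][:, :m]
    return U * np.sqrt(np.maximum(eigvals, 0))   # X = U_bar Lambda_bar^{1/2}

# toy usage: recover 2-D coordinates (up to rotation and translation) of five points
rng = np.random.default_rng(1)
points = rng.normal(size=(5, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(D, m=2)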
2.2.5 Extensions to the Classical MDS
• The algorithm can be adapted to minimise different loss functions. For example, the weighted loss
function below can be minimized:
\sum_{1 \le i < j \le n} a_{ij}(\delta_{ij} - d_{ij})^2
This class of multidimensional scaling is known as metric MDS.
• Non-Metric MDS: instead of preserving the distances between the points, the rank of the similarity
between the data points is preserved i.e. if point x1 is closer to x2 than to x3 in the original
dimensions then this property is preserved after dimensionality reduction.
This is especially useful when the input data points are qualitative. A dissimilarity matrix is built
and Non-Metric MDS applied to this matrix. [2] [12]
2.3 Limitations of Linear Techniques
The limitations of the linear dimensionality reduction methods (including PCA and MDS) stem from
the fact that they are linear transformations. This is a major problem, as many real-world data sets are non-
linear, and the above algorithms fail to capture this. As a knock-on effect, small perturbations in non-linear
data can have a big influence on the principal components and the eigen-representation. [8]
PCA also assumes that the directions with the largest variance are the most important. While this is
usually the case, the following is an example where it is not. Suppose we have sets of data stacked on top
of each other (like pancakes), and we want to cluster these sets together. PCA will identify the length and
breadth of the ‘pancakes’ as the principal components. However, to separate these clusters of data we would
be more interested in the height axis of the data, which in this case is the axis of smallest variance. [13]
3 Non-Linear Methods
3.1 Locally Linear Embedding
Locally Linear Embedding (LLE) is a non-linear dimensionality reduction technique which aims to preserve
neighbourhood relations. The algorithm characterizes the local geometry of the dataset by finding linear
coefficients that reconstruct each data point from its neighbouring points. With linear methods such as
the PCA or MDS discussed earlier, faraway data points on non-linear manifolds are mapped to nearby
points in the plane, and thus the underlying structure of the manifold cannot be properly identified by
these algorithms.
Suppose there are n data points, x_1, . . . , x_n, each of dimensionality p. We want to map these data points
onto an m-dimensional space, where m < p. [14][15]
3.1.1 Procedure
The LLE algorithm of Roweis and Saul [14] is as follows:
1. Select suitable neighbours for each data point xi.
This can be done in several ways. Usually, the k nearest points (using the Euclidean Distance) are
taken to be the neighbours of a point. While it is not absolutely necessary to use the k-nearest
neighbours here, the important thing is to establish some neighborhood for each point in a way
which conforms or adapts to the data. One could also consider all points within a fixed radius to be
neighbours, for example.
2. For each data point x_i, compute weights W_{ij} that best linearly reconstruct it from its neighbours.
The weights are computed such that the following cost function is minimised:
\varepsilon(W) = \sum_{i=1}^{n} \left\| x_i - \sum_{j=1}^{n} W_{ij} x_j \right\|^2
subject to the following constraints:
(a) Each data point is constructed only from its neighbours, i.e. Wij = 0 if xj does not belong to
the set of neighbours of xi.
(b) The rows of the weight matrix sum to 1: \sum_{j=1}^{n} W_{ij} = 1. This is to ensure that the LLE is
invariant under translation. This means that if we add a constant vector c to x_i and all of its
neighbours, the function to be minimised remains unchanged:
(x_i + c) - \sum_{j=1}^{n} W_{ij}(x_j + c) = x_i + c - \sum_{j=1}^{n} W_{ij} x_j - c = x_i - \sum_{j=1}^{n} W_{ij} x_j
which is precisely the term in the cost function (and is to be minimised). Of course, if the
number of neighbours, k, is greater than the number of variables, p, then each data point can
be written exactly as a linear combination of its neighbours. Otherwise, the weights Wij can
be found by solving a least squares problem. In certain applications, one can also impose the
constraint that the weights are all positive. These weights are invariant to any rotation, scaling
and translations of any data point - an important property as we shall see later.
3. Compute the low-dimensional embedding vectors yi that minimise the embedding cost function
\phi(Y) = \sum_{i=1}^{n} \left\| y_i - \sum_{j=1}^{n} W_{ij} y_j \right\|^2
where Y is a matrix containing the lower-dimensional data points. Although this looks similar to the
cost function earlier, this time, the weights are fixed (calculated from step 2) and the coordinates yi
are to be optimised. As before, there are two constraints that need to be imposed on this optimisation
problem:
(a) \frac{1}{n}\sum_{i=1}^{n} y_i = 0
If the mean vector was not 0, we could just subtract it from all the embedded data points
without changing the quality of the solution, so this constraint is just for convenience.
(b) \frac{1}{n} Y^T Y = I, where I is the m-dimensional identity matrix.
This is just to ensure that the variance-covariance matrix of Y is the m-dimensional identity
matrix, that is, the coordinates are all uncorrelated, and they have equal variance.
The LLE relies on a linear mapping (consisting of translations, rotations and scaling) that maps data
points in a high dimensional space to a lower dimensional space. As mentioned earlier, the weights
constructed from each point's neighbours are invariant to these transformations. The weights that
reconstruct the data points in the p-dimensional space should therefore also reconstruct the embedded
coordinates in m dimensions. Hence, local geometries of the high dimensional space are expected to remain
valid in the lower dimensional space. The implementation of the LLE algorithm is also rather
straightforward, as it only has one free parameter, k, the number of neighbours we choose for each data
point.
Figure 4: The LLE algorithm [14]
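The three steps of the procedure in Section 3.1.1 translate into the short NumPy sketch below. The regularisation added when solving for the weights, and the use of the bottom eigenvectors of (I − W)ᵀ(I − W) to solve the constrained minimisation in Step 3, follow the standard treatment of Roweis and Saul [14]; the parameter values are illustrative choices of ours.

import numpy as np

def lle(X, k=10, m=2, reg=1e-3):
    # Locally Linear Embedding of the n x p array X into m dimensions
    n = X.shape[0]
    # Step 1: k nearest neighbours of each point (excluding the point itself)
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    # Step 2: weights that best linearly reconstruct each point from its neighbours
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbours[i]] - X[i]               # neighbours centred on x_i
        C = Z @ Z.T                               # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)        # regularise in case C is singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbours[i]] = w / w.sum()         # rows sum to one
    # Step 3: embedding given by the bottom non-constant eigenvectors of M (up to scaling)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, V = np.linalg.eigh(M)
    return V[:, 1:m + 1]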
3.2 Diffusion Maps
3.2.1 Theory
We define the connectivity of two data points, x_i, x_j ∈ S, where S = \{x_l\}_{l=1}^{n} is the data set, to be the
probability of jumping from xi to xj in one step of a random walk. We can express this connectivity in
terms of a ‘kernel’, which defines a measure of similarity within a certain neighbourhood and the function
is almost zero outside of the neighbourhood. The kernel satisfies the following properties:
1. k(xi, xj) = k(xj, xi)
2. k(xi, xj) ≥ 0
These properties allow us to follow the process of diffusion maps, as we shall see later on.
The most commonly used kernel in the literature on this topic [16][17] is the Gaussian kernel, which is
defined as
k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\alpha} \right) \qquad (1)
We choose \varepsilon such that 0 < \varepsilon \ll 1, and the neighbourhood of x_i is defined as the set of x_j such that
k(x_i, x_j) > \varepsilon. We can change \varepsilon and \alpha to vary the properties of the neighbourhood.
We define d(x_i):
d(x_i) = \sum_{y \in S} k(x_i, y) \qquad (2)
Then the probability of jumping from x_i to x_j in one step of a random walk, or the connectivity of x_i and
x_j, can be expressed by:
p(x_i, x_j) = \frac{1}{d(x_i)} k(x_i, x_j) \qquad (3)
d(x_i) is the normalizing constant, and we see that the sum, over all the points y in the data set S, of the
probabilities of jumping from x_i to y is 1:
\sum_{y \in S} p(x_i, y) = \sum_{y \in S} \frac{1}{d(x_i)} k(x_i, y) = \frac{1}{d(x_i)} \sum_{y \in S} k(x_i, y) = 1
We can now define a matrix P, with entries Pij = p(xi, xj) where xi, xj ∈ S, to be the diffusion matrix.
The i, jth
entry is the connectivity between data points xi and xj. In the parallel study of a random walk,
we would say that each entry represents the probability of jumping from point i to point j in one step. We
can take this matrix to the power t and note that we now have entries equal to the probability of going
from point i to point j in a random walk of t steps. For example, with 3 points and two steps,
\begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix}^2 =
\begin{pmatrix}
p_{11}^2 + p_{12}p_{21} + p_{13}p_{31} & p_{11}p_{12} + p_{12}p_{22} + p_{13}p_{32} & p_{11}p_{13} + p_{12}p_{23} + p_{13}p_{33} \\
p_{21}p_{11} + p_{22}p_{21} + p_{23}p_{31} & p_{21}p_{12} + p_{22}^2 + p_{23}p_{32} & p_{21}p_{13} + p_{22}p_{23} + p_{23}p_{33} \\
p_{31}p_{11} + p_{32}p_{21} + p_{33}p_{31} & p_{31}p_{12} + p_{32}p_{22} + p_{33}p_{32} & p_{31}p_{13} + p_{32}p_{23} + p_{33}^2
\end{pmatrix}
In order to find the underlying structure of the dataset, we take P to a number of powers that has to be
determined. If we take it to too high a power then we will have so many steps in our random walk that
all points will be well connected. If we take too small a power, we may be limiting the number of steps
in our random walk to the extent that we don’t get a proper appreciation of the underlying geometry of
the data. Random walks along the data with lots of small steps will have a much higher probability than
those with any big steps. These small-step paths will follow the underlying structure of the data.
We now define the diffusion distance to be
D_t(x_i, x_j)^2 = \sum_{y \in S} \left\| p_t(x_i, y) - p_t(x_j, y) \right\|^2 \qquad (4)
             = \sum_{m=1}^{n} \left\| (P^t)_{im} - (P^t)_{jm} \right\|^2 \qquad (5)
where S = \{x_1, \ldots, x_n\} and we define p_t(x_i, x_j) = (P^t)_{ij}. We have not yet defined which metric, \|\cdot\|, we
are using. It should be noted that this distance doesn't suffer from the effects of noise, as it sums over
all paths between the two points. p_t(\cdot, \cdot), in the analogy to a random walk, represents the probability of
going from one point to another in t steps.
As paths that aren't on the underlying structure of the dataset will have small probabilities, the most
influential paths on the diffusion distance will be those on the underlying structure. If points x and y are
well connected, the probabilities for paths between x and w, and y and w, where w is a third point, will
be similar.
It may be apparent that calculating diffusion distances is a time consuming process and therefore it is
best to map the data points into a Euclidean space with distance between the mapped points the same as
the diffusion distance between the original points. We call this new Euclidean space the ‘diffusion space.’
The map between the space containing the data points and the diffusion space is the diffusion map. The
diffusion map will preserve the geometry of the data points which we assume to be of lower dimension
than the space containing the data points. In this case, preservation of geometry is expressed by diffusion
distances being equal to new Euclidean distances, as above. A good approximation for that Euclidean
distance can be obtained in a diffusion space of fewer dimensions than the original data space, as we shall
see.
Define
Y_i := \begin{pmatrix} p_t(x_i, x_1) \\ \vdots \\ p_t(x_i, x_n) \end{pmatrix} = \left[ (P^t)_{i\cdot} \right]^T \quad \text{(the ith row of } P^t \text{, transposed)} \qquad (6)
Now, the distance between Y_i and Y_j, in any metric, squared, is
\|Y_i - Y_j\|^2 = \sum_{y \in S} \left\| p_t(x_i, y) - p_t(x_j, y) \right\|^2 = D_t(x_i, x_j)^2 \text{ in that metric} \qquad (7)
We now have the distance between Yis to be the same as the diffusion distance, but we have no dimension
reduction. We must find a way of ignoring the least important dimensions of the diffusion space. We use
the following:
Lemma
Suppose K is a symmetric, n × n kernel matrix such that K_{ij} = k(x_i, x_j). Then, we can use the diagonal
matrix D = diag(d(x_1), \ldots, d(x_n)) to normalise K and produce a diffusion matrix
P = D^{-1} K. \qquad (8)
Then, the matrix P', defined as
P' = D^{1/2} P D^{-1/2}, \qquad (9)
1. is symmetric,
2. has the same eigenvalues as P,
3. has eigenvectors \tilde{x}_k such that, when multiplied by D^{-1/2} and D^{1/2}, they give the right and left
eigenvectors of P, respectively.
Proof
P' = D^{1/2} P D^{-1/2} = D^{1/2} D^{-1} K D^{-1/2} = D^{-1/2} K D^{-1/2}
K is symmetric, so P' will also be symmetric. P' being symmetric implies that we can find Q and V such
that
P' = Q V Q^T
where V is diagonal and contains the eigenvalues of P', and Q's columns are orthonormal eigenvectors of
P'. Q is orthogonal, so Q^{-1} = Q^T. Now,
P = D^{-1/2} P' D^{1/2} = D^{-1/2} Q V Q^T D^{1/2} = D^{-1/2} Q V Q^{-1} D^{1/2} = (D^{-1/2} Q) V (D^{-1/2} Q)^{-1} = Z V Z^{-1} \qquad (10)
for Z = D^{-1/2} Q. We see that the eigenvalues of P and P' are the same. Also, the right eigenvectors of P
are the columns of Z and the left eigenvectors of P are the rows of Z^{-1} = Q^T D^{1/2}. This implies that the
right eigenvectors of P are
\tilde{v}_l = D^{-1/2} \tilde{x}_l \qquad (11)
where \tilde{x}_l is an eigenvector of P'.
Similarly, for the left eigenvectors of P,
\tilde{u}_l = D^{1/2} \tilde{x}_l. [17] \qquad (12)
Then, by (10), we have
P = \sum_{k=1}^{n} \lambda_k \tilde{v}_k \tilde{u}_k^T \qquad (13)
Also,
P^t = (Z V Z^{-1})^t = Z V^t Z^{-1} = \sum_{k=1}^{n} \lambda_k^t \tilde{v}_k \tilde{u}_k^T \qquad (14)
We notice that the ith row of P^t can be written as a sum of the left eigenvectors as follows
(P^t)_{i\cdot} = \sum_{k=1}^{n} \lambda_k^t \tilde{v}_k(i) \tilde{u}_k^T \qquad (15)
where \tilde{v}_k(i) is the ith entry of \tilde{v}_k.
The left eigenvectors are a basis for the rows of P^t, so we can represent the rows of P^t as vectors in the
co-ordinate system with the left eigenvectors as the basis. Note that we can write the ith row of P^t, in
this new co-ordinate system, as
L_i = \begin{pmatrix} \lambda_1^t \tilde{v}_1(i) \\ \vdots \\ \lambda_n^t \tilde{v}_n(i) \end{pmatrix} \qquad (16)
The issue is that this will almost definitely not be an orthonormal basis. However, the basis of eigenvectors
of P' was orthonormal, and we use this fact to get an inner product for which the left eigenvectors are an
orthonormal basis.
1 = \tilde{x}_k^T \tilde{x}_k \quad \text{for all } k \in \{1, \ldots, n\}
  = (D^{-1/2} \tilde{u}_k)^T (D^{-1/2} \tilde{u}_k) \quad \text{by (12)}
  = \tilde{u}_k^T D^{-1} \tilde{u}_k
0 = \tilde{x}_i^T \tilde{x}_j \quad \text{for } i \neq j \text{ with } i, j \in \{1, \ldots, n\}
  = (D^{-1/2} \tilde{u}_i)^T (D^{-1/2} \tilde{u}_j) \quad \text{by (12)}
  = \tilde{u}_i^T D^{-1} \tilde{u}_j
We see that the left eigenvectors of P are orthonormal under the inner product using the matrix D^{-1}.
D^{-1} is a diagonal matrix with positive entries, so it is clearly a symmetric, positive definite matrix and the
inner product is well defined.
We can calculate the Euclidean distance squared between L_i and L_j for i, j ∈ \{1, \ldots, n\}.
\|L_i - L_j\|_E^2 = \sum_{k=1}^{n} \lambda_k^{2t} (\tilde{v}_k(i) - \tilde{v}_k(j))^2
= \sum_{l=1}^{n} \sum_{k=1}^{n} \tilde{x}_l^T \tilde{x}_k \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \quad \text{as the vectors } \{\tilde{x}_q\}_{q=1}^{n} \text{ are orthonormal}
= \left( \sum_{l=1}^{n} \tilde{x}_l^T \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \right) \left( \sum_{k=1}^{n} \tilde{x}_k \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \right)
= \left( \sum_{l=1}^{n} \tilde{x}_l^T \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \right) D^{1/2} D^{-1} D^{1/2} \left( \sum_{k=1}^{n} \tilde{x}_k \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \right)
= \left( \sum_{l=1}^{n} \tilde{x}_l^T \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) D^{1/2} \right) D^{-1} \left( \sum_{k=1}^{n} \tilde{x}_k^T \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) D^{1/2} \right)^T
= \left\| \sum_{k=1}^{n} \tilde{x}_k^T \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) D^{1/2} \right\|_{[D^{-1}]}^2
= \left\| \sum_{k=1}^{n} \tilde{x}_k^T D^{1/2} \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \right\|_{[D^{-1}]}^2
= \left\| \sum_{k=1}^{n} \lambda_k^t \tilde{u}_k^T (\tilde{v}_k(i) - \tilde{v}_k(j)) \right\|_{[D^{-1}]}^2 \quad \text{by (12)}
= \left\| \sum_{k=1}^{n} \lambda_k^t \tilde{u}_k^T \tilde{v}_k(i) - \sum_{k=1}^{n} \lambda_k^t \tilde{u}_k^T \tilde{v}_k(j) \right\|_{[D^{-1}]}^2
= \left\| (P^t)_{i\cdot} - (P^t)_{j\cdot} \right\|_{[D^{-1}]}^2 \quad \text{by (15)}
= \|Y_i - Y_j\|_{[D^{-1}]}^2 \quad \text{by (6)}
= D_t(x_i, x_j)^2_{[D^{-1}]} \quad \text{by (7)}
This shows us that the diffusion distance in the space containing the data points is the same as the
Euclidean distance in the diffusion space. We notice that our calculations of Euclidean distance in the
diffusion space are most affected by the dominant eigenvalues. We can reduce to m dimensions by taking
only the co-ordinates of Li in the dimensions corresponding to the dominant eigenvalues.
3.2.2 Procedure
Given a data set \{x_i\}_{i=1}^{n} of n points, the basic algorithm for a diffusion map dimensionality reduction
is as follows:
1. Define a kernel k(x, y) with properties as defined above. Create a kernel matrix, K, with K_{ij} =
k(x_i, x_j).
2. Find the sum of each of the rows of K and, for the ith row, call this d(x_i). Construct the diagonal
matrix D = diag(d(x_1), \ldots, d(x_n)). Use this to find the diffusion matrix P = D^{-1} K.
3. Decide how many dimensions you want the diffusion space to be (say m) and find the m most
dominant eigenvalues of the diffusion matrix and their corresponding eigenvectors.
4. Map into the lower dimensional diffusion space at time t by using (16), but only including the entries
corresponding to the m dominant eigenvalues. These vectors are your new lower dimensional data
in the diffusion space, \{y_i\}_{i=1}^{n}.
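A minimal NumPy sketch of this procedure, using the Gaussian kernel (1) and the symmetric matrix P' of the Lemma to compute the eigenvectors; the parameter values α and t are illustrative choices of ours and typically need tuning for a given data set.

import numpy as np

def diffusion_map(X, m=2, alpha=1.0, t=1):
    # Step 1: Gaussian kernel matrix
    sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / alpha)
    # Step 2: normalise by the row sums d(x_i) to obtain the diffusion matrix P
    d = K.sum(axis=1)
    P = K / d[:, None]
    # Step 3: eigenvalues/eigenvectors via the symmetric matrix P' = D^{1/2} P D^{-1/2}
    P_sym = np.diag(d ** 0.5) @ P @ np.diag(d ** -0.5)
    eigvals, x_tilde = np.linalg.eigh(P_sym)
    order = np.argsort(eigvals)[::-1]
    eigvals, x_tilde = eigvals[order], x_tilde[:, order]
    v = np.diag(d ** -0.5) @ x_tilde              # right eigenvectors of P, as in (11)
    # Step 4: the first eigenvalue is 1 with a trivial eigenvector, so keep the next m
    # co-ordinates lambda_k^t v_k(i), as in (16)
    return (eigvals[1:m + 1] ** t) * v[:, 1:m + 1]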
4 Comparison of PCA and Diffusion Maps
In this section we will compare the performance of PCA and Diffusion Maps on two different data sets,
both of which contain 1000 data points in 3 dimensions. The data sets are both generated according to a
one-dimensional underlying parameter and perturbed by some noise. To assess the two methods, we will
perform dimensionality reduction on the data sets, and then see if the first non-trivial co-ordinate of the
reduced data is in one-to-one correspondence with the underlying parameter.
4.1 Linear Dataset
The first data set we look at is generated by the following formula:
\begin{pmatrix} 3 \\ 7 \\ -1 \end{pmatrix} T + \epsilon
where T is the underlying parameter of the data set and is sampled from a Uniform(0, 10) distribution, and
\epsilon is sampled from a MVN(0, I_3) distribution and acts as a small noise perturbation.
The data is plotted in Figure 5.
Figure 5: Plot of the Linear Data set
(a) PCA (b) Diffusion Map
Figure 6: Plot of T vs reduced co-ordinates
(a) PCA (b) Diffusion Map
Figure 7: Plot of the Linear data set coloured using the PCA and Diffusion Map co-ordinate
From Figure 6, we see that for both PCA and Diffusion Maps, the parameter T is monotonic with the
first non-trivial co-ordinates of the reduced data.
Figure 7 shows a plot of the original data with points coloured from red to blue according to the deciles
of the first non-trivial co-ordinates of the reduced data.
4.2 Non-Linear Dataset
The second data set was generated by the following formula:
A \begin{pmatrix} \cos(A) \\ \sin(A) \\ \cos(3A) \end{pmatrix} + \frac{\epsilon}{30}
where A is the underlying parameter sampled from a Uniform(0, 2π) distribution and \epsilon is sampled from a
MVN(0, I_3) distribution and again acts as a noise perturbation.
Figure 8: Plot of the Non-Linear Data set
We perform the same procedure as with the linear data set.
(a) PCA (b) Diffusion Map
Figure 9: Plot of A vs reduced co-ordinates
In Figure 9 we see that the first co-ordinate of PCA is not monotonic with the parameter A, and
thus, it has not managed to capture the underlying structure of the data. However, the first non-trivial
co-ordinate of the diffusion map is monotonic with respect to A, and has therefore done a better job of
reducing the dimension of the data.
(a) PCA (b) Diffusion Map
Figure 10: Plot of the Non-Linear data set coloured using the PCA and Diffusion Map co-ordinate
In Figure 10 we can see clearly that when coloured by the Diffusion Map co-ordinate, our plot changes
from red to blue monotonically as we travel along the curve. However, with the PCA co-ordinate, the
colour changes from purple to blue to red to purple again, showing that PCA has not found the underlying
structure of the data. [18] [22][24][25]
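For reference, the two data sets in this section can be generated in a few lines following the formulas above, and the comparison repeated by checking whether the first non-trivial reduced co-ordinate is monotonic in the underlying parameter. The sketch below reuses the pca and diffusion_map functions sketched in earlier sections; the seed, the kernel width and the use of a rank correlation as a rough monotonicity check are our own choices.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n = 1000

# linear data set: (3, 7, -1)^T T + eps, with T ~ Uniform(0, 10), eps ~ MVN(0, I_3)
T = rng.uniform(0, 10, n)
linear = np.outer(T, [3, 7, -1]) + rng.normal(size=(n, 3))

# non-linear data set: A (cos A, sin A, cos 3A)^T + eps/30, with A ~ Uniform(0, 2*pi)
A = rng.uniform(0, 2 * np.pi, n)
nonlinear = (A[:, None] * np.column_stack([np.cos(A), np.sin(A), np.cos(3 * A)])
             + rng.normal(size=(n, 3)) / 30)

# first reduced co-ordinate of the non-linear data from each method (see earlier sketches);
# the rank correlation with A indicates how close each co-ordinate is to being monotonic in A
pca_coord = pca(nonlinear)[0][:, 0]
dmap_coord = diffusion_map(nonlinear, m=1, alpha=0.1)[:, 0]   # kernel width may need tuning
print(spearmanr(A, pca_coord), spearmanr(A, dmap_coord))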
5 Clustering and dimensionality reduction applications
5.1 Clustering
5.1.1 Theory
‘Data clustering (or just clustering), also called cluster analysis, segmentation analysis, taxonomy analysis,
or unsupervised classification, is a method of creating groups of objects, or clusters, in such a way that
objects in one cluster are very similar and objects in different clusters are quite distinct.’ ([19],p.1)
Several methods relying on different notions of similarity are used to cluster the data. Some common
clustering methods are:
• Hierarchical Clustering
A hierarchy of clusters is obtained. There are two ways of going about hierarchical clustering: an
agglomerative approach and a divisive approach.
The agglomerative approach consists of starting with n individual clusters and progressively merging
the two most similar clusters at each stage until a single cluster is obtained. The divisive approach
involves starting with a single cluster and progressively splitting the clusters until n clusters each
containing a single data point is obtained.
An example of a hierarchical cluster of the data points p, q, r, s, t is
Figure 11: Example of a hierarchical cluster of the data points p, q, r, s, t [20]
• Distribution Clustering
Clusters are created by grouping data which are more likely to be drawn from the same distribution.
Different clusters may be described by a distribution with different parameters or they may be
described by altogether different distributions.
• Sums of Squares Clustering
Clusters are created such that the intra-cluster sum of squares is minimized. k-means clustering is
an example of Sums of Squares Clustering.
k-means clustering minimizes
\sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2
where c_j is the cluster centre of the cluster C_j. [2]
For this project we will use the k-means Clustering method. It is one of the most popular clustering
methods as it is fast and it is easily implemented. k-means clustering can however be sensitive to the
chosen starting points of the algorithm. It also requires the number of clusters to be defined by the user.
5.1.2 Procedure
The k-means clustering algorithm is as follows:
1. Randomly choose k points as cluster centres.
2. Each data point is placed in the cluster whose cluster centre it is closest to.
3. Calculate new cluster centres by setting them to the mean of the points in their corresponding cluster.
4. Each data point is again placed in the cluster whose cluster centre it is closest to.
5. Repeat Steps 3 and 4 until none of the data points change cluster, or until a specified maximum
number of iterations is reached. [12]
(a) Given these data points (b) Connect all data points to the nearest cluster centre
(c) Find the new cluster centres (d) Repeat (b) and (c) until no data points change clusters
Figure 12: Visualization of the k-means clustering algorithm [21]
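A direct translation of these steps into NumPy; the convergence test and the guard against empty clusters are our only additions to the algorithm as stated.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest cluster centre
        labels = np.argmin(np.linalg.norm(X[:, None] - centres[None, :], axis=-1), axis=1)
        # Step 3: recompute each centre as the mean of the points in its cluster
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):     # Step 5: stop once no centre moves
            break
        centres = new_centres
    return labels, centres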
5.2 Use of Diffusion Maps in Clustering
In this section we will give an example where Diffusion Maps can help cluster non-linear data with the
k-means algorithm. We will use a ‘Chainlink’ toy clustering data set [26], which has 1000 data points in 3
dimensions that form the shape of two interlocked rings (shown in figure 13). We would like our clustering
technique to identify each ring as a separate cluster. We will try applying the k-means algorithm to the
data set in its original co-ordinates. We will also try computing the diffusion co-ordinates of the data
set and then applying the k-means algorithm to the data in the diffusion co-ordinates and compare the
results. Note that in both cases we will use k=2 as the number of clusters for k-means.
Figure 13: Plot of the Chainlink data set [26]
(a) K-Means applied directly (b) K-Means applied to Diffusion Map co-ordinates
Figure 14: Plot of Clustered data set
We can see in figure 14(a) that k-means applied directly to the data set fails to identify the two rings
as clusters; this is because it uses Euclidean distances. Transforming the data into diffusion co-ordinates
allows us to identify each ring as a separate cluster.
This occurs because the diffusion distance between two points is small if they are connected by lots of
points which are close together, and so any two points within the same ring will have small diffusion
distance between them. By contrast, points in separate rings will have a large diffusion distance, as
there is a low probability of 'jumping' from one ring to the other in the diffusion process. [22][24]
5.3 Application to Image Processing
In this section we will be investigating the data set shown in figure 15
(a) Our data set
(b) 2 original images [27]
Figure 15
This data set consists of 2 different images, each rotated 40 times. The images are each 160x160 pixels,
each with 3 RGB values, and so our data set has size n = 80, with p = 76800 (160 × 160 × 3) variables.
We would like to have an algorithm that can separate the data into two clusters, each containing the
rotations of one of the original images.
We will first try applying k-means directly to the data set, and then try applying k-means to the data set
reduced to diffusion co-ordinates (using k=2 in both cases). We ran both algorithms 1000 times; the total
run time and number of times the algorithms successfully cluster the data set into clusters of the original
images are shown in Figure 16.
It is clear that in this case applying k-means to the data in diffusion co-ordinates is both faster to run,
and more successful.
Method              Time (s)   # Successes
Direct k-means      918.97     114
Diffusion k-means   18.97      988
Figure 16
Figure 17: Plot of the data set with respect to the first and second Diffusion co-ordinates
In figure 17 we have reduced the dimension of the data to the first 2 diffusion co-ordinates so that we
can plot it and therefore more easily visualise it. The two images are clearly separated into two clusters
as expected from our clustering results above.
Now we would like to investigate whether the Diffusion Map has managed to maintain the underlying
structure of the data; since the data within each cluster are generated by a one dimensional parameter
(degree of rotation), we can imagine that the data points approximately lie on some curve in \mathbb{R}^p,
with degree of rotation monotonically increasing as we travel along the curve. We would like the same to
be true in our reduced data set.
Figure 18: Close up of each cluster
We can see clearly in figure 18 that each cluster forms a curve, and as we travel along each curve, the
degree of rotation of the image changes monotonically. Therefore even though we have drastically reduced
the dimension of the data, from 76800 to 2, we have preserved the underlying structure. [22][23]
6 Application to financial data
In this section, we will show some of the things that dimensionality reduction allows us to do in the study
of large data sets. We take the data for the S&P 500 from 1st November 2014 to 9th November 2015 [28].
We treat each of the 494 companies as a 259-dimensional data point, where 259 is the number of trading
days between the above dates. We will use Diffusion Maps to reduce to two dimensions and then use
k-means clustering to get an impression of which companies' stock prices tend to follow similar paths.
Before analysing the data, we normalised all of the data points. We did this to best evaluate companies
whose stock prices follow similar paths relative to their means as well as relative to their stock price. This
is necessary in order to compare companies with vastly different stock prices. For example, Netflix’s stock
price varies from about 100 to 700 over the time period, averaging about 350, whilst Xerox’s stock price
varies from 9.5 to 15, averaging 12. First, we found the mean, ¯Xi, of every company’s stock prices over
the year and standard deviation, si. For a company’s data, ˜Xi, we performed the following:
\tilde{X}_i \to \frac{1}{s_i} \left( \tilde{X}_i - \begin{pmatrix} \bar{X}_i \\ \vdots \\ \bar{X}_i \end{pmatrix} \right)
The following is a scatterplot of the first two diffusion coordinates, with the ten large black spots representing
the centroids of the clusters. It should be noted that the choice of k = 10 for the k-means algorithm
is, in this example, somewhat arbitrary. Anywhere between 6 and 15 clusters is reasonable, and this is
probably due to the sparsity of the data with the first diffusion co-ordinate between −2 and 2, as can be
seen in the figure.
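In code, the analysis described above amounts to normalising each company's price series and then running the diffusion map and k-means sketches from earlier sections. The array prices (one row of 259 daily prices per company) is hypothetical, and k = 10 matches the somewhat arbitrary choice discussed above.

import numpy as np

def normalise_rows(prices):
    # subtract each company's mean price and divide by its standard deviation s_i
    means = prices.mean(axis=1, keepdims=True)
    stds = prices.std(axis=1, ddof=1, keepdims=True)   # sample standard deviation (our choice)
    return (prices - means) / stds

# prices: hypothetical array of shape (494, 259), one row per S&P 500 company
# X = normalise_rows(prices)
# coords = diffusion_map(X, m=2, alpha=10.0)    # first two diffusion co-ordinates (alpha illustrative)
# labels, centres = kmeans(coords, k=10)        # cluster the companies in diffusion space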
Figure 19: Scatterplot of first two diffusion co-ordinates and cluster centroids
Figure 20 shows the points in their clusters where each cluster is a different colour:
Figure 20: Scatterplot of colour clustered points
We had originally surmised that companies in a similar industry should follow similar paths. We found
this to be false for all industries other than energy. Figure 21 is a plot of the diffusion coordinates with
energy companies highlighted.
Figure 21: Scatterplot with energy companies highlighted
We see that the large majority of the energy companies are in the left-most cluster. We now plot the
stock prices of ten of the energy companies in Figure 22 and note that they follow similar paths. We have
included Facebook's stock price, the black line, for contrast.
Figure 22: Plot of stock prices for ten energy companies and Facebook in black
Next, we will study five companies with varying distances between them in diffusion space. Here is the
scatterplot of the first two diffusion coordinates of the five companies.
Figure 23: Plot of diffusion coordinates for 5 companies
We compare the plots of the companies’ normalised stock prices as opposed to the full stock prices
compared in the example with energy companies. This is because Google’s stock price dwarfs those of the
others to the extent that they all look like straight lines.
Figure 24: Comparing plots for the 5 companies
It is quite clear that Google and Facebook follow the most similar paths. This is exactly what we
expected as they cluster together and have diffusion coordinates that are very close to each other. None
of the other companies follow paths that are particularly similar to each other.
One of the most important things that we saw in this data set was the increase in computational efficiency
due to diffusion maps. The MATLAB function for direct k-means on the full data set took, on average,
about 13 seconds, whilst diffusion maps and k-means on the diffusion coordinates took just over a second.
It is interesting to note that for this data set, we get similar results from PCA, with very similar clustering,
as well as computational efficiency increases. [29]
7 Conclusion
Dimensionality reduction is an important tool in data analysis as it allows us to more easily visualise data,
reduce computation time, and apply statistical techniques that may fail in high dimensions.
We have investigated linear and non-linear dimensionality reduction techniques, and have outlined limita-
tions with linear methods.
With particular focus on Diffusion Maps, we demonstrated applications in image processing and in cluster
analysis, which we applied to real financial data.
References
[1] Lee J.A., Verleysen M. Nonlinear Dimensionality Reduction. Springer Series Information Science and
Statistics. New York: Springer 2007.
[2] Webb A.R., Copsey K.D. Statistical Pattern Recognition 3rd Edition. UK: John Wiley & Sons, Ltd.
2011.
[3] Verleysen M., François D. The Curse of Dimensionality in Data Mining and Time Series Prediction.
In: Cabestany J., Prieto A., Sandoval F. (eds) Computational Intelligence and Bioinspired Systems.
2005.Volume 3512 of the series Lecture Notes in Computer Science p 758-770
[4] Shalizi C. Principal Components Analysis. Carnegie Mellon University; 2009 pp36 - 350. Available
from http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf [Accessed 5th June 2016]
[5] Socher, R. Manifold Learning and Dimensionality Reduction Available from:
http://www.socher.org/uploads/Main/DiffusionMapsSeminarReport RichardSocher.pdf [Accessed
2nd June 2016]
[6] Jolliffe I.T., Cadima J. Principal Component Analysis: A Review and Recent Developments. Philosophical
Transactions. Series A, Mathematical, Physical, and Engineering Sciences. 2016, Vol. 374(2065),
pp. 20150202. Available from DOI: 10.1098/rsta.2015.0202.
[7] Smith L.I. A Tutorial on PCA. Available from:
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf [Accessed 26th May
2016]
[8] Hlaváč V. Principal Component Analysis: Application to images. Czech Technical University in Prague.
Available from http://cmp.felk.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf [Accessed 28th May 2016]
[9] Mahvish Nasir. What is PCA (Explained from face recognition point of view). 2013; Available from
https://www.youtube.com/watch?v=g5 tonFnfaQ [Accessed 1st June 2016]
[10] Ingwer B., Groenen P.J.F. Modern Multidimensional Scaling Theory and Applications. 2nd
Edition. Springer Series in Statistics. Springer-Verlag New York Inc 2007. Avaliable from
http://link.springer.com/book/10.1007/0-387-28981-X/page/1 [Accessed 3rd June 2016]
[11] Young F.W. MULTIDIMENSIONAL SCALING Available from
http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html [Accessed 3rd June 2016]
[12] Vathy-Fogarassy A., János A. Graph-Based Clustering and Data Visualization Algorithms.
SpringerBriefs in Computer Science. London: Springer- Verlag 2013. Available from
http://link.springer.com/book/10.1007/978-1-4471-5158-6/page/1/ [Accessed 7th June 2016]
[13] What are some of the limitations of principal component analysis? Available from:
https://www.quora.com/What-are-some-of-the-limitations-of-principal-component-analysis [Ac-
cessed 3rd June 2016]
[14] Roweis S.T., Saul L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science
Magazine. 2000; 290(5500), :2323-2326
[15] Shalizi C. Non-Linear Dimensionality Reduction I: Local Linear Embedding. Carnegie Mellon
University; 2009 pp36 - 350. Available from http://www.stat.cmu.edu/~cshalizi/350/lectures/14/lecture-14.pdf
[Accessed 3rd June 2016]
[16] De la Porte J, Herbst B.M., Hereman W., Van Der Walt S.J. An introduction to diffusion maps. In:
Proceedings of the 19th Symposium of the Pattern Recognition Association of South Africa (PRASA
2008), Cape Town, South Africa 2008 Nov 26 (pp. 15-25).
[17] Coifman RR, Lafon S. Diffusion maps. Applied and computational harmonic analysis. 2006;21(1):5-30.
[18] Bubacarr B. Diffusion Maps: Analysis and Applications. MSc thesis. University of Oxford;2008
[19] Gan G., Ma C. and Wu J. Data Clustering: Theory, Algorithms, and Applications Philadelphia, PA
19104 ASA-SIAM Series on Statistics and Applied Probability, SIAM, , ASA, Alexandria, VA, 2007.
[20] Front Line Solvers. HIERARCHICAL CLUSTERING. Available from
http://www.solver.com/xlminer/help/hierarchical-clustering-intro [Accessed 9th June 2016]
[21] Shabalin, Andrey A. [Animated gifs.] Available from http://shabal.in/visuals/kmeans/1.html. [Ac-
cessed 8 June 2016]
[22] Joseph Richards (2014). diffusionMap: Diffusion map. R package version 1.1-0. https://CRAN.R-
project.org/package=diffusionMap
[23] Simon Urbanek (2014). jpeg: Read and write JPEG images. R package version 0.1-8.
https://CRAN.R-project.org/package=jpeg
[24] Ligges, U. and Mächler, M. (2003). Scatterplot3d - an R Package for Visualizing Multivariate Data.
Journal of Statistical Software 8(11), 1-20.
[25] Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer,
New York. ISBN 0-387-95457-0
[26] Ultsch, A.: Clustering with SOM: U*C, In Proc. Workshop on Self-Organizing Maps, Paris, France,
(2005) , pp. 75-82
[27] A J Mestel. Home Page. Available from: http://wwwf.imperial.ac.uk/~ajm8/ [Accessed 5th June
2016]
[28] Missaoui B. Scorex fellow in Statistics. Personal Communication. 31st May 2016
[29] Richards J. Diffusion Map. [MATLAB] Available from: http://www.stat.berkeley.edu/~jwrichar/software.html
[Accessed 1st June 2016]
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptx
 
Report
ReportReport
Report
 
Data mining of massive datasets
Data mining of massive datasetsData mining of massive datasets
Data mining of massive datasets
 
On the Numerical Solution of Differential Equations
On the Numerical Solution of Differential EquationsOn the Numerical Solution of Differential Equations
On the Numerical Solution of Differential Equations
 
An introduction to data cleaning with r
An introduction to data cleaning with rAn introduction to data cleaning with r
An introduction to data cleaning with r
 
Algorithms for Reinforcement Learning
Algorithms for Reinforcement LearningAlgorithms for Reinforcement Learning
Algorithms for Reinforcement Learning
 
Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...
 
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODSNOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
Barret templates
Barret templatesBarret templates
Barret templates
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality Reduction
 
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...Robustness in Deep Learning - Single Image Denoising using Untrained Networks...
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...
 

Dimensionality Reduction Techniques

a much smaller number of variables. [1]

There are two approaches to dimensionality reduction:

• Feature Selection
Data is described in terms of a subset of the original variables. Redundant and irrelevant variables are discarded.

• Feature Extraction
Data is transformed from a high-dimensional space to a lower-dimensional space. For example, Principal Component Analysis and Multidimensional Scaling are linear transformations of the data, while Diffusion Maps and Locally Linear Embedding are non-linear transformations. Some algorithms, such as Independent Component Analysis, aim to find unobserved variables that give rise to the data. [2]

The data set expressed in the lower dimension should preserve the properties and structure of the data set in the original dimension. For example, the dimensionality reduction method might be designed to preserve the pairwise distances between data points, or to preserve the variance of the data set. In this project, we will only consider Feature Extraction.

There are a few reasons to consider dimensionality reduction:

• Computational Motivation
Lower dimensional data takes up less storage. Processing of the lower dimensional data set demands less computational power and thus takes less time.

• Data Visualization and Interpretation
High dimensional data is difficult to visualize and understand geometrically. Reduction to 2-D or 3-D allows the data to be analyzed geometrically; for example, one can spot clusters and patterns in the data.

• Statistical Motivation
– In high dimensions, the so-called ‘curse of dimensionality’ occurs. The number of data points is very small compared to the ‘volume’ of the space. In such a situation, algorithms (for instance search algorithms) slow down and might even fail completely.
– In high dimensions, many surprising and counter-intuitive geometric results are seen. This makes data analysis in high dimensions harder.
– Dimensionality reduction prevents overfitting and hence makes models more effective and applicable to a wider range of problems.
– Dimensionality reduction eliminates collinearity and hence improves the performance of models.

For the reasons above, dimensionality reduction is also very useful in machine learning. [3]

This project will outline four dimensionality reduction techniques: Principal Component Analysis (PCA), Multidimensional Scaling (MDS), Locally Linear Embedding (LLE) and Diffusion Maps. We will then compare linear and non-linear methods with some examples and apply some of the methods to practical situations.

Throughout this report, we will assume that we have n data points, each of dimension p. We aim to express these points in m dimensions, where m < p.
2 Linear Methods

2.1 Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction method which aims to find new uncorrelated variables which maximize variance in the data. This involves solving the related eigenvector/eigenvalue problem of the covariance matrix of the data. PCA returns eigenvectors as the principal components, since maximizing the variance of the projections along the eigenvectors is equivalent to minimizing the residual sum of squares [4]. The first principal component is the direction with the greatest variability, the second principal component has the largest variance among all directions orthogonal to the first principal component, and so on.

PCA can be thought of as an operation that reveals the internal structure of a linear dataset by providing the major features/directions in the data. An orthogonal transformation is performed from a p-dimensional space to an m-dimensional space, where m < p. The resulting set of m variables are uncorrelated, and can be expressed as linear combinations of the p original, correlated variables. PCA is also a good tool for removing random noise in a data set, as only the eigenvectors which offer the greatest variability are retained, and the projection residuals of the data points are minimized.

Figure 1: PCA on a 2D data set ([5] p.3). (a) 2D data set with its principal components; (b) projection onto the first eigenvector.

In Figure 1a, the data points are approximately in the shape of an ellipse. Since this is a 2-D dataset, there will be 2 principal components. It is clear from the images above that the major axis of the ellipse is the eigenvector that offers the greatest variability, and thus it is the first principal component. The second principal component, being orthogonal to the first, will be the minor axis. Taking only the first principal component and projecting the data points onto it, we obtain a straight line (as seen in Figure 1b). Thus, the dimensionality of the dataset is now 1, and it is expressed in terms of this eigenvector.

2.1.1 Procedure

1. Suppose we have a p-dimensional set of data, and n observations. We intend to find the first m principal components, where m < p. Place the data in an n × p matrix, called X. Then, let \tilde{y} \in \mathbb{R}^p be the vector that contains the means of the columns of X, i.e.

   y_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}

2. Subtract the mean from each column to obtain a matrix A,

   A = X - \tilde{h}\tilde{y}^T

   where \tilde{h} is an n × 1 vector of ones. This is done to center the data around the origin. The matrix A takes the form

   A = \begin{pmatrix} x_{11} - y_1 & \dots & x_{1p} - y_p \\ \vdots & \ddots & \vdots \\ x_{n1} - y_1 & \dots & x_{np} - y_p \end{pmatrix}

3. Calculate the covariance matrix of the data, Σ. This is done by taking the inner product of A with itself, divided by n − 1 to account for the bias of the variance in a random sample:

   \Sigma = \frac{1}{n-1} A^T A

4. Find the eigenvalues and eigenvectors of Σ. Form the matrix P of eigenvectors and its associated diagonal matrix D such that

   \Sigma = P D P^T

   Since Σ is a symmetric p × p matrix, it has real eigenvalues and p orthonormal eigenvectors, meaning that such matrices P and D exist. Rearrange the entries of D and P so that the eigenvalues in D are in decreasing order. The eigenvectors in P should be of unit length.

5. Pick the first m columns of P as the principal components of the data. These are the eigenvectors which have the largest eigenvalues of Σ. To avoid too much loss of information, m is chosen so that

   \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{j=1}^{p} \lambda_j} \geq \pi

   where λ_1, λ_2, ..., λ_p are the eigenvalues of Σ in order of decreasing size. Here π is the minimum proportion of total variance that we would like to account for, and is a subjective threshold used to decide on the number of principal components, m, to be retained. Normally, greater weight is placed on the components with the greatest variability, but there are circumstances in which the last few may be of interest, such as in outlier detection or some applications of image analysis [6].

6. The final step of PCA is to perform the change of coordinates, such that the orthogonal principal components chosen form the axes of the new dataset. We form the matrix

   \hat{X} = A \hat{P}

   where the columns of \hat{P} are the first m eigenvectors in P. The rows of \hat{X} give the data expressed in terms of the principal components that we have chosen. [7]

2.1.2 PCA for Image Processing

Image processing is one of the most important uses of PCA at the moment. Suppose we have 50 images, each of dimension 200,000. To perform PCA, each pixel in each image is converted into a number from 0 to 255, representing its intensity on the greyscale. Then, similarly to the algorithm explained above, the high-dimensional images are broken down and a set of eigenvectors is obtained. When converted back to images, these vectors are known as eigenfaces or eigenimages. These are the features of the images that give the greatest variability across the set of images, for example the eyes or the shape of the faces. Once this is done, each individual image can be expressed as a weighted sum of these eigenfaces and reconstructed. [8] This is used in facial recognition and verification software, commonly employed in surveillance as well as biometric security. [9]
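The procedure above translates almost line by line into NumPy. The following is a minimal sketch of steps 1-6 under our own naming (the function pca and the toy data are illustrative, not the code used in this report):

import numpy as np

def pca(X, m):
    """Minimal PCA following the procedure above.
    X is an n x p data matrix (one observation per row); returns the data
    expressed in the first m principal components and the eigenvalues."""
    n, p = X.shape
    y = X.mean(axis=0)                      # step 1: column means
    A = X - y                               # step 2: centre the data
    Sigma = A.T @ A / (n - 1)               # step 3: sample covariance matrix
    eigvals, P = np.linalg.eigh(Sigma)      # step 4: eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]       # sort into decreasing order
    eigvals, P = eigvals[order], P[:, order]
    P_hat = P[:, :m]                        # step 5: keep the first m eigenvectors
    X_hat = A @ P_hat                       # step 6: project onto the principal components
    return X_hat, eigvals

# Example: 100 noisy points near a line in R^3 reduce well to m = 1 dimension.
rng = np.random.default_rng(0)
t = rng.uniform(0, 10, size=100)
X = np.outer(t, [3.0, 7.0, -1.0]) + rng.normal(scale=0.1, size=(100, 3))
X_hat, eigvals = pca(X, 1)
print(eigvals[:1].sum() / eigvals.sum())    # proportion of variance captured (close to 1)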
2.2 Multi-Dimensional Scaling

‘Multidimensional scaling (MDS) is a method that represents measurements of similarity (or dissimilarity) among pairs of objects as distances between points of a low-dimensional multidimensional space.’ ([10], p.3)

2.2.1 Motivating Example

Figure 2 gives the flying distances between 10 American cities. We attempt to find 2-dimensional coordinates for the cities (Figure 3).

Figure 2: Flying mileages between 10 American cities [11]

Figure 3: Classical MDS of flying mileages between 10 American cities [11]

2.2.2 Classical Multi-Dimensional Scaling

Goal: suppose we have n p-dimensional data points. Given only the distance matrix between the data points, we determine the data points themselves. That is, given the distance matrix

\begin{pmatrix} d_{11} & \dots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{n1} & \dots & d_{nn} \end{pmatrix}

where d_{ij} is the Euclidean distance between x_i and x_j, we attempt to find the matrix

X = \begin{pmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_n \end{pmatrix}

where each \tilde{x}_i is a p-dimensional data point.

2.2.3 Procedure

Write T = XX^T. Then

d_{ij}^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j = T_{ii} + T_{jj} - 2 T_{ij}

This can be rearranged to obtain

T_{ij} = -\frac{1}{2}\left[ d_{ij}^2 - T_{ii} - T_{jj} \right] = -\frac{1}{2}\left[ d_{ij}^2 - d_{i.}^2 - d_{.j}^2 + d_{..}^2 \right]

where

d_{i.}^2 = \frac{1}{n}\sum_{j=1}^{n} d_{ij}^2, \quad d_{.j}^2 = \frac{1}{n}\sum_{i=1}^{n} d_{ij}^2, \quad d_{..}^2 = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{i=1}^{n} d_{ij}^2

This is equivalent to writing T = -\frac{1}{2} J A J, where A_{ij} = d_{ij}^2, J = I - \frac{1}{n}\tilde{h}\tilde{h}^T and \tilde{h} is an n × 1 vector of ones.

Since T = XX^T, T is symmetric and positive semi-definite, and hence diagonalizable: T = U \Lambda U^T. Moreover, T has non-negative eigenvalues, and rank T = rank(XX^T) = rank X = p. This means that T has p positive eigenvalues and n − p eigenvalues identically equal to zero. (Here we are assuming that p < n.)

T = U \Lambda U^T = (U \Lambda^{1/2}) \Lambda^{1/2} U^T = (U \Lambda^{1/2})(U \Lambda^{1/2})^T

Then X = U \Lambda^{1/2}. Since there are n − p eigenvalues equal to 0, X = U' \Lambda'^{1/2}, where

U' = \begin{pmatrix} | & & | \\ \tilde{v}_1 & \dots & \tilde{v}_p \\ | & & | \end{pmatrix}, \quad \Lambda' = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{pmatrix}

and \tilde{v}_1, \dots, \tilde{v}_p are the eigenvectors of T corresponding to the non-zero eigenvalues \lambda_1, \dots, \lambda_p. [2] [12]

2.2.4 Application to Dimensionality Reduction

Given the matrix X containing n data points in p dimensions (one data point \tilde{x}_i per row), we compute the n × n distance matrix D with entries d_{ij}, the distance between the ith and jth data points. From the above procedure, X = U' \Lambda'^{1/2}. We can order the eigenvalues of \Lambda' in decreasing order to obtain \bar{\Lambda} and rearrange the eigenvectors in U' accordingly to obtain \bar{U}. The matrix \bar{X} = \bar{U}\bar{\Lambda}^{1/2} still correctly represents the data. The reduction of the data in X to m dimensions, where m < p, is obtained by picking out the first m columns of \bar{U} and \bar{\Lambda}^{1/2}.

The classical MDS as given above minimises the following loss function:

\sum_{1 \leq i \leq j \leq n} (\delta_{ij} - d_{ij})^2

where \delta_{ij} represents the distance between points i and j in the lower dimensional space, and d_{ij} is as above. Given a data set, reducing the dimensionality using Classical Multidimensional Scaling actually gives the same result as applying Principal Component Analysis. [1] [2]
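For completeness, classical MDS as described in 2.2.3-2.2.4 can be sketched in a similar way; the function name classical_mds and the toy example are our own, not code from the report:

import numpy as np

def classical_mds(D, m):
    """Classical MDS: D is an n x n matrix of pairwise Euclidean distances,
    m is the target dimension. Returns an n x m configuration of points."""
    n = D.shape[0]
    A = D ** 2                                  # A_ij = d_ij^2
    J = np.eye(n) - np.ones((n, n)) / n         # centring matrix J = I - (1/n) h h^T
    T = -0.5 * J @ A @ J                        # T = X X^T (double centring)
    eigvals, U = np.linalg.eigh(T)
    order = np.argsort(eigvals)[::-1]           # decreasing eigenvalues
    eigvals, U = eigvals[order], U[:, order]
    L = np.sqrt(np.clip(eigvals[:m], 0, None))  # clip tiny negative values from round-off
    return U[:, :m] * L                         # first m columns of U_bar Lambda_bar^(1/2)

# Example: recover a 2-D configuration from the distances of 5 random points.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, 2)   # Y matches X up to rotation, reflection and translation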
2.2.5 Extensions to the Classical MDS

• The algorithm can be adapted to minimise different loss functions. For example, the following weighted loss function can be minimized:

\sum_{1 \leq i \leq j \leq n} a_{ij}(\delta_{ij} - d_{ij})^2

This class of multidimensional scaling is known as metric MDS.

• Non-Metric MDS: instead of preserving the distances between the points, the rank of the similarity between the data points is preserved, i.e. if point x_1 is closer to x_2 than to x_3 in the original dimensions, then this property is preserved after dimensionality reduction. This is especially useful when the input data points are qualitative. A dissimilarity matrix is built and Non-Metric MDS is applied to this matrix. [2] [12]

2.3 Limitations of Linear Techniques

The limitations of the linear dimensionality reduction methods (including PCA and MDS) stem from the fact that they are linear transformations. This is a major problem, as many real-world data sets are non-linear, and the above algorithms fail to capture this. As a knock-on effect, small perturbations in non-linear data can have a big influence on the principal components and the eigen-representation. [8]

With PCA, one assumption the algorithm makes is that the directions with the largest variance are the most important. While this is usually the case, the following is an example where it is not. Suppose we have sets of data stacked on top of each other (like pancakes), and we want to cluster these sets together. PCA will say that the length and breadth of the ‘pancakes’ are the principal components. However, to separate these clusters of data we would be more interested in the height axis of the data, which in this case is the axis of smallest variance. [13]
3 Non-Linear Methods

3.1 Locally Linear Embedding

Locally Linear Embedding (LLE) is a non-linear dimensionality reduction technique which aims to preserve neighbourhood relations. The algorithm characterizes the local geometry of the dataset by finding linear coefficients that reconstruct each data point from its neighbouring points. With linear methods such as the PCA or MDS discussed earlier, faraway data points on non-linear manifolds are mapped to nearby points in the plane, and thus the underlying structure of the manifold cannot be properly identified by these algorithms. Suppose there are n data points, x_1, ..., x_n, each of dimensionality p. We want to map these data points onto an m-dimensional space, where m < p. [14][15]

3.1.1 Procedure

The LLE algorithm of Roweis and Saul [14] is as follows (a sketch implementation is given after this list):

1. Select suitable neighbours for each data point x_i. This can be done in several ways. Usually, the k nearest points (using the Euclidean distance) are taken to be the neighbours of a point. While it is not absolutely necessary to use the k nearest neighbours here, the important thing is to establish some neighbourhood for each point in a way which conforms or adapts to the data. One could also consider all points within a fixed radius to be neighbours, for example.

2. For each data point x_i, compute weights W_{ij} that best linearly reconstruct it from its neighbours. The weights are computed such that the following cost function is minimised:

\varepsilon(W) = \sum_{i=1}^{n} \left\| x_i - \sum_{j=1}^{n} W_{ij} x_j \right\|^2

subject to the following constraints:

(a) Each data point is reconstructed only from its neighbours, i.e. W_{ij} = 0 if x_j does not belong to the set of neighbours of x_i.

(b) The rows of the weight matrix sum to 1: \sum_{j=1}^{n} W_{ij} = 1. This ensures that the LLE is invariant under translation: if we add a constant vector c to x_i and all of its neighbours, the function to be minimised remains unchanged, since

\left\| x_i + c - \sum_{j=1}^{n} W_{ij}(x_j + c) \right\| = \left\| x_i + c - \sum_{j=1}^{n} W_{ij} x_j - c \right\| = \left\| x_i - \sum_{j=1}^{n} W_{ij} x_j \right\|

which is precisely the term appearing in the cost function to be minimised.

Of course, if the number of neighbours, k, is greater than the number of variables, p, then each data point can be written exactly as a linear combination of its neighbours. Otherwise, the weights W_{ij} can be found by solving a least squares problem. In certain applications, one can also impose the constraint that the weights are all positive. These weights are invariant to any rotation, scaling and translation of any data point, an important property as we shall see later.

3. Compute the low-dimensional embedding vectors y_i that minimise the embedding cost function

\phi(Y) = \sum_{i=1}^{n} \left\| y_i - \sum_{j=1}^{n} W_{ij} y_j \right\|^2

where Y is a matrix containing the lower-dimensional data points. Although this looks similar to the earlier cost function, this time the weights are fixed (calculated in step 2) and the coordinates y_i are to be optimised. As before, there are two constraints imposed on this optimisation problem:

(a) \frac{1}{n}\sum_{i=1}^{n} y_i = 0. If the mean vector were not 0, we could just subtract it from all the embedded data points without changing the quality of the solution, so this constraint is just for convenience.

(b) \frac{1}{n} Y^T Y = I, where I is the m-dimensional identity matrix. This ensures that the variance-covariance matrix of Y is the m-dimensional identity matrix, that is, the coordinates are all uncorrelated and have equal variance.

The LLE relies on a linear mapping (consisting of translations, rotations and scalings) that maps data points in a high dimensional space to a lower dimensional space. As mentioned earlier, the weights reconstructing each point from its neighbours are invariant to these transformations. The weights that reconstruct the data points in the p-dimensional space should therefore also reconstruct the embedded coordinates in m dimensions. Hence, local geometries of the high dimensional space are expected to remain valid in the lower dimensional space.

Figure 4: The LLE algorithm [14]

The implementation of the LLE algorithm is also rather straightforward, as it only has one free parameter, k, the number of neighbours chosen for each data point.
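The sketch promised above: a compact NumPy version of the three steps. This is an illustrative implementation rather than the authors' code; in particular, the regularisation of the local Gram matrix is an assumption we add so that the linear solve remains stable when k > p:

import numpy as np

def lle(X, k, m, reg=1e-3):
    """Locally Linear Embedding: X is n x p, k the number of neighbours,
    m the embedding dimension. Returns an n x m embedding."""
    n, p = X.shape
    W = np.zeros((n, n))
    for i in range(n):
        # Step 1: k nearest neighbours by Euclidean distance (excluding the point itself).
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]
        # Step 2: weights minimising ||x_i - sum_j W_ij x_j||^2 with rows summing to 1.
        Z = X[nbrs] - X[i]                  # centred neighbours
        G = Z @ Z.T                         # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)  # regularise in case G is singular (k > p)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()
    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W); the constant one is discarded.
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:m + 1]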
3.2 Diffusion Maps

3.2.1 Theory

We define the connectivity of two data points x_i, x_j \in S, where S = \{x_l\}_{l=1}^{n} is the data set, to be the probability of jumping from x_i to x_j in one step of a random walk. We can express this connectivity in terms of a ‘kernel’, which defines a measure of similarity within a certain neighbourhood and is almost zero outside of the neighbourhood. The kernel satisfies the following properties:

1. k(x_i, x_j) = k(x_j, x_i)
2. k(x_i, x_j) \geq 0

These properties allow us to follow the process of diffusion maps, as we shall see later on. The most commonly used kernel in the literature on this topic [16][17] is the Gaussian kernel, defined as

k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\alpha} \right) \quad (1)

We choose \varepsilon such that 0 < \varepsilon \ll 1, and the neighbourhood of x_i is defined as the set of x_j such that k(x_i, x_j) > \varepsilon. We can change \varepsilon and \alpha to vary the properties of the neighbourhood. We define d(x_i) by

d(x_i) = \sum_{y \in S} k(x_i, y) \quad (2)

Then the probability of jumping from x_i to x_j in one step of a random walk, i.e. the connectivity of x_i and x_j, can be expressed as

p(x_i, x_j) = \frac{1}{d(x_i)} k(x_i, x_j) \quad (3)

Here d(x_i) is the normalizing constant, and we see that the sum, over all points y in the data set S, of the probabilities of jumping from x_i to y is 1:

\sum_{y \in S} p(x_i, y) = \sum_{y \in S} \frac{1}{d(x_i)} k(x_i, y) = \frac{1}{d(x_i)} \sum_{y \in S} k(x_i, y) = 1

We can now define a matrix P, with entries P_{ij} = p(x_i, x_j) for x_i, x_j \in S, to be the diffusion matrix. The (i, j)th entry is the connectivity between data points x_i and x_j. In the parallel study of a random walk, we would say that each entry represents the probability of jumping from point i to point j in one step. We can take this matrix to the power t and note that we now have entries equal to the probability of going from point i to point j in a random walk of t steps. For example, with 3 points and two steps,

\begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix}^2 = \begin{pmatrix} p_{11}^2 + p_{12}p_{21} + p_{13}p_{31} & p_{11}p_{12} + p_{12}p_{22} + p_{13}p_{32} & p_{11}p_{13} + p_{12}p_{23} + p_{13}p_{33} \\ p_{21}p_{11} + p_{22}p_{21} + p_{23}p_{31} & p_{21}p_{12} + p_{22}^2 + p_{23}p_{32} & p_{21}p_{13} + p_{22}p_{23} + p_{23}p_{33} \\ p_{31}p_{11} + p_{32}p_{21} + p_{33}p_{31} & p_{31}p_{12} + p_{32}p_{22} + p_{33}p_{32} & p_{31}p_{13} + p_{32}p_{23} + p_{33}^2 \end{pmatrix}

In order to find the underlying structure of the dataset, we take P to a power that has to be determined. If we take it to too high a power, we will have so many steps in our random walk that all points will be well connected. If we take too small a power, we may be limiting the number of steps in our random walk to the extent that we do not get a proper appreciation of the underlying geometry of the data. Random walks along the data with lots of small steps will have a much higher probability than those with any big steps, and these small-step paths will follow the underlying structure of the data.

We now define the diffusion distance to be

D_t(x_i, x_j)^2 = \sum_{y \in S} \| p_t(x_i, y) - p_t(x_j, y) \|^2 \quad (4)
= \sum_{m=1}^{n} \| (P^t)_{im} - (P^t)_{jm} \|^2 \quad (5)

where S = \{x_1, \dots, x_n\} and we define p_t(x_i, x_j) = (P^t)_{ij}. We have not yet specified which norm, \|\cdot\|, we are using. It should be noted that this distance does not suffer from the effects of noise, as it sums over all paths between the two points. In the analogy with a random walk, p_t(\cdot, \cdot) represents the probability of going from one point to another in t steps. As paths that are not on the underlying structure of the dataset will have small probabilities, the most influential paths for the diffusion distance will be those on the underlying structure. If points x and y are well connected, the probabilities for paths between x and w, and between y and w, where w is a third point, will be similar.

It may be apparent that calculating diffusion distances is a time-consuming process, and it is therefore best to map the data points into a Euclidean space in which the distance between the mapped points is the same as the diffusion distance between the original points. We call this new Euclidean space the ‘diffusion space’. The map between the space containing the data points and the diffusion space is the diffusion map. The diffusion map will preserve the geometry of the data points, which we assume to be of lower dimension than the space containing the data points. In this case, preservation of geometry is expressed by diffusion distances being equal to the new Euclidean distances, as above. A good approximation to that Euclidean distance can be obtained in a diffusion space of fewer dimensions than the original data space, as we shall see.

Define

Y_i := \begin{pmatrix} p_t(x_i, x_1) \\ \vdots \\ p_t(x_i, x_n) \end{pmatrix} = [(P^t)_i]^T \quad \text{(the ith row of } P^t \text{, transposed)} \quad (6)
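Before continuing with the derivation, equations (1)-(6) can be made concrete with a few lines of NumPy. The function names, the value of α and the use of the plain Euclidean norm in place of the as-yet-unspecified norm are our own choices for illustration:

import numpy as np

def diffusion_matrix(X, alpha=1.0):
    """Gaussian kernel matrix K, degrees d(x_i) and diffusion matrix P = D^{-1} K."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / alpha)        # equation (1)
    d = K.sum(axis=1)                    # equation (2)
    P = K / d[:, None]                   # equation (3); rows of P sum to 1
    return K, d, P

def diffusion_distance(P, i, j, t):
    """Equations (4)-(5), using the plain Euclidean norm on the rows of P^t."""
    Pt = np.linalg.matrix_power(P, t)
    return np.linalg.norm(Pt[i] - Pt[j])

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
K, d, P = diffusion_matrix(X, alpha=0.5)
print(P.sum(axis=1))                     # each row sums to 1
print(diffusion_distance(P, 0, 1, t=3))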
Now, the squared distance between Y_i and Y_j, in any norm, is

\| Y_i - Y_j \|^2 = \sum_{y \in S} \| p_t(x_i, y) - p_t(x_j, y) \|^2 = D_t(x_i, x_j)^2 \quad \text{in that norm} \quad (7)

We now have the distance between the Y_i equal to the diffusion distance, but we have no dimension reduction yet. We must find a way of ignoring the least important dimensions of the diffusion space. We use the following:

Lemma. Suppose K is a symmetric n × n kernel matrix such that K_{ij} = k(x_i, x_j). Then we can use the diagonal matrix D = diag(d(x_1), ..., d(x_n)) to normalise K and produce a diffusion matrix

P = D^{-1} K. \quad (8)

Then the matrix P', defined as

P' = D^{1/2} P D^{-1/2}, \quad (9)

1. is symmetric,
2. has the same eigenvalues as P,
3. has eigenvectors \tilde{x}_k which, when multiplied by D^{-1/2} and D^{1/2}, give the right and left eigenvectors of P, respectively.

Proof.

P' = D^{1/2} P D^{-1/2} = D^{1/2} D^{-1} K D^{-1/2} = D^{-1/2} K D^{-1/2}

K is symmetric, so P' is also symmetric. P' being symmetric implies that we can find Q and V such that

P' = Q V Q^T

where V is diagonal and contains the eigenvalues of P', and the columns of Q are orthonormal eigenvectors of P'. Q is orthogonal, so Q^{-1} = Q^T. Now,

P = D^{-1/2} P' D^{1/2} = D^{-1/2} Q V Q^T D^{1/2} = D^{-1/2} Q V Q^{-1} D^{1/2} = (D^{-1/2} Q) V (D^{-1/2} Q)^{-1} = Z V Z^{-1} \quad (10)

for Z = D^{-1/2} Q. We see that the eigenvalues of P and P' are the same. Also, the right eigenvectors of P are the columns of Z, and the left eigenvectors of P are the rows of Z^{-1} = Q^T D^{1/2}. This implies that the right eigenvectors of P are

\tilde{v}_l = D^{-1/2} \tilde{x}_l \quad (11)

where \tilde{x}_l is an eigenvector of P'. Similarly, for the left eigenvectors of P,

\tilde{u}_l = D^{1/2} \tilde{x}_l. \quad (12)

[17]

Then, by (10), we have

P = \sum_{k=1}^{n} \lambda_k \tilde{v}_k \tilde{u}_k^T \quad (13)

Also,

P^t = (Z V Z^{-1})^t = Z V^t Z^{-1} = \sum_{k=1}^{n} \lambda_k^t \tilde{v}_k \tilde{u}_k^T \quad (14)

We notice that the ith row of P^t can be written as a sum of the left eigenvectors as follows:

(P^t)_i = \sum_{k=1}^{n} \lambda_k^t \tilde{v}_k(i) \tilde{u}_k^T \quad (15)

where \tilde{v}_k(i) is the ith entry of \tilde{v}_k.

The left eigenvectors are a basis for the rows of P^t, so we can represent the rows of P^t as vectors in the coordinate system with the left eigenvectors as the basis. Note that we can write the ith row of P^t, in this new coordinate system, as

L_i = \begin{pmatrix} \lambda_1^t \tilde{v}_1(i) \\ \vdots \\ \lambda_n^t \tilde{v}_n(i) \end{pmatrix} \quad (16)

The issue is that this will almost certainly not be an orthonormal basis. However, the basis of eigenvectors of P' was orthonormal, and we use this fact to obtain an inner product for which the left eigenvectors are an orthonormal basis. For all k \in \{1, \dots, n\},

1 = \tilde{x}_k^T \tilde{x}_k = (D^{-1/2} \tilde{u}_k)^T (D^{-1/2} \tilde{u}_k) = \tilde{u}_k^T D^{-1} \tilde{u}_k \quad \text{by (12)}

and for i \neq j with i, j \in \{1, \dots, n\},

0 = \tilde{x}_i^T \tilde{x}_j = (D^{-1/2} \tilde{u}_i)^T (D^{-1/2} \tilde{u}_j) = \tilde{u}_i^T D^{-1} \tilde{u}_j \quad \text{by (12)}

We see that the left eigenvectors of P are orthonormal under the inner product defined by the matrix D^{-1}. D^{-1} is a diagonal matrix with positive entries, so it is clearly a symmetric, positive definite matrix and the inner product is well defined.
We can calculate the squared Euclidean distance between L_i and L_j for i, j \in \{1, \dots, n\}:

\| L_i - L_j \|_E^2 = \sum_{k=1}^{n} \lambda_k^{2t} (\tilde{v}_k(i) - \tilde{v}_k(j))^2

= \sum_{l=1}^{n} \sum_{k=1}^{n} \tilde{x}_l^T \tilde{x}_k \, \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \, \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \quad \text{as the vectors } \{\tilde{x}_q\}_{q=1}^{n} \text{ are orthonormal}

= \left( \sum_{l=1}^{n} \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \tilde{x}_l^T \right) \left( \sum_{k=1}^{n} \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \tilde{x}_k \right)

= \left( \sum_{l=1}^{n} \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \tilde{x}_l^T \right) D^{1/2} D^{-1} D^{1/2} \left( \sum_{k=1}^{n} \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \tilde{x}_k \right)

= \left( \sum_{l=1}^{n} \lambda_l^t (\tilde{v}_l(i) - \tilde{v}_l(j)) \tilde{x}_l^T D^{1/2} \right) D^{-1} \left( \sum_{k=1}^{n} \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \tilde{x}_k^T D^{1/2} \right)^T

= \left\| \sum_{k=1}^{n} \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \tilde{x}_k^T D^{1/2} \right\|_{D^{-1}}^2

= \left\| \sum_{k=1}^{n} \lambda_k^t (\tilde{v}_k(i) - \tilde{v}_k(j)) \tilde{u}_k^T \right\|_{D^{-1}}^2 \quad \text{by (12)}

= \left\| \sum_{k=1}^{n} \lambda_k^t \tilde{v}_k(i) \tilde{u}_k^T - \sum_{k=1}^{n} \lambda_k^t \tilde{v}_k(j) \tilde{u}_k^T \right\|_{D^{-1}}^2

= \| (P^t)_i - (P^t)_j \|_{D^{-1}}^2 \quad \text{by (15)}

= \| Y_i - Y_j \|_{D^{-1}}^2 \quad \text{by (6)}

= D_t(x_i, x_j)^2 \text{ in the } D^{-1}\text{-weighted norm} \quad \text{by (7)}

This shows us that the diffusion distance in the space containing the data points (measured in the D^{-1}-weighted norm) is the same as the Euclidean distance in the diffusion space. We notice that our calculation of Euclidean distance in the diffusion space is most affected by the dominant eigenvalues. We can therefore reduce to m dimensions by taking only the coordinates of L_i in the dimensions corresponding to the dominant eigenvalues.

3.2.2 Procedure

Given a data set of n points, \{x_i\}_{i=1}^{n}, the basic algorithm for dimensionality reduction by a diffusion map is as follows:

1. Define a kernel k(x, y) with the properties given above. Create a kernel matrix K with K_{ij} = k(x_i, x_j).

2. Find the sum of each of the rows of K and, for the ith row, call this d(x_i). Construct the diagonal matrix D = diag(d(x_1), ..., d(x_n)). Use this to find the diffusion matrix P = D^{-1} K.

3. Decide how many dimensions the diffusion space should have (say m) and find the m most dominant eigenvalues of the diffusion matrix and their corresponding eigenvectors.

4. Map into the lower dimensional diffusion space at time t by using (16), but only including the entries corresponding to the m dominant eigenvalues. These vectors are the new lower dimensional data in the diffusion space, \{y_i\}_{i=1}^{n}.
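The four steps above can be sketched as follows. This is our own minimal implementation: it computes the pairwise distances through the Gram matrix to avoid forming a large intermediate array, exploits the symmetric matrix P' from the Lemma of 3.2.1 for a stable eigen-decomposition, and drops the constant eigenvector belonging to the eigenvalue 1 (an extra choice on our part), so it should be read as a sketch rather than the report's code:

import numpy as np

def diffusion_map(X, m, alpha=1.0, t=1):
    """Map the rows of X into an m-dimensional diffusion space at time t."""
    # Steps 1-2: Gaussian kernel matrix, row sums d(x_i) and P = D^{-1} K.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
    K = np.exp(-sq_dists / alpha)
    d = K.sum(axis=1)
    # Step 3: eigen-decompose the symmetric matrix P' = D^{-1/2} K D^{-1/2} (Lemma, 3.2.1)
    # and recover the right eigenvectors of P via v = D^{-1/2} x, equation (11).
    P_sym = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(P_sym)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs / np.sqrt(d)[:, None]
    # Step 4: diffusion co-ordinates lambda_k^t v_k(i), as in (16); the first eigenvector
    # is constant (eigenvalue 1), so we keep the m co-ordinates after it.
    return (eigvals[1:m + 1] ** t) * V[:, 1:m + 1]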
4 Comparison of PCA and Diffusion Maps

In this section we will compare the performance of PCA and Diffusion Maps on two different data sets, both of which contain 1000 data points in 3 dimensions. The data sets are both generated according to a one-dimensional underlying parameter and perturbed by some noise. To assess the two methods, we will perform dimensionality reduction on the data sets, and then see if the first non-trivial co-ordinate of the reduced data is in one-to-one correspondence with the underlying parameter.

4.1 Linear Dataset

The first data set is generated by the following formula:

\begin{pmatrix} 3 \\ 7 \\ -1 \end{pmatrix} T + \epsilon

where T is the underlying parameter of the data set and is sampled from a Uniform(0, 10) distribution, and \epsilon is sampled from a MVN(0, I_3) distribution and acts as a small noise perturbation. The data is plotted in Figure 5.

Figure 5: Plot of the Linear Data set

Figure 6: Plot of T vs reduced co-ordinates ((a) PCA, (b) Diffusion Map)

Figure 7: Plot of the Linear data set coloured using the PCA and Diffusion Map co-ordinates ((a) PCA, (b) Diffusion Map)

From Figure 6, we see that for both PCA and Diffusion Maps, the parameter T is monotonic in the first non-trivial co-ordinate of the reduced data. Figure 7 shows a plot of the original data with points coloured from red to blue according to the deciles of the first non-trivial co-ordinate of the reduced data.

4.2 Non-Linear Dataset

The second data set was generated by the following formula:

A \begin{pmatrix} \cos(A) \\ \sin(A) \\ \cos(3A) \end{pmatrix} + \frac{\epsilon}{30}

where A is the underlying parameter, sampled from a Uniform(0, 2\pi) distribution, and \epsilon is sampled from a MVN(0, I_3) distribution and again acts as a noise perturbation.
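For reference, data sets of this form can be generated with a few lines of NumPy. The exact noise scaling used for the non-linear set is not fully recoverable from the text, so the 1/30 factor below mirrors our reading of the formula above and should be treated as an assumption:

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Linear data set: points along the direction (3, 7, -1) plus Gaussian noise.
T = rng.uniform(0, 10, size=n)
linear = np.outer(T, [3.0, 7.0, -1.0]) + rng.multivariate_normal(np.zeros(3), np.eye(3), size=n)

# Non-linear data set: a curve parameterised by A, with smaller noise.
A = rng.uniform(0, 2 * np.pi, size=n)
curve = np.column_stack([A * np.cos(A), A * np.sin(A), A * np.cos(3 * A)])
nonlinear = curve + rng.multivariate_normal(np.zeros(3), np.eye(3), size=n) / 30

# The comparison criterion can then be checked by sorting the first reduced
# co-ordinate by the underlying parameter and testing whether it is monotonic,
# using the pca and diffusion_map sketches given earlier.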
Figure 8: Plot of the Non-Linear Data set

We perform the same procedure as with the linear data set.

Figure 9: Plot of A vs reduced co-ordinates ((a) PCA, (b) Diffusion Map)

In Figure 9 we see that the first co-ordinate of PCA is not monotonic in the parameter A, and thus it has not managed to capture the underlying structure of the data. However, the first non-trivial co-ordinate of the diffusion map is monotonic with respect to A, and has therefore done a better job of reducing the dimension of the data.

Figure 10: Plot of the Non-Linear data set coloured using the PCA and Diffusion Map co-ordinates ((a) PCA, (b) Diffusion Map)

In Figure 10 we can see clearly that, when coloured by the Diffusion Map co-ordinate, our plot changes from red to blue monotonically as we travel along the curve. However, with the PCA co-ordinate, the colour changes from purple to blue to red to purple again, showing that PCA has not found the underlying structure of the data. [18] [22][24][25]
5 Clustering and dimensionality reduction applications

5.1 Clustering

5.1.1 Theory

‘Data clustering (or just clustering), also called cluster analysis, segmentation analysis, taxonomy analysis, or unsupervised classification, is a method of creating groups of objects, or clusters, in such a way that objects in one cluster are very similar and objects in different clusters are quite distinct.’ ([19], p.1)

Several methods relying on different notions of similarity are used to cluster the data. Some common clustering methods are:

• Hierarchical Clustering
A hierarchy of clusters is obtained. There are two ways of going about hierarchical clustering: an agglomerative approach and a divisive approach. The agglomerative approach consists of starting with n individual clusters and progressively merging the two most similar clusters at each stage until a single cluster is obtained. The divisive approach involves starting with a single cluster and progressively splitting the clusters until n clusters, each containing a single data point, are obtained. An example of a hierarchical clustering of the data points p, q, r, s, t is shown in Figure 11.

Figure 11: Example of a hierarchical clustering of the data points p, q, r, s, t [20]

• Distribution Clustering
Clusters are created by grouping data which are more likely to be drawn from the same distribution. Different clusters may be described by a distribution with different parameters, or they may be described by altogether different distributions.

• Sums of Squares Clustering
Clusters are created such that the intra-cluster sum of squares is minimized. k-means clustering is an example of sums of squares clustering: it minimizes

\sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2

where c_j is the cluster centre of the cluster C_j. [2]

For this project we will use the k-means clustering method. It is one of the most popular clustering methods as it is fast and easily implemented. k-means clustering can, however, be sensitive to the chosen starting points of the algorithm. It also requires the number of clusters to be defined by the user.

5.1.2 Procedure

The k-means clustering algorithm is as follows (a sketch implementation is given after the figure below):

1. Randomly choose k points as cluster centres.

2. Place each data point in the cluster whose cluster centre it is closest to.

3. Calculate new cluster centres by setting them to the mean of the points in their corresponding cluster.

4. Place each data point again in the cluster whose cluster centre it is closest to.

5. Repeat Steps 2 and 3 until none of the data points change cluster, or until a specified maximum number of iterations is reached. [12]

Figure 12: Visualization of the k-means clustering algorithm [21]. (a) Given these data points; (b) connect all data points to the nearest cluster centre; (c) find the new cluster centres; (d) repeat (b) and (c) until no data points change clusters.
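The sketch promised above: a minimal NumPy version of the k-means procedure. The function name kmeans and the random choice of initial centres from the data are our own choices rather than the report's implementation:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: X is n x p. Returns cluster labels and the k centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # step 5: no point changed cluster
        labels = new_labels
        # Step 3: move each centre to the mean of the points in its cluster.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres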
5.2 Use of Diffusion Maps in Clustering

In this section we will give an example where Diffusion Maps can help cluster non-linear data with the k-means algorithm. We will use the ‘Chainlink’ toy clustering data set [26], which has 1000 data points in 3 dimensions that form the shape of two interlocked rings (shown in Figure 13). We would like our clustering technique to identify each ring as a separate cluster. We will try applying the k-means algorithm to the data set in its original co-ordinates. We will also try computing the diffusion co-ordinates of the data set, applying the k-means algorithm to the data in the diffusion co-ordinates, and comparing the results. Note that in both cases we will use k = 2 as the number of clusters for k-means.

Figure 13: Plot of the Chainlink data set [26]

Figure 14: Plot of the clustered data set. (a) k-means applied directly; (b) k-means applied to Diffusion Map co-ordinates.

We can see in Figure 14(a) that k-means applied directly to the data set fails to identify the two rings as clusters; this is because it uses Euclidean distances. Transforming the data into diffusion co-ordinates allows us to identify each ring as a separate cluster. This occurs because the diffusion distance between two points is small if they are connected by lots of points which are close together, so any two points within the same ring will have a small diffusion distance between them. On the contrary, points in separate rings will have a large diffusion distance, as there is a low probability of ‘jumping’ from one ring to the other in the diffusion process. [22][24]
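In outline, the two experiments can be reproduced with the earlier sketches. This assumes the Chainlink data from [26] has already been loaded into an n × 3 array X, and it reuses the hypothetical diffusion_map and kmeans functions defined above; the kernel width is an illustrative guess:

import numpy as np

# X: the 1000 x 3 Chainlink data set, assumed already loaded.
# Direct k-means on the raw co-ordinates (tends to cut across the rings):
labels_direct, _ = kmeans(X, k=2)

# k-means on the first two diffusion co-ordinates (separates the rings):
coords = diffusion_map(X, m=2, alpha=0.1, t=1)
labels_diffusion, _ = kmeans(coords, k=2)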
5.3 Application to Image Processing

In this section we will be investigating the data set shown in Figure 15.

Figure 15: (a) Our data set; (b) the 2 original images [27]

This data set consists of 2 different images, each rotated 40 times. The images are each 160 × 160 pixels, each pixel with 3 RGB values, so our data set has size n = 80, with p = 76800 (160 × 160 × 3) variables. We would like an algorithm that can separate the data into two clusters, each containing the rotations of one of the original images. We will first try applying k-means directly to the data set, and then try applying k-means to the data set reduced to diffusion co-ordinates (using k = 2 in both cases). We ran both algorithms 1000 times; the total run time and the number of times the algorithms successfully cluster the data set into clusters of the original images are shown in Figure 16. It is clear that in this case applying k-means to the data in diffusion co-ordinates is both faster to run and more successful.

Figure 16:
                      Time (s)    # Successes
  Direct k-means       918.97         114
  Diffusion k-means     18.97         988

Figure 17: Plot of the data set with respect to the first and second Diffusion co-ordinates

In Figure 17 we have reduced the dimension of the data to the first 2 diffusion co-ordinates so that we can plot it and therefore more easily visualise it. The two images are clearly separated into two clusters, as expected from our clustering results above. Now we would like to investigate whether the Diffusion Map has managed to maintain the underlying structure of the data; since the data within each cluster are generated by a one-dimensional parameter (degree of rotation), we can imagine that the data points approximately lie on some curve in R^p space, with degree of rotation monotonically increasing as we travel along the curve. We would like the same to be true of our reduced data set.

Figure 18: Close up of each cluster

We can see clearly in Figure 18 that each cluster forms a curve, and as we travel along each curve, the degree of rotation of the image changes monotonically. Therefore, even though we have drastically reduced the dimension of the data, from 76800 to 2, we have preserved the underlying structure. [22][23]
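A sketch of how such an image data set might be flattened into a data matrix and pushed through the same pipeline is given below. The file names are hypothetical, the rotations are produced with PIL rather than the original image source, and diffusion_map and kmeans are the earlier sketches; the parameter values are illustrative only:

import numpy as np
from PIL import Image

# Hypothetical source images; each is rotated 40 times to build the data set.
paths = ["image_a.jpg", "image_b.jpg"]
rows = []
for path in paths:
    img = Image.open(path).convert("RGB").resize((160, 160))
    for angle in range(0, 360, 9):                              # 40 rotations per image
        rotated = img.rotate(angle)
        rows.append(np.asarray(rotated, dtype=float).ravel())   # 160*160*3 = 76800 values

X = np.stack(rows)                                              # 80 x 76800 data matrix
coords = diffusion_map(X / 255.0, m=2, alpha=10.0, t=1)
labels, _ = kmeans(coords, k=2)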
6 Application to financial data

In this section, we will show some of the things that dimensionality reduction allows you to do in the study of large data sets. We take the data for the S&P 500 from 1st November 2014 to 9th November 2015 [28]. We treat each of the 494 companies as a 259-dimensional data point; 259 is the number of days on which business was done between the above dates. We will use Diffusion Maps to reduce to two dimensions and then use k-means clustering to get an impression of which companies’ stock prices tend to follow similar paths.

Before analysing the data, we normalised all of the data points. We did this to best evaluate companies whose stock prices follow similar paths relative to their means as well as relative to their stock price. This is necessary in order to compare companies with vastly different stock prices. For example, Netflix’s stock price varies from about 100 to 700 over the time period, averaging about 350, whilst Xerox’s stock price varies from 9.5 to 15, averaging 12. First, we found the mean, \bar{X}_i, of every company’s stock prices over the year, and the standard deviation, s_i. For a company’s data, \tilde{X}_i, we performed the following transformation:

\tilde{X}_i \rightarrow \frac{1}{s_i} \left( \tilde{X}_i - \begin{pmatrix} \bar{X}_i \\ \vdots \\ \bar{X}_i \end{pmatrix} \right)
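The normalisation and clustering steps just described can be sketched as follows, assuming the price history has been loaded into a 494 × 259 array prices (one row per company). The array name and parameter values are our own assumptions, and diffusion_map and kmeans are the sketches from earlier sections:

import numpy as np

# prices: 494 x 259 array of daily closing prices, one row per company (assumed loaded).
means = prices.mean(axis=1, keepdims=True)
stds = prices.std(axis=1, ddof=1, keepdims=True)
normalised = (prices - means) / stds            # the transformation above, row by row

coords = diffusion_map(normalised, m=2, alpha=50.0, t=1)
labels, centres = kmeans(coords, k=10)          # k = 10 clusters, as in the report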
Figure 19 is a scatterplot of the first two diffusion coordinates, with the ten large black spots representing the centroids of the clusters. It should be noted that the choice of k = 10 for the k-means algorithm is, in this example, somewhat arbitrary. Anywhere between 6 and 15 clusters is reasonable, and this is probably due to the sparsity of the data with the first diffusion co-ordinate between −2 and 2, as can be seen in the figure.

Figure 19: Scatterplot of first two diffusion co-ordinates and cluster centroids

Figure 20 shows the points in their clusters, where each cluster is a different colour.

Figure 20: Scatterplot of colour clustered points

We had originally surmised that companies in a similar industry should follow similar paths. We found this to be false for all industries other than energy. Figure 21 is a plot of the diffusion coordinates with energy companies highlighted.

Figure 21: Scatterplot with energy companies highlighted

We see that the large majority of the energy companies are in the left-most cluster. We now plot the stock prices of ten of the energy companies in Figure 22 and note that they follow similar paths. We have included the black line, which shows Facebook’s stock prices, for contrast.

Figure 22: Plot of stock prices for ten energy companies and Facebook in black

Next, we will study five companies with varying distances between them in diffusion space. Figure 23 is the scatterplot of the first two diffusion coordinates of the five companies.

Figure 23: Plot of diffusion coordinates for 5 companies

We compare the plots of the companies’ normalised stock prices, as opposed to the full stock prices compared in the example with the energy companies. This is because Google’s stock price dwarfs those of the others to the extent that they all look like straight lines.

Figure 24: Comparing plots for the 5 companies

It is quite clear that Google and Facebook follow the most similar paths. This is exactly what we expected, as they cluster together and have diffusion coordinates that are very close to each other. None of the other companies follow paths that are particularly similar to each other.

One of the most important things that we saw in this data set was the increase in computational efficiency due to diffusion maps. The MATLAB function for direct k-means on the full data set took, on average, about 13 seconds, whilst diffusion maps followed by k-means on the diffusion coordinates took just over a second. It is interesting to note that for this data set we get similar results from PCA, with very similar clustering, as well as similar gains in computational efficiency. [29]
7 Conclusion

Dimensionality reduction is an important tool in data analysis, as it allows us to more easily visualise data, reduce computation time, and apply statistical techniques that may fail in high dimensions. We have investigated linear and non-linear dimensionality reduction techniques, and have outlined the limitations of linear methods. With particular focus on Diffusion Maps, we demonstrated applications in image processing and in cluster analysis, which we applied to real financial data.
References

[1] Lee J.A., Verleysen M. Nonlinear Dimensionality Reduction. Springer Series in Information Science and Statistics. New York: Springer; 2007.

[2] Webb A.R., Copsey K.D. Statistical Pattern Recognition. 3rd Edition. UK: John Wiley & Sons, Ltd; 2011.

[3] Verleysen M., François D. The Curse of Dimensionality in Data Mining and Time Series Prediction. In: Cabestany J., Prieto A., Sandoval F. (eds) Computational Intelligence and Bioinspired Systems. 2005. Volume 3512 of the series Lecture Notes in Computer Science, pp. 758-770.

[4] Shalizi C. Principal Components Analysis. Carnegie Mellon University; 2009. Available from: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf [Accessed 5th June 2016]

[5] Socher R. Manifold Learning and Dimensionality Reduction. Available from: http://www.socher.org/uploads/Main/DiffusionMapsSeminarReport_RichardSocher.pdf [Accessed 2nd June 2016]

[6] Jolliffe I.T., Cadima J. Principal Component Analysis: A Review and Recent Developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016; 374(2065): 20150202. Available from: DOI: 10.1098/rsta.2015.0202.

[7] Smith L.I. A Tutorial on PCA. Available from: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf [Accessed 26th May 2016]

[8] Hlaváč V. Principal Component Analysis: Application to Images. Czech Technical University in Prague. Available from: http://cmp.felk.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf [Accessed 28th May 2016]

[9] Mahvish Nasir. What is PCA (Explained from face recognition point of view). 2013. Available from: https://www.youtube.com/watch?v=g5_tonFnfaQ [Accessed 1st June 2016]

[10] Borg I., Groenen P.J.F. Modern Multidimensional Scaling: Theory and Applications. 2nd Edition. Springer Series in Statistics. New York: Springer-Verlag; 2007. Available from: http://link.springer.com/book/10.1007/0-387-28981-X/page/1 [Accessed 3rd June 2016]

[11] Young F.W. Multidimensional Scaling. Available from: http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html [Accessed 3rd June 2016]

[12] Vathy-Fogarassy A., Abonyi J. Graph-Based Clustering and Data Visualization Algorithms. SpringerBriefs in Computer Science. London: Springer-Verlag; 2013. Available from: http://link.springer.com/book/10.1007/978-1-4471-5158-6/page/1/ [Accessed 7th June 2016]

[13] What are some of the limitations of principal component analysis? Available from: https://www.quora.com/What-are-some-of-the-limitations-of-principal-component-analysis [Accessed 3rd June 2016]

[14] Roweis S.T., Saul L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science. 2000; 290(5500): 2323-2326.

[15] Shalizi C. Non-Linear Dimensionality Reduction I: Local Linear Embedding. Carnegie Mellon University; 2009. Available from: http://www.stat.cmu.edu/~cshalizi/350/lectures/14/lecture-14.pdf [Accessed 3rd June 2016]

[16] De la Porte J., Herbst B.M., Hereman W., Van Der Walt S.J. An Introduction to Diffusion Maps. In: Proceedings of the 19th Symposium of the Pattern Recognition Association of South Africa (PRASA 2008), Cape Town, South Africa, 2008 Nov 26, pp. 15-25.

[17] Coifman R.R., Lafon S. Diffusion Maps. Applied and Computational Harmonic Analysis. 2006; 21(1): 5-30.

[18] Bubacarr B. Diffusion Maps: Analysis and Applications. MSc thesis. University of Oxford; 2008.

[19] Gan G., Ma C., Wu J. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. Philadelphia, PA: SIAM; Alexandria, VA: ASA; 2007.

[20] Front Line Solvers. Hierarchical Clustering. Available from: http://www.solver.com/xlminer/help/hierarchical-clustering-intro [Accessed 9th June 2016]

[21] Shabalin A.A. [Animated gifs.] Available from: http://shabal.in/visuals/kmeans/1.html [Accessed 8th June 2016]

[22] Richards J. (2014). diffusionMap: Diffusion map. R package version 1.1-0. https://CRAN.R-project.org/package=diffusionMap

[23] Urbanek S. (2014). jpeg: Read and write JPEG images. R package version 0.1-8. https://CRAN.R-project.org/package=jpeg

[24] Ligges U., Mächler M. (2003). Scatterplot3d - an R Package for Visualizing Multivariate Data. Journal of Statistical Software 8(11), 1-20.

[25] Venables W.N., Ripley B.D. (2002). Modern Applied Statistics with S. Fourth Edition. New York: Springer. ISBN 0-387-95457-0.

[26] Ultsch A. Clustering with SOM: U*C. In: Proc. Workshop on Self-Organizing Maps, Paris, France, 2005, pp. 75-82.

[27] Mestel A.J. Home Page. Available from: http://wwwf.imperial.ac.uk/~ajm8/ [Accessed 5th June 2016]

[28] Missaoui B. Scorex Fellow in Statistics. Personal Communication. 31st May 2016.

[29] Richards J. Diffusion Map. [MATLAB] Available from: http://www.stat.berkeley.edu/~jwrichar/software.html [Accessed 1st June 2016]