On similar grounds, multidimensional data such as the HSI can be modeled by a multidimensional Gaussian mixture (GM) [1]. A GM, in the form of the PDF for z ∈ R^P, is given by
p(z) = \sum_{i=1}^{L} \alpha_i \, N(z, \mu_i, \Sigma_i)

where

N(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (z - \mu_i)' \Sigma_i^{-1} (z - \mu_i) \right\}.
Here L is the number of mixture components and P the
number of spectral channels (bands). The GM parameters
are denoted by λ = {αi , μi , Σi }. These parameters are
estimated using maximum likelihood (ML) by means of the
expectation-maximization (EM) algorithm.
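For concreteness, a minimal numpy sketch of evaluating this mixture density follows; the function names and the use of a direct matrix inverse are illustrative choices and are not taken from the paper's implementation.

    import numpy as np

    def gaussian_pdf(z, mu, sigma):
        """N(z; mu_i, Sigma_i) for a single P-dimensional sample z."""
        P = mu.size
        diff = z - mu
        quad = diff @ np.linalg.inv(sigma) @ diff
        norm = (2.0 * np.pi) ** (P / 2.0) * np.sqrt(np.linalg.det(sigma))
        return np.exp(-0.5 * quad) / norm

    def gmm_pdf(z, alphas, mus, sigmas):
        """Mixture density p(z) = sum_i alpha_i N(z; mu_i, Sigma_i)."""
        return sum(a * gaussian_pdf(z, m, S) for a, m, S in zip(alphas, mus, sigmas))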
Figure 2. 2D scatter plot of the data using the
first two MNF bands
Figure 1. The scene is a 1995 AVIRIS image of the Cuprite field in Nevada with the training regions overlaid.

Figure 1 shows the data sets used in our experiments, which belong to the 1995 Cuprite field scene in Nevada. The training regions in the HSI data are identified heuristically from mineral maps provided by Clark et al. [2]. The remote sensing data sets used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (bands) with wavelengths from 0.4 to 2.5 μm. AVIRIS is flown across the US, Canada and Europe.

Since HSI imagery is highly correlated in the spectral direction, using the minimum noise fraction (MNF) transform is a natural choice for dimensionality reduction. This transform is also called the noise-adjusted principal component (NAPC) transformation. Like principal component analysis (PCA), this technique is also used to determine the inherent dimensionality of the imagery data. The transformation segregates noise in the data and reduces the computational requirements for subsequent processing [5]. Figure 2 shows the 2D scatter plot of the first two MNF components of the original Cuprite data.

3 Dynamic Component Allocation

In spite of the good mathematical tractability of the GMM, there are challenges in training a GM with a local algorithm like EM. First of all, the true number of mixture components is usually unknown, and not knowing this number is a major learning problem for a mixture classifier using EM [4]. The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can effectively add and remove kernel components to adequately characterize the input data. This methodology also increases the chances of escaping the many local maxima of the likelihood function. The solution to component initialization is based on a greedy EM approach, which begins the GM training with a single component [6]. Components, or modes, are then added sequentially until the likelihood stops increasing or the incrementally computed mixture is almost as good as any mixture of that form. This incremental mixture density function uses a combination of global and local search each time a new kernel component is added to the mixture. We now describe in detail the three operations performed on the GMM: merging, splitting and pruning.
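The outer loop of such an incremental scheme might be organized as sketched below. This is a structural sketch only: fit_em, log_likelihood and add_component are hypothetical caller-supplied routines standing in for the EM fit, the likelihood evaluation and the kernel-insertion (global search) step; they are not part of the paper.

    def greedy_gmm(Z, fit_em, log_likelihood, add_component, max_components=20, tol=1e-4):
        """Grow a GMM one component at a time; keep an addition only if it raises the likelihood."""
        params = fit_em(Z, init=None)                # assume this fits a single Gaussian
        best_ll = log_likelihood(Z, params)
        for _ in range(max_components - 1):
            candidate = fit_em(Z, init=add_component(params, Z))  # insert a kernel, refit locally
            ll = log_likelihood(Z, candidate)
            if ll - best_ll < tol:
                return params                        # gain is negligible: stop growing
            params, best_ll = candidate, ll
        return params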
3.1 Merging of Modes

Merging is one of the processes in this proposed training scheme wherein a single mode is created from two nearly identical ones. The closeness between mixture modes is measured by a metric d. For example, consider two PDFs p_1(x) and p_2(x). Let there be a collection of points near the central peak of p_1(x), represented by x_i ∈ X_1, and another set of points near the central peak of p_2(x), denoted by x_i ∈ X_2. In this case the closeness metric d is given by

d = \log \left( \frac{\sum_{x_i \in X_1} p_1(x_i)}{\sum_{x_i \in X_1} p_2(x_i)} \cdot \frac{\sum_{x_i \in X_2} p_2(x_i)}{\sum_{x_i \in X_2} p_1(x_i)} \right)

Notice that this metric is zero when p_1(x) = p_2(x) and greater than zero for p_1(x) ≠ p_2(x). A pre-determined threshold is set to decide whether two modes are too close to each other. Since we assume that p_1(x) and p_2(x) are just two Gaussian modes, it is easy to know where some good points for X_1 and X_2 are: we choose the means (centers) and then go one standard deviation in each direction along all the principal axes. The principal axes are found by SVD of R, the Cholesky factor of the covariance matrix.

If the two modes are found to be too close, they are merged into a weighted sum of the two modes (weighted by α_1, α_2). The mean for the newly merged mode is

\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}    (5)

Here μ_1 and μ_2 are the means of the components before merging and μ is the resultant mean after merging. The proper way to form a weighted combination of the covariances is not simply a weighted sum of the covariances, which does not take into account the separation of the means; a more careful construction is needed. Consider the Cholesky decomposition of the covariance matrix, Σ = R'R. The rows of √P R can be regarded as samples of P-dimensional vectors whose covariance is Σ, where P is the dimension: the sample covariance is (1/P)(√P)^2 R'R = Σ. Now, given the two modes to merge, we regard √P R_1 and √P R_2 as two populations to be joined. The sample covariance of the collection of rows is the desired covariance. However, this assigns equal weight to the two populations; to weight them by their respective mixing weights, we multiply them by α_1/(α_1 + α_2) and α_2/(α_1 + α_2). Before they can be joined, they must also be shifted so that they are re-referenced to the new central mean.
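A minimal sketch of the merge step is given below. The closeness test follows the metric d above, and the merged covariance is formed here by moment matching (the weighted covariances plus the spread of the two means), which is one way to realize the mean-separation-aware combination described in the text; all function names are illustrative.

    import numpy as np

    def closeness(p1, p2, X1, X2):
        """Closeness metric d: zero when p1 and p2 agree on the probe points, positive otherwise."""
        d = np.log(sum(p1(x) for x in X1) / sum(p2(x) for x in X1))
        d += np.log(sum(p2(x) for x in X2) / sum(p1(x) for x in X2))
        return d

    def merge_modes(alpha1, mu1, sigma1, alpha2, mu2, sigma2):
        """Merge two Gaussian modes into one, weighting by their mixing proportions."""
        alpha = alpha1 + alpha2
        w1, w2 = alpha1 / alpha, alpha2 / alpha
        mu = w1 * mu1 + w2 * mu2                           # eq. (5): weighted mean
        d1 = (mu1 - mu)[:, None]
        d2 = (mu2 - mu)[:, None]
        # Weighted covariances plus the between-mean spread (accounts for mean separation).
        sigma = w1 * sigma1 + w2 * sigma2 + w1 * (d1 @ d1.T) + w2 * (d2 @ d2.T)
        return alpha, mu, sigma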
3.2 Splitting of Components

On the other hand, if the number of components is too low, components are split in order to increase the total number of components. Vlassis et al. [6] define a method to monitor the weighted kurtosis of each mode, which directly determines the number of mixture components. This kurtosis measure is given by

K_i = \sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4 - 3    (4)

where

w_{n,i} = \frac{N(z_n, \mu_i, \Sigma_i)}{\sum_{n=1}^{N} N(z_n, \mu_i, \Sigma_i)}.

Therefore, if |K_i| is too high for any component (mode) i, that mode is split into two. This can be extended to higher dimensions by considering skew in addition to the kurtosis, where each data sample z_n is projected onto the j-th principal axis of Σ_i in turn. Let z_{n,i}^{j} = (z_n − μ_i)' V_i^j, where V_i^j is the j-th column of V obtained from the SVD of Σ_i. Then, for each j:

• K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z_{n,i}^{j} / s_i \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3

• \psi_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z_{n,i}^{j} / s_i \right)^3}{\sum_{n=1}^{N} w_{n,i}}

• m_{i,j} = |K_{i,j}| + |\psi_{i,j}|

where

s_i^2 = \frac{\sum_{n=1}^{N} w_{n,i} \left( z_{n,i}^{j} \right)^2}{\sum_{n=1}^{N} w_{n,i}}.

Now, if m_{i,j} > τ for any j, mode i is split. The split is performed by creating new modes at μ = μ_i + v_{i,j} S_{i,j} and μ = μ_i − v_{i,j} S_{i,j}, where S_{i,j} is the j-th singular value of Σ_i. The same covariance Σ_i is used for each new mode. The decision to split also depends upon the mixing proportion α_i: the split does not take place if α_i is too small. An optional threshold parameter allows control over splitting; a higher threshold makes a split less likely.
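The per-axis split test might look as follows in numpy; the weights w are the w_{n,i} for the mode under test and tau corresponds to the threshold τ above. A flagged mode would then be replaced by two modes at μ_i ± v_{i,j} S_{i,j} with the original covariance, as described in the text. This is a sketch, not the paper's implementation.

    import numpy as np

    def axes_to_split(Z, mu, sigma, w, tau=1.0):
        """Flag principal axes of one mode where |kurtosis| + |skew| exceeds tau.

        Z: (N, P) samples; w: (N,) weights w_{n,i} for this mode; returns the flagged axes j.
        """
        _, S, Vt = np.linalg.svd(sigma)
        flagged = []
        wsum = np.sum(w)
        for j in range(Z.shape[1]):
            zj = (Z - mu) @ Vt[j]                      # projection on the j-th principal axis
            s = np.sqrt(np.sum(w * zj ** 2) / wsum)    # weighted spread s_i along this axis
            kurt = np.sum(w * (zj / s) ** 4) / wsum - 3.0
            skew = np.sum(w * (zj / s) ** 3) / wsum
            if abs(kurt) + abs(skew) > tau:            # m_{i,j} = |K_{i,j}| + |psi_{i,j}|
                flagged.append(j)
        return flagged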
3.3 Pruning of Modes

When the number of components becomes high, components are pruned out as their mixing weights α_i fall. Pruning removes weak modes from the overall mixture. A weak mode is identified by comparing α_i against a threshold; once identified, it is removed and the remaining α_i are re-normalized so that \sum_i α_i = 1. It is equally important that the algorithm does not annihilate many moderately weak modes all at once; this is achieved by using two input threshold values.
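A minimal pruning sketch follows, assuming a single hard threshold and one removal per call (the two-threshold safeguard mentioned above is condensed here); the names are illustrative.

    import numpy as np

    def prune_weakest(alphas, mus, sigmas, kill_thresh=1e-3):
        """Remove at most one weak mode per call, then re-normalize the mixing weights."""
        alphas = np.asarray(alphas, dtype=float)
        weakest = int(np.argmin(alphas))
        if alphas[weakest] >= kill_thresh:
            return alphas, mus, sigmas                 # nothing weak enough to prune
        keep = [i for i in range(len(alphas)) if i != weakest]
        alphas = alphas[keep] / alphas[keep].sum()     # re-normalize so the weights sum to 1
        return alphas, [mus[i] for i in keep], [sigmas[i] for i in keep]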
Finally, once the number of modes settles out, Q stops increasing and convergence is achieved. Hence, the DCA technique for GMM approximation of HSI observations demonstrates that the combination of covariance constraints, mode pruning, merging and splitting can result in a good PDF approximation of the HSI mixture models.

4 Minimum-Message Length Criterion

Let us now consider the second mixture learning technique, based on the minimum message length (MML) criterion. This method is also known as the Figueiredo-Jain algorithm [13]. Applying the MML criterion to mixture models leads to the following objective function:

\Lambda(\lambda, Z) = \frac{V}{2} \sum_{c:\, \alpha_c > 0} \log\left( \frac{N \alpha_c}{12} \right) + \frac{C_{nz}}{2} \log\frac{N}{12} + \frac{C_{nz}(V+1)}{2} - \log L(Z, \lambda)

where N is the number of training points, V is the number of free parameters specifying a component, C_{nz} is the number of components with non-zero weight in the mixture (α_c > 0), λ is the parameter list of the GMM, i.e. {α_1, μ_1, Σ_1, ..., α_C, μ_C, Σ_C}, and the last term, log L(Z, λ), is the log-likelihood of the training data given the distribution parameters λ.

The EM algorithm can be used to minimize the above equation with fixed C_{nz}. This leads to an M-step with the component weight update formula

\alpha_c^{(i+1)} = \frac{\max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,c} \right) - \frac{V}{2} \right\}}{\sum_{j=1}^{C} \max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,j} \right) - \frac{V}{2} \right\}}

This formula contains an explicit rule for annihilating components by setting their weights to zero. The other distribution parameters are updated similarly to the previous method. Figure 3 shows an intermediate step of mixture learning using the MML criterion.

Figure 3. Intermediate learning step of the GMM before achieving convergence using the MML criterion.
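A sketch of this weight update in numpy is shown below; resp stands for the N-by-C responsibility matrix w_{n,c} produced by the E-step, and V is the number of free parameters per component. The zero check is a guard added only for illustration.

    import numpy as np

    def mml_weight_update(resp, V):
        """MML M-step for the mixing weights: components with support below V/2 get weight 0."""
        support = resp.sum(axis=0) - V / 2.0        # sum_n w_{n,c} - V/2 for each component
        clipped = np.maximum(support, 0.0)
        total = clipped.sum()
        if total == 0.0:
            raise ValueError("all components were annihilated")
        return clipped / total                      # new alpha_c, summing to 1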
5 Classification Results

Having described the two learning criteria, we now evaluate their performance using a simple Bayesian classifier. Figures 4 and 5 depict the classification of the three HSI mixture classes based on the two learning criteria; classifications, outliers and misclassifications are shown in these figures for both cases. Table 1 shows the classification performance comparison of the DCA learning method versus MML. The tabulation gives the classification and misclassification rates of the two learning methodologies as a function of the amount of training data utilized.

Figure 4. Classification using the MML learning criterion.

Figure 5. Classification using the DCA learning criterion.
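For completeness, a sketch of the simple Bayesian decision rule used in the evaluation is given below, with each class represented by its learned GMM density (for example gmm_pdf from the earlier sketch bound to that class's parameters). The explicit outlier threshold is an assumption for illustration; the paper does not specify how outliers are flagged.

    import numpy as np

    def classify(z, class_gmms, priors, outlier_thresh=1e-12):
        """Assign a pixel to the class whose GMM gives the largest posterior score.

        class_gmms: list of callables p_k(z) returning class-conditional GMM densities;
        priors: per-class prior probabilities. Returns -1 for pixels treated as outliers.
        """
        scores = np.array([prior * pdf(z) for pdf, prior in zip(class_gmms, priors)])
        if scores.max() < outlier_thresh:
            return -1                                # assumed outlier rule for illustration
        return int(np.argmax(scores))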
Table 1. Classification performance comparison between the DCA and MML learning criteria.

Training Data Utilized   Classification (%)        Miss-Class (# of pixels)
                         MML        DCA            MML       DCA
55 %                     68.7691    69.8148        1087      1230
60 %                     73.5235    74.6569        1054       866
65 %                     72.6667    73.5882         957       857
70 %                     74.54      75.1457         970       752
75 %                     75.1541    74.3254         647       404
80 %                     75.6765    75.5392         606       468
6 Conclusions

In this paper, we evaluated the performance of two PDF mixture learning criteria for reduced-dimensionality material classification of hyperspectral remote sensing data. The results show that the two methods have nearly identical classification performance. The outcome of this paper offers an integration of advanced data analysis and modeling tools to scientists, advancing the state of the practice in the utilization of satellite image data for various types of Earth System Science studies.
References
[1] D. G. Manolakis and G. Shaw, "Detection algorithms for hyperspectral imaging applications," IEEE Signal Processing Magazine, vol. 19, no. 1, January 2002.

[2] R. N. Clark, A. J. Gallagher, and G. A. Swayze, "Material absorption band depth mapping of imaging spectrometer data using a complete band shape least-squares fit with library reference spectra," Proceedings of the Second Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Workshop, JPL Publication 90-54, pp. 176-186, 1990.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: M. Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] A. A. Green, M. Berman, P. Switzer and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65-74, 1988.

[6] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Transactions on Systems, Man and Cybernetics, vol. 29, pp. 393-399, 1999.

[7] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Neural Processing Letters, vol. 15, pp. 77-87, 2002.

[8] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

[9] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. Wiley-Interscience, 2003.

[10] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.

[11] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[13] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, March 2002.

[14] A. K. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988.

[15] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.

[16] L. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Transactions on Information Theory, vol. 28, no. 5, pp. 729-734, 1982.

[17] A. K. Jain, R. Duin and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-38, 2000.