On similar grounds, multidimensional data such as the HSI can be modeled by a multidimensional Gaussian mixture (GM) [1]. A GM, in the form of the PDF for z ∈ R^P, is given by
p(z) = \sum_{i=1}^{L} \alpha_i \, N(z, \mu_i, \Sigma_i)

where

N(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (z - \mu_i)' \Sigma_i^{-1} (z - \mu_i) \right\}.
Here L is the number of mixture components and P the
number of spectral channels (bands). The GM parameters
are denoted by λ = {αi , μi , Σi }. These parameters are
estimated using maximum likelihood (ML) by means of the
expectation-maximization (EM) algorithm.
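For concreteness, a minimal numpy sketch of evaluating this mixture density follows; the function names and the use of a direct matrix inverse are illustrative choices and are not taken from the paper's implementation.

    import numpy as np

    def gaussian_pdf(z, mu, sigma):
        """N(z; mu_i, Sigma_i) for a single P-dimensional sample z."""
        P = mu.size
        diff = z - mu
        quad = diff @ np.linalg.inv(sigma) @ diff
        norm = (2.0 * np.pi) ** (P / 2.0) * np.sqrt(np.linalg.det(sigma))
        return np.exp(-0.5 * quad) / norm

    def gmm_pdf(z, alphas, mus, sigmas):
        """Mixture density p(z) = sum_i alpha_i N(z; mu_i, Sigma_i)."""
        return sum(a * gaussian_pdf(z, m, S) for a, m, S in zip(alphas, mus, sigmas))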
Figure 2. 2D scatter plot of the data using the
first two MNF bands
Figure 1. The scene is a 1995 AVIRIS image of the Cuprite field in Nevada with the training regions overlaid.

Figure 1 shows the data sets used in our experiments, which belong to the 1995 Cuprite field scene in Nevada. The training regions in the HSI data are identified heuristically from mineral maps provided by Clark et al. [2]. The remote sensing data sets used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (bands) with wavelengths from 0.4 to 2.5 μm. AVIRIS is flown across the US, Canada and Europe.

Since HSI imagery is highly correlated in the spectral direction, using the minimum noise fraction (MNF) transform is a natural choice for dimensionality reduction. This transform is also called the noise-adjusted principal component (NAPC) transformation. Like principal component analysis (PCA), this technique is also used to determine the inherent dimensionality of the imagery data. The transformation segregates noise in the data and reduces the computational requirements for subsequent processing [5]. Figure 2 shows the 2D scatter plot of the first two MNF components of the original Cuprite data.

3 Dynamic Component Allocation

In spite of the good mathematical tractability of the GMM, there are challenges in training a GM with a local algorithm like EM. First of all, the true number of mixture components is usually unknown, and not knowing this number is a major learning problem for a mixture classifier using EM [4]. The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can effectively add and remove kernel components to adequately characterize the input data. This methodology also increases the chances of escaping the many local maxima of the likelihood function. The solution to component initialization is based on a greedy EM approach, which begins the GM training with a single component [6]. Components, or modes, are then added sequentially until the likelihood stops increasing or the incrementally computed mixture is almost as good as any mixture of that form. This incremental mixture density function uses a combination of global and local search each time a new kernel component is added to the mixture. We now describe in detail the three operations performed on the GMM: merging, splitting and pruning.
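The outer loop of such an incremental scheme might be organized as sketched below. This is a structural sketch only: fit_em, log_likelihood and add_component are hypothetical caller-supplied routines standing in for the EM fit, the likelihood evaluation and the kernel-insertion (global search) step; they are not part of the paper.

    def greedy_gmm(Z, fit_em, log_likelihood, add_component, max_components=20, tol=1e-4):
        """Grow a GMM one component at a time; keep an addition only if it raises the likelihood."""
        params = fit_em(Z, init=None)                # assume this fits a single Gaussian
        best_ll = log_likelihood(Z, params)
        for _ in range(max_components - 1):
            candidate = fit_em(Z, init=add_component(params, Z))  # insert a kernel, refit locally
            ll = log_likelihood(Z, candidate)
            if ll - best_ll < tol:
                return params                        # gain is negligible: stop growing
            params, best_ll = candidate, ll
        return params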
3.1 Merging of Modes

Merging is one of the processes in this proposed training scheme wherein a single mode is created from two nearly identical ones. The closeness between mixture modes is measured by a metric d. For example, consider two PDFs p_1(x) and p_2(x). Let there be a collection of points near the central peak of p_1(x), represented by x_i ∈ X_1, and another set of points near the central peak of p_2(x), denoted by x_i ∈ X_2. In this case the closeness metric d is given by

d = \log \left( \frac{\sum_{x_i \in X_1} p_1(x_i)}{\sum_{x_i \in X_1} p_2(x_i)} \cdot \frac{\sum_{x_i \in X_2} p_2(x_i)}{\sum_{x_i \in X_2} p_1(x_i)} \right)

Notice that this metric is zero when p_1(x) = p_2(x) and greater than zero for p_1(x) ≠ p_2(x). A pre-determined threshold is set to decide whether two modes are too close to each other. Since we assume that p_1(x) and p_2(x) are just two Gaussian modes, it is easy to know where some good points for X_1 and X_2 are: we choose the means (centers) and then go one standard deviation in each direction along all the principal axes. The principal axes are found by SVD of R, the Cholesky factor of the covariance matrix.

If the two modes are found to be too close, they are merged into a weighted sum of the two modes (weighted by α_1, α_2). The mean for the newly merged mode is

\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}    (5)

Here μ_1 and μ_2 are the means of the components before merging and μ is the resultant mean after merging. The proper way to form a weighted combination of the covariances is not simply a weighted sum of the covariances, which does not take into account the separation of the means; a more careful construction is needed. Consider the Cholesky decomposition of the covariance matrix, Σ = R'R. The rows of √P R can be regarded as samples of P-dimensional vectors whose covariance is Σ, where P is the dimension: the sample covariance is (1/P)(√P)^2 R'R = Σ. Now, given the two modes to merge, we regard √P R_1 and √P R_2 as two populations to be joined. The sample covariance of the collection of rows is the desired covariance. However, this assigns equal weight to the two populations; to weight them by their respective mixing weights, we multiply them by α_1/(α_1 + α_2) and α_2/(α_1 + α_2). Before they can be joined, they must also be shifted so that they are re-referenced to the new central mean.
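A minimal sketch of the merge step is given below. The closeness test follows the metric d above, and the merged covariance is formed here by moment matching (the weighted covariances plus the spread of the two means), which is one way to realize the mean-separation-aware combination described in the text; all function names are illustrative.

    import numpy as np

    def closeness(p1, p2, X1, X2):
        """Closeness metric d: zero when p1 and p2 agree on the probe points, positive otherwise."""
        d = np.log(sum(p1(x) for x in X1) / sum(p2(x) for x in X1))
        d += np.log(sum(p2(x) for x in X2) / sum(p1(x) for x in X2))
        return d

    def merge_modes(alpha1, mu1, sigma1, alpha2, mu2, sigma2):
        """Merge two Gaussian modes into one, weighting by their mixing proportions."""
        alpha = alpha1 + alpha2
        w1, w2 = alpha1 / alpha, alpha2 / alpha
        mu = w1 * mu1 + w2 * mu2                           # eq. (5): weighted mean
        d1 = (mu1 - mu)[:, None]
        d2 = (mu2 - mu)[:, None]
        # Weighted covariances plus the between-mean spread (accounts for mean separation).
        sigma = w1 * sigma1 + w2 * sigma2 + w1 * (d1 @ d1.T) + w2 * (d2 @ d2.T)
        return alpha, mu, sigma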
3.2 Splitting of Components

On the other hand, if the number of components is too low, components are split in order to increase the total number of components. Vlassis et al. [6] define a method to monitor the weighted kurtosis of each mode, which directly determines the number of mixture components. This kurtosis measure is given by

K_i = \sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4 - 3    (4)

where

w_{n,i} = \frac{N(z_n, \mu_i, \Sigma_i)}{\sum_{n=1}^{N} N(z_n, \mu_i, \Sigma_i)}.

Therefore, if |K_i| is too high for any component (mode) i, that mode is split into two. This can be extended to higher dimensions by considering skew in addition to the kurtosis, where each data sample z_n is projected onto the j-th principal axis of Σ_i in turn. Let z_{n,i}^{j} = (z_n − μ_i)' V_i^j, where V_i^j is the j-th column of V obtained from the SVD of Σ_i. Then, for each j:

• K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z_{n,i}^{j} / s_i \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3

• \psi_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z_{n,i}^{j} / s_i \right)^3}{\sum_{n=1}^{N} w_{n,i}}

• m_{i,j} = |K_{i,j}| + |\psi_{i,j}|

where

s_i^2 = \frac{\sum_{n=1}^{N} w_{n,i} \left( z_{n,i}^{j} \right)^2}{\sum_{n=1}^{N} w_{n,i}}.

Now, if m_{i,j} > τ for any j, mode i is split. The split is performed by creating new modes at μ = μ_i + v_{i,j} S_{i,j} and μ = μ_i − v_{i,j} S_{i,j}, where S_{i,j} is the j-th singular value of Σ_i. The same covariance Σ_i is used for each new mode. The decision to split also depends upon the mixing proportion α_i: the split does not take place if α_i is too small. An optional threshold parameter allows control over splitting; a higher threshold makes a split less likely.
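The per-axis split test might look as follows in numpy; the weights w are the w_{n,i} for the mode under test and tau corresponds to the threshold τ above. A flagged mode would then be replaced by two modes at μ_i ± v_{i,j} S_{i,j} with the original covariance, as described in the text. This is a sketch, not the paper's implementation.

    import numpy as np

    def axes_to_split(Z, mu, sigma, w, tau=1.0):
        """Flag principal axes of one mode where |kurtosis| + |skew| exceeds tau.

        Z: (N, P) samples; w: (N,) weights w_{n,i} for this mode; returns the flagged axes j.
        """
        _, S, Vt = np.linalg.svd(sigma)
        flagged = []
        wsum = np.sum(w)
        for j in range(Z.shape[1]):
            zj = (Z - mu) @ Vt[j]                      # projection on the j-th principal axis
            s = np.sqrt(np.sum(w * zj ** 2) / wsum)    # weighted spread s_i along this axis
            kurt = np.sum(w * (zj / s) ** 4) / wsum - 3.0
            skew = np.sum(w * (zj / s) ** 3) / wsum
            if abs(kurt) + abs(skew) > tau:            # m_{i,j} = |K_{i,j}| + |psi_{i,j}|
                flagged.append(j)
        return flagged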
3.3 Pruning of Modes

When the number of components becomes high, components are pruned out as their mixing weights α_i fall. Pruning removes weak modes from the overall mixture. A weak mode is identified by comparing α_i against a threshold; once identified, it is removed and the remaining α_i are re-normalized so that \sum_i α_i = 1. It is equally important that the algorithm does not annihilate many moderately weak modes all at once; this is achieved by using two input threshold values.
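A minimal pruning sketch follows, assuming a single hard threshold and one removal per call (the two-threshold safeguard mentioned above is condensed here); the names are illustrative.

    import numpy as np

    def prune_weakest(alphas, mus, sigmas, kill_thresh=1e-3):
        """Remove at most one weak mode per call, then re-normalize the mixing weights."""
        alphas = np.asarray(alphas, dtype=float)
        weakest = int(np.argmin(alphas))
        if alphas[weakest] >= kill_thresh:
            return alphas, mus, sigmas                 # nothing weak enough to prune
        keep = [i for i in range(len(alphas)) if i != weakest]
        alphas = alphas[keep] / alphas[keep].sum()     # re-normalize so the weights sum to 1
        return alphas, [mus[i] for i in keep], [sigmas[i] for i in keep]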
Finally, once the number of modes settles out, Q stops increasing and convergence is achieved. Hence, the DCA technique for GMM approximation of HSI observations demonstrates that the combination of covariance constraints, mode pruning, merging and splitting can result in a good PDF approximation of the HSI mixture models.

4 Minimum-Message Length Criterion

Let us now consider the second mixture learning technique, based on the minimum message length (MML) criterion. This method is also known as the Figueiredo-Jain algorithm [13]. Applying the MML criterion to mixture models leads to the following objective function:

\Lambda(\lambda, Z) = \frac{V}{2} \sum_{c:\, \alpha_c > 0} \log\left( \frac{N \alpha_c}{12} \right) + \frac{C_{nz}}{2} \log\frac{N}{12} + \frac{C_{nz}(V+1)}{2} - \log L(Z, \lambda)

where N is the number of training points, V is the number of free parameters specifying a component, C_{nz} is the number of components with non-zero weight in the mixture (α_c > 0), λ is the parameter list of the GMM, i.e. {α_1, μ_1, Σ_1, ..., α_C, μ_C, Σ_C}, and the last term, log L(Z, λ), is the log-likelihood of the training data given the distribution parameters λ.

The EM algorithm can be used to minimize the above equation with fixed C_{nz}. This leads to an M-step with the component weight update formula

\alpha_c^{(i+1)} = \frac{\max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,c} \right) - \frac{V}{2} \right\}}{\sum_{j=1}^{C} \max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,j} \right) - \frac{V}{2} \right\}}

This formula contains an explicit rule for annihilating components by setting their weights to zero. The other distribution parameters are updated similarly to the previous method. Figure 3 shows an intermediate step of mixture learning using the MML criterion.

Figure 3. Intermediate learning step of the GMM before achieving convergence using the MML criterion.
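A sketch of this weight update in numpy is shown below; resp stands for the N-by-C responsibility matrix w_{n,c} produced by the E-step, and V is the number of free parameters per component. The zero check is a guard added only for illustration.

    import numpy as np

    def mml_weight_update(resp, V):
        """MML M-step for the mixing weights: components with support below V/2 get weight 0."""
        support = resp.sum(axis=0) - V / 2.0        # sum_n w_{n,c} - V/2 for each component
        clipped = np.maximum(support, 0.0)
        total = clipped.sum()
        if total == 0.0:
            raise ValueError("all components were annihilated")
        return clipped / total                      # new alpha_c, summing to 1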
5 Classification Results

Having described the two learning criteria, we now evaluate their performance using a simple Bayesian classifier. Figures 4 and 5 depict the classification of the three HSI mixture classes based on the two learning criteria; classifications, outliers and misclassifications are shown in these figures for both cases. Table 1 shows the classification performance comparison of the DCA learning method versus MML. The tabulation gives the classification and misclassification rates of the two learning methodologies as a function of the amount of training data utilized.

Figure 4. Classification using the MML learning criterion.

Figure 5. Classification using the DCA learning criterion.
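For completeness, a sketch of the simple Bayesian decision rule used in the evaluation is given below, with each class represented by its learned GMM density (for example gmm_pdf from the earlier sketch bound to that class's parameters). The explicit outlier threshold is an assumption for illustration; the paper does not specify how outliers are flagged.

    import numpy as np

    def classify(z, class_gmms, priors, outlier_thresh=1e-12):
        """Assign a pixel to the class whose GMM gives the largest posterior score.

        class_gmms: list of callables p_k(z) returning class-conditional GMM densities;
        priors: per-class prior probabilities. Returns -1 for pixels treated as outliers.
        """
        scores = np.array([prior * pdf(z) for pdf, prior in zip(class_gmms, priors)])
        if scores.max() < outlier_thresh:
            return -1                                # assumed outlier rule for illustration
        return int(np.argmax(scores))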
Table 1. Classification performance comparison between the DCA and MML learning criteria.

Training Data Utilized   Classification (%)        Miss-Class (# of pixels)
                         MML        DCA            MML       DCA
55 %                     68.7691    69.8148        1087      1230
60 %                     73.5235    74.6569        1054       866
65 %                     72.6667    73.5882         957       857
70 %                     74.54      75.1457         970       752
75 %                     75.1541    74.3254         647       404
80 %                     75.6765    75.5392         606       468
6 Conclusions

In this paper, we evaluated the performance of two PDF mixture learning criteria for reduced-dimensionality material classification of hyperspectral remote sensing data. The results show that the two methods have nearly identical classification performance. The outcome of this paper offers an integration of advanced data analysis and modeling tools to scientists, advancing the state of the practice in the utilization of satellite image data for various types of Earth System Science studies.
References
[1] D. G. Manolakis and G. Shaw, "Detection algorithms for hyperspectral imaging applications," IEEE Signal Processing Magazine, vol. 19, no. 1, January 2002.

[2] R. N. Clark, A. J. Gallagher, and G. A. Swayze, "Material absorption band depth mapping of imaging spectrometer data using a complete band shape least-squares fit with library reference spectra," Proceedings of the Second Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Workshop, JPL Publication 90-54, pp. 176-186, 1990.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: M. Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] A. A. Green, M. Berman, P. Switzer and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65-74, 1988.

[6] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Transactions on Systems, Man and Cybernetics, vol. 29, pp. 393-399, 1999.

[7] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Neural Processing Letters, vol. 15, pp. 77-87, 2002.

[8] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

[9] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. Wiley-Interscience, 2003.

[10] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.

[11] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[13] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, March 2002.

[14] A. K. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988.

[15] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.

[16] L. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Transactions on Information Theory, vol. 28, no. 5, pp. 729-734, 1982.

[17] A. K. Jain, R. Duin and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-38, 2000.