This document describes research applying Fisher Linear Discriminant Analysis (LDA) and K-Nearest Neighbors (K-NN) algorithms to classify speech and music audio clips. It finds that Fisher LDA using single features like mel-frequency cepstral coefficients achieves classification error rates below 5%, outperforming K-NN. While combining multiple features does not improve LDA results, combining the outputs of LDA and K-NN classifiers using majority voting further lowers the error rate to 4.5%, demonstrating the benefit of classifier ensembles for this task.
Application of Fisher LDA to Classify Speech and Music
1. Application of Fisher Linear Discriminant Analysis
to Speech/Music Classification
Enrique Alexandre, Manuel Rosa, Lucas Cuadra, and Roberto Gil-Pita
Departamento de Teoría de la Señal y Comunicaciones
Universidad de Alcalá. 28805 - Alcalá de Henares,
Madrid, Spain
Presented By:
S. Lushanthan
2. Agenda
Objective
Time Frequency Decomposition
Feature Extraction
Classification Algorithms
Data Collection
Results and Discussion
3. Objective
The well-known K-NN algorithm has been widely used in many sound
classification applications. The objective here is to:
“Demonstrate the superior behavior of the Fisher Linear Discriminant
algorithm compared to the K-Nearest-Neighbor algorithm”
Why Speech/Music Classification?
The Fisher LDA classifier has not been tried much in the domain of speech/audio
classification
If this succeeds, it would be a first step towards many music-genre classification
systems
5. Feature Extraction
The literature classifies features into three different classes:
1. Timbre-related
2. Rhythm-related
3. Pitch-related
For simplicity, only timbre-related features are used
A 512-sample window is used, with no overlap between adjacent frames
The time-frequency decomposition is performed using either a Modified
Discrete Cosine Transform (MDCT) or a Discrete Fourier Transform (DFT)
All the features are calculated, and their mean and standard deviation are
computed every 43 frames (1.85 seconds at our sampling rate). Thus, every
43 frames, a two-dimensional vector containing the mean and standard
deviation of each feature is obtained
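The framing scheme above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the short-time-energy example, and the default parameters are assumptions; only the 512-sample non-overlapping window and the (mean, std) grouping every 43 frames come from the slide.

```python
# Sketch of the framing scheme: non-overlapping 512-sample windows, then a
# per-feature (mean, std) pair computed every 43 frames (~1.85 s).
import numpy as np

def frame_signal(x, frame_len=512):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def feature_vectors(x, feature_fn, frame_len=512, group=43):
    """Compute one scalar feature per frame, then (mean, std) per 43-frame group."""
    frames = frame_signal(x, frame_len)
    values = np.array([feature_fn(f) for f in frames])
    n_groups = len(values) // group
    values = values[:n_groups * group].reshape(n_groups, group)
    # Each row is the 2-dimensional (mean, std) vector used for classification.
    return np.stack([values.mean(axis=1), values.std(axis=1)], axis=1)

# Example with short-time energy as the per-frame feature:
rng = np.random.default_rng(0)
signal = rng.standard_normal(512 * 86)            # exactly two groups of 43 frames
vecs = feature_vectors(signal, lambda f: np.mean(f ** 2))
print(vecs.shape)                                 # (2, 2)
```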
6. Feature Descriptions

Feature                        Description
Spectral Centroid              Measure of the brightness of a sound
Spectral Roll-off              Shape of the spectrum
Zero Crossing Rate (ZCR)       How noisy a signal is
High Zero Crossing Rate Ratio  Number of frames whose ZCR is 1.5x above
                               the mean ZCR
Short-Time Energy (STE)        Mean energy of the signal within each
                               analysis frame
Low Short-Time Energy Ratio    Ratio of frames whose STE is 0.5x below
                               the mean STE
Mel-frequency Cepstral         Provide a compact representation of the
Coefficients (MFCC)            spectral envelope
Voice2White                    Measure of the energy inside the typical
                               speech band (300-4000 Hz) with respect to
                               the total energy of the signal
Activity Level                 Calculated using a method for the objective
                               measurement of active speech level
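Two of the simpler features above can be sketched in a few lines. This is an illustrative implementation under assumed names and a assumed 22050 Hz sampling rate; the slide does not give the exact formulas used by the authors.

```python
# Minimal sketches of two features from the table: spectral centroid
# (brightness) and zero-crossing rate (noisiness), on one analysis frame.
import numpy as np

def spectral_centroid(frame, fs=22050):
    """Magnitude-weighted mean frequency of the frame's DFT."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])

# A higher-pitched tone is "brighter" (higher centroid) and crosses zero
# more often (higher ZCR):
fs = 22050
t = np.arange(512) / fs
low, high = np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 5000 * t)
print(spectral_centroid(low, fs) < spectral_centroid(high, fs))   # True
print(zero_crossing_rate(low) < zero_crossing_rate(high))         # True
```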
7. Classification Algorithms
K-Nearest Neighbor
Classification Rule
Assume that we have a training set with L vectors grouped into C different classes. To
obtain the class corresponding to a new observed vector X, the algorithm simply
looks for the K nearest neighbors to the test vector X and weights the classes
they belong to, usually using a majority rule.
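The rule above can be sketched directly. This is a minimal illustration with Euclidean distance and a plain majority vote (both common choices, assumed here since the slide does not specify them).

```python
# Minimal K-NN classifier: find the K training vectors nearest to x and
# take a majority vote over their class labels.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Return the majority class among the k nearest training vectors to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the K neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D example with two classes:
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.95, 1.0]), k=3))  # 1
```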
8. Fisher LDA
Data are projected onto a line, and the classification is performed in this one-
dimensional space
The class separability function in a direction w ∈ R^n is defined as:
J(w) = (w^T S_B w) / (w^T S_W w)
where S_B and S_W are the between-class and within-class scatter matrices,
respectively
The goal is to find an analytic expression for the w that maximizes J(w); for
two classes this is w = S_W^{-1} (μ1 − μ2)
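A two-class Fisher LDA fit can be sketched as below. The projection direction w = S_W⁻¹(μ1 − μ0) follows the standard closed form; the midpoint threshold on the projected means is a common simple choice assumed here, not something the slide specifies.

```python
# Two-class Fisher LDA sketch: project onto w = Sw^{-1}(mu1 - mu0), then
# threshold the 1-D projection at the midpoint of the projected class means.
import numpy as np

def fisher_lda_fit(X0, X1):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    threshold = 0.5 * ((X0 @ w).mean() + (X1 @ w).mean())
    return w, threshold

def fisher_lda_predict(X, w, threshold):
    return (X @ w > threshold).astype(int)

# Two well-separated Gaussian classes in 2-D:
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 0.5, size=(100, 2))       # class 0 around the origin
X1 = rng.normal(2.0, 0.5, size=(100, 2))       # class 1 around (2, 2)
w, t = fisher_lda_fit(X0, X1)
acc = np.mean(fisher_lda_predict(np.vstack([X0, X1]), w, t) ==
              np.r_[np.zeros(100), np.ones(100)])
print(acc > 0.9)   # well-separated classes, so accuracy is near-perfect
```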
9. Data Collection
Corpus for speech/music classification provided by Dan Ellis originally recorded by
Eric Scheirer during his internship at Interval Research Corporation
10. “Music-Speech” Corpus

Training data (15 seconds per file, 45 minutes in total):
  music (60 files)
  speech (60 files)
  music + speech (60 files)

Test data (2.5 seconds per file, 15.25 minutes in total):
  speech without background music (120 files)
  music with no vocals (126 files)
  music with vocals (120 files)
11. Results and Discussion
Probability of error for Fisher LDA, 1-NN, and 3-NN with each feature individually:

Feature           Fisher    1-NN     3-NN
Centroid (MDCT)    8.74%   17.48%   21.85%
Centroid (DFT)    16.66%   29.23%   30.60%
Roll-off (MDCT)   14.48%   25.40%   21.85%
Roll-off (DFT)     8.19%   13.11%   13.11%
ZCR                9.83%   19.67%   18.03%
HZCRR             25.13%   39.89%   36.33%
STE               48.63%   22.40%   22.67%
LSTER             11.74%   33.87%   23.77%
MFCC               4.09%   22.13%   26.50%
Voice2White        4.91%    6.28%    6.01%
Activity level    12.84%   18.03%   18.85%
Combining two or more of these features does not seem to improve the results.
For example, using the MFCC and Voice2White features together with a Fisher
linear discriminant classifier leads to a probability of error of 4.09%, the
same as with MFCC alone.
12. Confusion matrices using the Voice2White feature
(rows: actual class, columns: predicted class)

Classifier          Speech   Music
Fisher    Speech      104      16
          Music         2     244
1-NN      Speech      114       6
          Music        17     229
3-NN      Speech      116       4
          Music        18     228
Fisher LDA has a high probability of error when the input is speech, while
K-NN has a high probability of error when the input is music.
So why not combine the classifiers using a majority rule for better results?
The probability of error then drops to 4.5%
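The combination step can be sketched as a plain majority vote over the individual classifiers' outputs. This is an illustrative sketch; the slide does not detail the exact combination scheme beyond "majority rule", and ties are broken arbitrarily here.

```python
# Majority-rule combination: each classifier votes for a class label, and
# the most common label wins.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of class labels, one per classifier."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. Fisher LDA says "music" but 1-NN and 3-NN both say "speech":
print(majority_vote(["music", "speech", "speech"]))  # speech
```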
13. Conclusion
Fisher linear discriminant analysis can provide very promising results using
only one feature for the classification
Better results may be obtained by combining the outputs of two or more
classifiers