This document describes research applying Fisher Linear Discriminant Analysis (LDA) and K-Nearest Neighbors (K-NN) algorithms to classify speech and music audio clips. It finds that Fisher LDA using single features like mel-frequency cepstral coefficients achieves classification error rates below 5%, outperforming K-NN. While combining multiple features does not improve LDA results, combining the outputs of LDA and K-NN classifiers using majority voting further lowers the error rate to 4.5%, demonstrating the benefit of classifier ensembles for this task.
Application of Fisher LDA to Classify Speech and Music
1. Application of Fisher Linear Discriminant Analysis
to Speech/Music Classification
Enrique Alexandre, Manuel Rosa, Lucas Cuadra, and Roberto Gil-Pita
Departamento de Teoría de la Señal y Comunicaciones
Universidad de Alcalá. 28805 - Alcalá de Henares,
Madrid, Spain
Presented By:
S. Lushanthan
2. Agenda
Objective
Time Frequency Decomposition
Feature Extraction
Classification Algorithms
Data Collection
Results and Discussion
3. Objective
The well-known K-NN algorithm has been widely used in many sound
classification applications. The objective here is to:
“Demonstrate the superior behavior of the Fisher Linear Discriminant
algorithm compared to the K-Nearest-Neighbor algorithm”
Why Speech/Music Classification?
The Fisher LDA classifier has not been tried much in the domain of speech/audio
classification
If this succeeds, it would be a first step towards many music-genre classification
systems
5. Feature Extraction
The literature classifies features into three different classes:
1. Timbre-related
2. Rhythm-related
3. Pitch-related
For simplicity, only timbre-related features are used
A 512-sample window is used, with no overlap between adjacent frames
The time-frequency decomposition is performed using either a Modified
Discrete Cosine Transform (MDCT) or a Discrete Fourier Transform (DFT)
All the features are calculated, and their mean and standard deviation are
computed every 43 frames (1.85 seconds at our sampling rate). Thus, every
43 frames, a two-dimensional vector containing the mean and standard
deviation of each feature is obtained
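The framing scheme above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the short-time-energy example, and the default parameters are assumptions; only the 512-sample non-overlapping window and the (mean, std) grouping every 43 frames come from the slide.

```python
# Sketch of the framing scheme: non-overlapping 512-sample windows, then a
# per-feature (mean, std) pair computed every 43 frames (~1.85 s).
import numpy as np

def frame_signal(x, frame_len=512):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def feature_vectors(x, feature_fn, frame_len=512, group=43):
    """Compute one scalar feature per frame, then (mean, std) per 43-frame group."""
    frames = frame_signal(x, frame_len)
    values = np.array([feature_fn(f) for f in frames])
    n_groups = len(values) // group
    values = values[:n_groups * group].reshape(n_groups, group)
    # Each row is the 2-dimensional (mean, std) vector used for classification.
    return np.stack([values.mean(axis=1), values.std(axis=1)], axis=1)

# Example with short-time energy as the per-frame feature:
rng = np.random.default_rng(0)
signal = rng.standard_normal(512 * 86)            # exactly two groups of 43 frames
vecs = feature_vectors(signal, lambda f: np.mean(f ** 2))
print(vecs.shape)                                 # (2, 2)
```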
6. Feature Descriptions

Feature                        Description
Spectral Centroid              Measure of the brightness of a sound
Spectral Roll-off              Shape of the spectrum
Zero Crossing Rate (ZCR)       How noisy a signal is
High Zero Crossing Rate Ratio  Number of frames whose ZCR is 1.5x above
                               the mean ZCR
Short-Time Energy (STE)        Mean energy of the signal within each
                               analysis frame
Low Short-Time Energy Ratio    Ratio of frames whose STE is 0.5x below
                               the mean STE
Mel-frequency Cepstral         Provide a compact representation of the
Coefficients (MFCC)            spectral envelope
Voice2White                    Measure of the energy inside the typical
                               speech band (300-4000 Hz) with respect to
                               the total energy of the signal
Activity Level                 Calculated using a method for the objective
                               measurement of active speech level
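Two of the simpler features above can be sketched in a few lines. This is an illustrative implementation under assumed names and a assumed 22050 Hz sampling rate; the slide does not give the exact formulas used by the authors.

```python
# Minimal sketches of two features from the table: spectral centroid
# (brightness) and zero-crossing rate (noisiness), on one analysis frame.
import numpy as np

def spectral_centroid(frame, fs=22050):
    """Magnitude-weighted mean frequency of the frame's DFT."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])

# A higher-pitched tone is "brighter" (higher centroid) and crosses zero
# more often (higher ZCR):
fs = 22050
t = np.arange(512) / fs
low, high = np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 5000 * t)
print(spectral_centroid(low, fs) < spectral_centroid(high, fs))   # True
print(zero_crossing_rate(low) < zero_crossing_rate(high))         # True
```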
7. Classification Algorithms
K-Nearest Neighbor
Classification Rule
Assume that we have a training set with L vectors grouped into C different classes. To
obtain the class corresponding to a new observed vector X, the algorithm simply
looks for the K nearest neighbors to the test vector X and weights the classes
they belong to, usually using a majority rule.
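The rule above can be sketched directly. This is a minimal illustration with Euclidean distance and a plain majority vote (both common choices, assumed here since the slide does not specify them).

```python
# Minimal K-NN classifier: find the K training vectors nearest to x and
# take a majority vote over their class labels.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Return the majority class among the k nearest training vectors to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the K neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D example with two classes:
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.95, 1.0]), k=3))  # 1
```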
8. Fisher LDA
Data are projected onto a line, and the classification is performed in this one-
dimensional space
The class separability function in a direction w ∈ R^n is defined as:
J(w) = (w^T S_B w) / (w^T S_W w)
where S_B and S_W are the between-class and within-class scatter matrices,
respectively
The goal is to find an analytic expression for the w that maximizes J(w); for
two classes this is w = S_W^{-1} (μ1 − μ2)
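A two-class Fisher LDA fit can be sketched as below. The projection direction w = S_W⁻¹(μ1 − μ0) follows the standard closed form; the midpoint threshold on the projected means is a common simple choice assumed here, not something the slide specifies.

```python
# Two-class Fisher LDA sketch: project onto w = Sw^{-1}(mu1 - mu0), then
# threshold the 1-D projection at the midpoint of the projected class means.
import numpy as np

def fisher_lda_fit(X0, X1):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    threshold = 0.5 * ((X0 @ w).mean() + (X1 @ w).mean())
    return w, threshold

def fisher_lda_predict(X, w, threshold):
    return (X @ w > threshold).astype(int)

# Two well-separated Gaussian classes in 2-D:
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 0.5, size=(100, 2))       # class 0 around the origin
X1 = rng.normal(2.0, 0.5, size=(100, 2))       # class 1 around (2, 2)
w, t = fisher_lda_fit(X0, X1)
acc = np.mean(fisher_lda_predict(np.vstack([X0, X1]), w, t) ==
              np.r_[np.zeros(100), np.ones(100)])
print(acc > 0.9)   # well-separated classes, so accuracy is near-perfect
```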
9. Data Collection
Corpus for speech/music classification provided by Dan Ellis originally recorded by
Eric Scheirer during his internship at Interval Research Corporation
10. “Music-Speech” Corpus

Training data (15 seconds per file, 45 minutes in total):
  music (60 files)
  speech (60 files)
  music + speech (60 files)

Test data (2.5 seconds per file, 15.25 minutes in total):
  speech without background music (120 files)
  music with no vocals (126 files)
  music with vocals (120 files)
11. Results and Discussion
Probability of error for Fisher LDA, 1-NN, and 3-NN with each feature individually:

Feature           Fisher    1-NN     3-NN
Centroid (MDCT)    8.74%   17.48%   21.85%
Centroid (DFT)    16.66%   29.23%   30.60%
Roll-off (MDCT)   14.48%   25.40%   21.85%
Roll-off (DFT)     8.19%   13.11%   13.11%
ZCR                9.83%   19.67%   18.03%
HZCRR             25.13%   39.89%   36.33%
STE               48.63%   22.40%   22.67%
LSTER             11.74%   33.87%   23.77%
MFCC               4.09%   22.13%   26.50%
Voice2White        4.91%    6.28%    6.01%
Activity level    12.84%   18.03%   18.85%
Combining two or more of these features does not seem to improve the results.
For example, using the MFCC and Voice2White features together with a Fisher
linear discriminant classifier leads to a probability of error of 4.09%, the
same as with MFCC alone.
12. Confusion matrices using the Voice2White feature
(rows: actual class, columns: predicted class)

Classifier          Speech   Music
Fisher    Speech      104      16
          Music         2     244
1-NN      Speech      114       6
          Music        17     229
3-NN      Speech      116       4
          Music        18     228
Fisher LDA has a high probability of error when the input is speech, while
K-NN has a high probability of error when the input is music.
So why not combine the classifiers using a majority rule for better results?
The probability of error then drops to 4.5%
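The combination step can be sketched as a plain majority vote over the individual classifiers' outputs. This is an illustrative sketch; the slide does not detail the exact combination scheme beyond "majority rule", and ties are broken arbitrarily here.

```python
# Majority-rule combination: each classifier votes for a class label, and
# the most common label wins.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of class labels, one per classifier."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. Fisher LDA says "music" but 1-NN and 3-NN both say "speech":
print(majority_vote(["music", "speech", "speech"]))  # speech
```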
13. Conclusion
Fisher linear discriminant analysis can provide very promising results using
only one feature for the classification
Better results may be obtained by combining the outputs of two or more
classifiers