Improving the global parameter signal to distortion value in music signals
International Journal of Electronics and Communication Engineering & Technology (IJECET)
ISSN 0976 – 6464 (Print), ISSN 0976 – 6472 (Online)
Volume 4, Issue 1, January-February (2013), pp. 01-10
© IAEME: www.iaeme.com/ijecet.asp
Journal Impact Factor (2012): 3.5930 (Calculated by GISI), www.jifactor.com
IMPROVING THE GLOBAL PARAMETER SIGNAL TO DISTORTION
VALUE IN MUSIC SIGNALS USING PANNING TECHNIQUE AND
DISCRETE WAVELET TRANSFORMS
VENKATESH KUMAR N.(1), RAGHAVENDRA N.(2), SUBASH KUMAR T. G.(3), MANOJ KUMAR K.(4)
(1) SET, Asst. Professor, Department of ECE, Jain University, Jakkasandra, Ramanagar Taluk,
Karnataka, India, kumarsparadise@yahoo.com
(2) Principal Staff Engineer, Google Inc. [formerly Motorola Mobility], Bangalore, India
(3) Project Leader, Jasmin Infotech Pvt Ltd, Velacherry, Chennai, India
(4) Consultant, Java Mentor, Bangalore, India
ABSTRACT
In this paper, an attempt is made to alleviate the effect of distortion during feature
extraction from a music signal. The proposed method is compared with existing methods for
performance evaluation, and the comparison shows an improved signal to distortion value.
Keywords: Blind Source Separation; DWT; FFT; Panning; STFT; Signal to Distortion Ratio
1. INTRODUCTION
The singing voice, in addition to being the oldest musical instrument, is also one of
the most complex from an acoustic standpoint [1]. Research on the perception of singing is
not as developed as in the closely related field of speech research [2]. Some of the existing
work is surveyed in this section.
Chou and Gu [3] used a Gaussian mixture model (GMM) to detect the vocal
regions. The feature vectors used for the GMM include 4 Hz modulation energy, harmonic
coefficients, 4 Hz harmonic coefficients, delta mel frequency cepstral coefficients (MFCC)
and delta log energy.
Berenzweig and Ellis [4] used a speech recognizer's classifier to distinguish vocal
segments from accompaniment.
Kim and Whitman [5] developed a system for singer identification in popular
music recordings using voice coding features.
Another system for automatic singer identification was proposed by Zhang [6].
Maddage et al. [7] proposed a framework for music structure analysis with the
help of repeated chord pattern analysis and vocal content analysis.
Cobos and Lopez [8] proposed a system for extracting the singing voice from stereo
recordings. This system combines panning information and pitch tracking, which makes it
possible to refine the time-frequency mask applied for extracting a vocal segment and thus
improves the separation.
1.1 Motivation
In real-time applications of sound separation, such as lyrics recognition and music
remixing, music information retrieval requires accurate extraction of features from the music
signal. Existing methods result in a poor signal to distortion value. Hence, it is necessary to
enhance music quality by improving the signal to distortion value.
1.2 Problem Statement
Real-time applications of music separation algorithms demand better signal
to noise and distortion ratios (SINAD). These parameters depend on the technique used for
feature extraction. In the literature, similarity measures between the Short Time Fourier
Transforms (STFTs) of the input signals have been used to identify the time-frequency
(TF) regions occupied by each source based on the panning coefficient. In this work,
we instead implement audio source separation using similarity measures between the Discrete
Wavelet Transforms (DWTs) of the input signals to identify the time-frequency regions
occupied by each source based on the panning coefficient, with the aim of improving the
Signal to Distortion ratio.
2. PROPOSED SOURCE SEPARATION TECHNIQUE
2.1 Music Source Separation Model
The source separation problem can be stated as follows: given M linear mixtures of N
sources mixed via an unknown M × N mixing matrix A, estimate the underlying sources from
the mixtures. When M = N, this can be achieved by estimating an un-mixing matrix W,
which allows the original sources to be estimated up to a permutation and a scale factor.
Independent Component Analysis (ICA) algorithms are able to perform the separation if
some conditions are satisfied: the sources must be non-Gaussian and statistically independent
[9]. Moreover, the number of sources must be equal to the number of available mixtures, M =
N, in which case the problem is said to be evenly determined. When M > N, the mixing process
is overdetermined and the underlying sources can be estimated by least-squares
optimization using matrix pseudo-inversion. If M < N, the mixing process is underdetermined
and the estimation of the sources becomes much more difficult [10]. When dealing with
stereo commercial music recordings, only the information of the left and right channels is
available, and thus the mixture is generally underdetermined [11]. Sparse methods provide a
powerful approach to the separation of several signals when there are more sources than
sensors [12]. The sparsity property of audio signals means that in most time-frequency bins,
all sources but one, at most, will have a time-frequency coefficient of zero or close to
zero [13][14]. The DUET algorithm [15], originally conceived for separating
underdetermined speech mixtures, assumes that because of the sparsity of speech in the Short Time
Fourier Transform (STFT) domain, almost all mixture time-frequency points with significant
magnitude are in fact due to only one of the original sources. In the ideal case when
each time-frequency point belongs only to one source, the sources are said to be W-Disjoint
Orthogonal (W-DO).
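The overdetermined case (M > N) mentioned above can be illustrated with a small numerical sketch. It is not part of the proposed method: the mixing matrix A is assumed to be known here purely for illustration (in blind separation it is not), and the matrix values, the sample data and all function names are hypothetical. The sources are recovered with the pseudo-inverse (AᵀA)⁻¹Aᵀ for N = 2 sources and M = 3 mixtures.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def inverse_2x2(m):
    """Invert a 2 x 2 matrix (enough for the N = 2 sources used here)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix does not have full column rank")
    return [[d / det, -b / det], [-c / det, a / det]]


def least_squares_unmix(A, X):
    """Estimate S from X = A S using the pseudo-inverse (A^T A)^-1 A^T."""
    At = transpose(A)
    pinv = matmul(inverse_2x2(matmul(At, A)), At)
    return matmul(pinv, X)


# Hypothetical data: 2 sources (rows), 4 samples each.
S = [[1.0, 0.0, -1.0, 2.0],
     [0.5, 0.5, 3.0, -2.0]]
# Hypothetical 3 x 2 mixing matrix (M = 3 mixtures, N = 2 sources).
A = [[1.0, 0.2],
     [0.3, 1.0],
     [0.7, 0.7]]

X = matmul(A, S)                    # the three observed mixtures
S_hat = least_squares_unmix(A, X)   # least-squares source estimates
```

In the stereo music setting considered in this paper the situation is the opposite (M = 2 < N), which is why time-frequency sparsity is exploited instead of matrix inversion.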
2.2 Overview of the Proposed Model
[Figure 1: flowchart. Stereo input (X1, X2) → define similarity measure → calculate partial
similarity measures → calculate ambiguity-resolving function → panning index analysis and
Gaussian windowing (set the window width ζ) → foreground streams → DWTs of the foreground
streams are obtained → apply the DWT−1 operator, obtaining the target signal, i = 1, 2]
Figure 1. Overview of the proposed separation model
2.3 Panning Index Windowing
An initial segregation of the singing voice is obtained by applying the source
identification technique developed by Avendano [16]. This technique is based on a
comparison of the left and right signals in the TF plane, obtaining a two-dimensional map
that identifies different source components related to the panning gains used in the stereo
mix-down. Firstly, following [16], a similarity measure is defined:
$$\psi(k,m) = \frac{2\,|X_1(k,m)\,X_2^*(k,m)|}{|X_1(k,m)|^2 + |X_2(k,m)|^2} \tag{1}$$
where * denotes complex conjugation, and X_1(k,m) and X_2(k,m) are the transforms of the
left and right channels. If the source is panned to the center, the function attains its
maximum value of one, and if the source is panned completely to either side, the function
attains its minimum value of zero. A quadratic dependence on the panning knob Φ makes the
function (1) multi-valued, so an ambiguity appears in knowing the lateral direction of the
source. The ambiguity is resolved using the following partial similarity measures:
$$\psi_i(k,m) = \frac{|X_i(k,m)\,X_j^*(k,m)|}{|X_i(k,m)|^2}, \quad i \neq j \tag{2}$$
and their difference
$$\Delta(k,m) = \psi_1(k,m) - \psi_2(k,m) \tag{3}$$
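The ambiguity-resolving function is:
$$\hat{\Delta}(k,m) = \begin{cases} 1, & \Delta(k,m) > 0 \\ 0, & \Delta(k,m) = 0 \\ -1, & \Delta(k,m) < 0 \end{cases} \tag{4}$$
where Δ(k,m) is the difference (3) of the partial similarity measures. Finally, with ψ(k,m)
the similarity measure (1), the panning index Ψ(k,m) is obtained as
$$\Psi(k,m) = [1 - \psi(k,m)]\,\hat{\Delta}(k,m) \tag{5}$$
which identifies the time-frequency components of the sources in the stereo mix when they
are all panned to different positions.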
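The panning index computation for a single TF coefficient pair can be sketched as follows. This is a minimal illustration that assumes the definitions of Avendano [16]: the similarity 2|X1 X2*| / (|X1|² + |X2|²), the partial similarities |X1 X2*| / |Xi|², and the panning index (1 − similarity) multiplied by the sign of the partial-similarity difference. The function name and the small constant `eps` (added to avoid division by zero) are not from the paper.

```python
def panning_index(x1, x2, eps=1e-12):
    """Panning index of one coefficient pair (x1: left, x2: right).

    Assumed definitions (after Avendano [16]); eps is a hypothetical
    guard against division by zero in silent bins.
    """
    x1, x2 = complex(x1), complex(x2)
    cross = abs(x1 * x2.conjugate())
    p1 = abs(x1) ** 2
    p2 = abs(x2) ** 2

    similarity = 2.0 * cross / (p1 + p2 + eps)   # 1 when center-panned
    partial_1 = cross / (p1 + eps)
    partial_2 = cross / (p2 + eps)
    delta = partial_1 - partial_2

    # Ambiguity-resolving function: sign of the difference (-1, 0 or 1).
    direction = (delta > 0) - (delta < 0)
    return (1.0 - similarity) * direction


s = 1.0 + 2.0j                                # hypothetical source coefficient
pi_center = panning_index(s, s)               # equal gains in both channels
pi_left = panning_index(0.8 * s, 0.6 * s)     # louder in channel 1
pi_right = panning_index(0.6 * s, 0.8 * s)    # louder in channel 2
```

With this sign convention a source that is louder in channel 1 gets a negative index and its mirror image gets the same magnitude with a positive sign, while a center-panned source maps to zero. DWT coefficients are real-valued, so the conjugation has no effect for them; it is kept so the same sketch also covers STFT coefficients.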
If several sources are equally panned, they will appear in the PI map as a single
source. Due to the overlap with other sources, selecting only bins with Ψ = Ψ0 will exclude
bins where the source might still have significant energy but whose panning index has been
altered by the presence of the interference. A Gaussian window is proposed to let components
with values equal to Ψ0 pass unmodified and weight TF points with a PI value near to Ψ0:
$$\Theta(k,m) = \nu + (1 - \nu)\,\exp\!\left(-\frac{(\Psi(k,m) - \Psi_0)^2}{2\zeta}\right) \tag{6}$$
where Ψ0 is the panning index value for extracting a given source, ζ controls the width of
the window, and ν is a floor value necessary to avoid setting DWT values to zero, which
might result in musical-noise artifacts. The Ψ0 value must be specified for centering the
separating window. Most vocal removers exploit the fact that the singing voice is usually
panned to the center. This is true for most music recordings, so Ψ0 = 0 is normally used. A
supervised exploration along different PI values can be used for locating more exactly the pan
location of the vocals. The value of ζ, used for setting up the window width, can be obtained
using
(7)
where Ψc is the PI value where the window reaches a small value A, for example A = −60 dB.
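The role of the window parameters can be sketched numerically. The sketch assumes a Gaussian window of the form ν + (1 − ν)·exp(−(Ψ − Ψ0)² / (2ζ)) and derives the width from the condition that the Gaussian term equals A at Ψ = Ψc, which gives ζ = −(Ψc − Ψ0)² / (2 ln A). This derivation is an assumption made for illustration and is not necessarily the same expression as (7). The value Ψc = 0.2 is hypothetical (Table 1 does not list it); A = 0.001 and ν = 0.0005 are taken from Table 1.

```python
import math


def window_width(psi_c, psi_0, a_linear):
    """Width zeta for which the Gaussian term equals a_linear at psi_c.

    Derived from exp(-(psi_c - psi_0)**2 / (2 * zeta)) == a_linear;
    an illustrative assumption, not necessarily the paper's Eq. (7).
    """
    if not 0.0 < a_linear < 1.0:
        raise ValueError("a_linear must lie strictly between 0 and 1")
    return -((psi_c - psi_0) ** 2) / (2.0 * math.log(a_linear))


def gaussian_window(psi, psi_0, zeta, floor):
    """Weight applied to a coefficient whose panning index is psi."""
    return floor + (1.0 - floor) * math.exp(-((psi - psi_0) ** 2) / (2.0 * zeta))


PSI_0 = 0.0      # center-panned vocals (Table 1)
PSI_C = 0.2      # hypothetical value, not given in the paper
A = 0.001        # smallest window value (Table 1)
FLOOR = 0.0005   # floor value (Table 1)

zeta = window_width(PSI_C, PSI_0, A)
weight_at_center = gaussian_window(PSI_0, PSI_0, zeta, FLOOR)
weight_at_edge = gaussian_window(PSI_C, PSI_0, zeta, FLOOR)
weight_far_away = gaussian_window(1.0, PSI_0, zeta, FLOOR)
```

Coefficients with Ψ = Ψ0 pass with weight one, the weight at Ψc is ν + (1 − ν)·A, and coefficients far from Ψ0 are attenuated down to roughly the floor value rather than being set to zero.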
Once the parameters of the window have been set up, the DWTs of the initial
foreground streams are simply obtained by applying the window Θ(k,m) to each of the mixture
channels:
$$S_i(k,m) = \Theta(k,m)\,X_i(k,m), \quad i = 1, 2 \tag{8}$$
These are converted back to the time domain by applying the DWT−1 operator, obtaining s_i(t), i = 1, 2.
The denotes the corresponding step of the separation method. The recovered
target signal is obtained by adding the time-domain foreground streams s_1(t) and s_2(t) of
both channels:
$$\hat{s}(t) = s_1(t) + s_2(t) \tag{9}$$
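The windowing, inverse transform and summation steps can be sketched with a one-level Haar DWT. The paper does not state which wavelet or how many decomposition levels were used, so the Haar wavelet, the single level and all function names are assumptions made only to keep the example self-contained. The weights passed in stand for the window values of the individual coefficients.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar DWT of an even-length sequence: (approx, detail)."""
    if len(x) % 2:
        raise ValueError("input length must be even")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x


def extract_target(left, right, weights_approx, weights_detail):
    """Weight each channel's coefficients, invert, and add both streams."""
    streams = []
    for channel in (left, right):
        approx, detail = haar_dwt(channel)
        approx = [w * c for w, c in zip(weights_approx, approx)]
        detail = [w * c for w, c in zip(weights_detail, detail)]
        streams.append(haar_idwt(approx, detail))
    return [l + r for l, r in zip(*streams)]


# Hypothetical 4-sample stereo frame.
left = [1.0, 2.0, 3.0, 4.0]
right = [0.5, 0.5, -1.0, 1.0]

# Weights of one everywhere keep every coefficient: output is left + right.
passthrough = extract_target(left, right, [1.0, 1.0], [1.0, 1.0])
# Attenuating the detail coefficients to a small floor value instead of zero.
smoothed = extract_target(left, right, [1.0, 1.0], [0.0005, 0.0005])
```

In the full method the weights would come from the Gaussian window evaluated at each coefficient's panning index, so that only coefficients panned near Ψ0 survive.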
2.4 Performance Evaluation
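Separation algorithms can be evaluated by using a set of measures under some
allowed distortions. These distortions depend on the kind of application considered. In [17],
four numerical performance criteria are defined. The Signal to Distortion Ratio
$$\mathrm{SDR} = 10 \log_{10} \frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\|^2} \tag{10}$$
the Signal to Interferences Ratio
$$\mathrm{SIR} = 10 \log_{10} \frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}\|^2} \tag{11}$$
the Signal to Noise Ratio
$$\mathrm{SNR} = 10 \log_{10} \frac{\|s_{\mathrm{target}} + e_{\mathrm{interf}}\|^2}{\|e_{\mathrm{noise}}\|^2} \tag{12}$$
and the Signal to Artifacts Ratio
$$\mathrm{SAR} = 10 \log_{10} \frac{\|s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}}\|^2}{\|e_{\mathrm{artif}}\|^2} \tag{13}$$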
where s_target is a version of the true source s_j modified by an allowed distortion, and
where e_interf, e_noise and e_artif respectively are the interferences, noise and artifacts
error terms resulting from the decomposition
$$\hat{s}_j = s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \tag{14}$$
The SIR and the SAR are indicators of the rejection of the interferences and the
absence of "burbling" artifacts, respectively. The SNR is a measure of the rejection of the
sensor noise and the SDR can be seen as a global performance measure.
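Given the four terms of the decomposition, the criteria reduce to energy ratios in decibels. The sketch below assumes the definitions of [17] (for example, SDR as the energy of the target term over the energy of the sum of the three error terms) and takes the decomposition terms as already computed. Obtaining those terms from an estimated source is what the BSS EVAL toolbox does and is not reproduced here. The function names and sample values are hypothetical.

```python
import math


def energy(x):
    """Squared Euclidean norm of a signal."""
    return sum(v * v for v in x)


def add(*signals):
    """Sample-wise sum of equally long signals."""
    return [sum(vals) for vals in zip(*signals)]


def ratio_db(numerator, denominator):
    """Energy ratio of two signals in decibels."""
    return 10.0 * math.log10(energy(numerator) / energy(denominator))


def bss_metrics(s_target, e_interf, e_noise, e_artif):
    """SDR, SIR, SNR and SAR from the decomposition terms (after [17])."""
    return {
        "SDR": ratio_db(s_target, add(e_interf, e_noise, e_artif)),
        "SIR": ratio_db(s_target, e_interf),
        "SNR": ratio_db(add(s_target, e_interf), e_noise),
        "SAR": ratio_db(add(s_target, e_interf, e_noise), e_artif),
    }


# Hypothetical, mutually orthogonal terms chosen so the ratios are easy to check.
metrics = bss_metrics(
    s_target=[10.0, 0.0, 0.0, 0.0],
    e_interf=[0.0, 1.0, 0.0, 0.0],
    e_noise=[0.0, 0.0, 1.0, 0.0],
    e_artif=[0.0, 0.0, 0.0, 1.0],
)
```

Because the SDR denominator collects all three error terms, it can never exceed the SIR computed from the same decomposition, which is why it serves as the global measure.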
3. IMPLEMENTATION
The model discussed above is implemented in MATLAB R2010a, and the BSS
EVAL toolbox [18] for MATLAB is used for performance evaluation. Later in this section,
the features extracted using our method are compared with two other feature extraction
techniques, STFT and FFT.
3.1 Design parameters for Source separation
Table 1
Design parameters for source separation
Parameter                    Value
Frame size                   1000
Overlap                      0.75%
Panning index (C)            -
Panning index (0)            0
Smallest window value (A)    0.001
Floor value                  0.0005
Using the design parameters listed in Table 1, we calculate the
performance evaluation parameter SDR for the proposed model.
To evaluate the extracted features, they were compared in two classification
experiments with two feature sets that have been proposed in the literature. The first feature
set consists of features extracted using the STFT. The second feature set consists of features
extracted using the Fast Fourier Transform (FFT).
The source separation method is applied over several wave files, where each audio
file is approximately 30 seconds long, with a frame size of 1000 samples at 44100 Hz
sampling rate.
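As a rough check of what these settings imply, the frame duration and the number of analysis frames can be worked out as below. The sketch assumes that the overlap entry of Table 1 means that consecutive frames share 75% of their samples (a hop of 250 samples). This reading is an assumption, as is the convention of counting only complete frames, and the function name is hypothetical.

```python
def frame_layout(duration_s, sample_rate_hz, frame_size, overlap_fraction):
    """Hop size, frame duration and number of complete frames."""
    total_samples = int(round(duration_s * sample_rate_hz))
    hop = int(round(frame_size * (1.0 - overlap_fraction)))
    if hop <= 0:
        raise ValueError("overlap_fraction must be below 1")
    if total_samples < frame_size:
        n_frames = 0
    else:
        n_frames = 1 + (total_samples - frame_size) // hop
    return {
        "total_samples": total_samples,
        "hop": hop,
        "frame_ms": 1000.0 * frame_size / sample_rate_hz,
        "n_frames": n_frames,
    }


# 30 s file, 44100 Hz, 1000-sample frames; 0.75 overlap is an assumed reading.
layout = frame_layout(30.0, 44100, 1000, 0.75)
```

Under these assumptions each frame covers about 22.7 ms of audio and a 30-second file yields 5289 frames.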
4. RESULTS
A similarity measure between the Discrete Wavelet Transforms (DWTs) of the input
signals is used to identify the time-frequency regions occupied by each source based on the
panning coefficient assigned to it during the mix. Individual music components are identified
and manipulated by clustering the time-frequency components with a given panning
coefficient. After modification, an inverse DWT (IDWT) is used to synthesize a time-domain
processed signal. The figures below show the input signal and the extracted
voice signal for several wave files, plotted using MATLAB.
The performance evaluation parameter, SDR, is obtained using the BSS EVAL
toolbox in MATLAB. The music separation method is applied over several wave files, where
each audio file is approximately 30 seconds long, with a frame size of 1000 samples at 44100
Hz sampling rate. The experiments are listed below.
Experiment 1:
Fig. 2 shows the input signal and the extracted voice signal from the wave file
"boyfriend.wav", composed by Ashley Simpson, which is 20 seconds long, with 44100 Hz
sampling rate. MATLAB software is used to plot the results. The results are tabulated in
Table 2.
Figure 2. Input signal and the extracted voice signal from the wave file "boyfriend.wav"
Table 2
SDRs of "boyfriend.wav"
I/P wave file / Composer      SDR (FFT)    SDR (STFT)    SDR (DWT)
Boyfriend - Ashley Simpson    44.1153      51.1288       84.0558
Experiment 2:
Fig. 3 shows the input signal and the extracted voice signal from the wave file
"chammak challo.wav", composed by Vishal Shekhar, which is 35 seconds long, with 44100
Hz sampling rate. MATLAB software is used to plot the results. The results are tabulated in
Table 3.
Figure 3. Input signal and the extracted voice signal from the wave file "chammakchallo.wav"
Table 3
SDRs of "chammakchallo.wav"
I/P wave file / Composer          SDR (FFT)    SDR (STFT)    SDR (DWT)
Chammakchallo - Vishal Shekhar    35.5603      40.5822       83.47656607
Experiment 3:
Fig. 4 shows the input signal and the extracted voice signal from the wave file
"toxic.wav", composed by Britney Spears, which is 27 seconds long, with 44100 Hz sampling
rate. MATLAB software is used to plot the results. The results are tabulated in Table 4.
Figure 4. Input signal and the extracted voice signal from the wave file "toxic.wav"
Table 4
SDRs of "toxic.wav"
I/P wave file / Composer    SDR (FFT)    SDR (STFT)    SDR (DWT)
Toxic - Britney Spears      56.2875      59.7563       85.7679
5. CONCLUSION
Audio source separation using the DWT is presented. The Discrete Wavelet Transforms
(DWTs) of the input signals were used to identify the time-frequency regions
occupied by each source based on the panning coefficient, which improved the Signal to Noise
and Distortion ratios. From Tables 2, 3 and 4, it is evident that the results
obtained using the DWT as feature extractor are approximately 38% better than those of the
other two feature extractors, indicating that the proposed method provides better Signal to
Noise and Distortion Ratios.
6. ACKNOWLEDGEMENTS
Authors 1 and 4 gratefully acknowledge the technical support provided
by Jasmin Infotech, India and Google India.
REFERENCES
[1] Kim, Y. and Whitman, B. "Singer identification in popular music recordings using voice
coding features," Proc. ISMIR 2002.
[2] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36,
no. 3, pp. 287–314, April 1994.
[3] Chou, W. and Gu, L. "Robust singing detection in speech/music discriminator design,"
Proc. ICASSP 2001.
[4] Berenzweig, A. and Ellis, D.P.W. "Locating singing voice segments within music
signals," Proc. WASPAA 2001.
[5] Kim, Y. and Whitman, B. "Singer identification in popular music recordings using voice
coding features," Proc. ISMIR 2002.
[6] Zhang, T. "System and method for automatic singer identification," Proc. ICME 2003.
[7] Maddage, N.C., et al. "Content-based music structure analysis with applications to
music semantic understanding," Proc. ACM Multimedia 2004.
[8] Maximo Cobos and Jose J. Lopez, "Singing Voice Separation Combining Panning
Information and Pitch Tracking," Audio Engineering Society Convention Paper, presented at
the 124th Convention, May 17–20, 2008, Amsterdam, The Netherlands.
[9] J. F. Cardoso, "Blind signal separation: statistical principles," in Proceedings of the
IEEE, vol. 86, no. 10, pp. 2009-2025.
[10] T. W. Lee, M. S. Lewicki, M. Girolami and T. J. Sejnowski, "Blind source separation of
more sources than mixtures using overcomplete representations," in IEEE Signal Processing
Letters, vol. 6, no. 4, pp. 87-90, April 1999.
[11] A. S. Master, "Stereo Music Source Separation via Bayesian Modeling," Ph.D.
Dissertation, Stanford University, June 2006.
[12] P. D. O'Grady, B. A. Pearlmutter and S. T. Rickard, "Survey of sparse and non-sparse
methods in source separation," International Journal of Imaging Systems and
Technology (IJIST), vol. 15, no. 1, pp. 18-33, 2005.
[13] C. Jutten and M. Babaie-Zadeh, "Source separation principles, current advances and
applications," presented at the 2006 German-French Institute for Automation and Robotic
Annual Meeting, IAR 2006, Nancy, France, November 2006.
[14] K. Torkkola, "Blind separation for audio signals: are we there yet?," in Proceedings of
the Workshop on Independent Component Analysis and Blind Signal Separation (ICA 1999),
1999.
[15] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency
masking," in IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, July 2004.
[16] C. Avendano, "Frequency-domain source identification and manipulation in stereo mixes
for enhancement, suppression and re-panning applications," in IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October
2003.
[17] E. Vincent, R. Gribonval and C. Févotte, "Performance Measurement in Blind Audio
Source Separation," in IEEE Transactions on Speech and Audio Processing, vol. 14, no. 4,
pp. 1462-1469, 2006.
[18] C. Févotte, R. Gribonval and E. Vincent, "BSS EVAL Toolbox User Guide,"
IRISA, Rennes, France, 2006.
[19] Ravindra M. Malkar, Vaibhav B. Magdum and Darshan N. Karnawat, "An Adaptive
Switched Active Power Line Conditioner Using Discrete Wavelet Transform (DWT),"
International Journal of Electrical Engineering & Technology (IJEET), Volume 2, Issue 1,
2011, pp. 14-24, Published by IAEME.