Daichi Kitamura, "Blind audio source separation based on time-frequency structure models," Invited Overview Session in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2021), Tokyo, Japan, December 2021.
"Exploring the Essential Functions and Design Considerations of Spillways in ...
Blind audio source separation based on time-frequency structure models
1. 13th Asia Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA ASC 2021)
Overview Session OS-1: Acoustic Signal Processing
Blind Audio Source Separation Based
on Time-Frequency Structure Models
Daichi Kitamura
National Institute of Technology, Kagawa College
Japan
2. 2
• Daichi Kitamura
• National Institute of Technology, Kagawa College
• Research interests
– Audio source separation
– Array signal processing
– Machine learning
– Music signal processing
– Biosignal processing
Self introduction
3. 3
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
4. 4
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
5. 5
• Blind source separation (BSS) for audio signals
– estimates specific audio sources in the observed mixture
– does not require prior information of recording conditions
• locations of mics and sources, room geometry, timbres, etc.
• The word “blind” means “unsupervised”.
– is available for many audio applications
• Hearing aid systems
• Automatic speech recognition (ASR)
• Preprocessing for music analysis etc.
Background: BSS for audio signals
Observed mixture
BSS
Estimated source signals
6. 6
Background: BSS for audio signals
• Music BSS using ILRMA
Guitar
Vocal
Keyboard
Guitar
Vocal
Keyboard
BSS
Please listen carefully to the three
parts in the mixture.
MATLAB code: https://github.com/d-kitamura/ILRMA
Python code: Implemented in “Pyroomacoustics” library
7. 7
• Numbers of mics and sources
• Consider only “determined” situation
– # of mics = # of sources
– BSS estimates “demixing system” (inverse of mixing)
Background: BSS for audio signals
Source signals Observed signals Estimated signals
Mixing system Demixing system
Monaural rec.
1ch
Single-channel signal Mic array
1ch
Mch
Multichannel signal
2ch
8. 8
Spectral subtraction
Time-frequency masking
Many other methods
Beamforming
Sparse coding
Time-frequency masking
DOA clustering
Many other methods
Historical overview (only the methods related to this talk)
1994
1998
2013
1999
2012
Permutation solvers
Extension of models
Generative models
Frequency-domain ICA
Itakura-Saito NMF
IVA
2016
2009
2006
2011 AuxIVA
Time-varying IVA
Multichannel NMF
2018 IDLMA
Single-channel
Spatial covariance model
Spatial covariance
model+DNN
Supervised approaches
based on deep neural
networks (DNN)
ICA
[Comon], [Bell and Sejnowski],
[Cardoso], [Amari], [Cichocki], …
[Smaragdis]
[Saruwatari], [Murata],
[Morgan], [Sawada], …
[Hiroe], [Kim]
[Ono]
[Ono]
[Kitamura]
[Nugraha]
[Ozerov, Sawada]
[Duong]
[Févotte]
[Lee]
[Virtanen], [Smaragdis],
[Kameoka], [Ozerov], …
2010
Underdetermined
Determined
[Yatabe&Kitamura]
2021
Time-freq.-masking-
based BSS (TFMBSS)
[Mogami]
NMF
ILRMA
Gray-colored methods
are “supervised”
(not fully blind)
9. 9
Motivation of determined BSS
• Conventional BSS: IVA, AuxIVA, and ILRMA
– Minimum distortion (linear demixing)
– Relatively fast and stable optimization
• Iterative projection (AuxIVA) [Ono+, 2010], [Ono, 2011]
– Time-frequency (TF) structure model affects performance
• IVA: co-occurrence along frequency axis
• ILRMA: NMF-based low-rank time-frequency structure
– Optimization algorithm depends on the TF model
• Difficult to derive update rules
• Easily replace TF model and search the best one
– Time-frequency-masking-based BSS (TFMBSS)
: frequency bins
Observed
signal
Source signals
Frequency-wise mixing matrix
: time frames
Estimated
signal
Frequency-wise demixing matrix
[Yatabe & Kitamura, 2021]
10. 10
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
11. 11
Independence-based BSS in time domain
• Independent component analysis (ICA) [Comon, 1994]
– If we assume
– then we can estimate demixing matrix
• by maximizing independence between the estimates ( and )
Mixing matrix
Sources
(latent components)
1. Mutually
independent
2. Non-Gaussian
3. Invertible and
time-invariant
Mixtures
(observed signals)
Inverse matrix
12. 12
• Independent component analysis (ICA) [Comon, 1994]
– Maximizes independence between source distributions
– Optimization problem in ICA
Independence-based BSS in time domain
Minimize
similarity
: Non-Gaussian source distribution
(e.g., Laplace distribution)
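As a concrete illustration of this minimization, here is a minimal numpy sketch (not from the slides) of natural-gradient ICA on a 2x2 instantaneous mixture of Laplace sources; the mixing matrix, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
S = rng.laplace(size=(2, T))               # two independent non-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # illustrative mixing matrix
X = A @ S                                   # observed mixtures

W = np.eye(2)                               # demixing matrix estimate
mu = 0.05                                   # step size (illustrative)
for _ in range(500):
    Y = W @ X
    phi = np.sign(Y)                        # score function of a Laplace prior
    # natural-gradient update: W <- W + mu * (I - E[phi(y) y^T]) W
    W += mu * (np.eye(2) - (phi @ Y.T) / T) @ W

Y = W @ X
# each estimate should match one source up to scale and permutation
C = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
```

Note the two ambiguities of the next slide are visible here: the correlation check ignores scale, and the recovered order may be swapped.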
13. 13
Independence-based BSS in time domain
• Independent component analysis (ICA) [Comon, 1994]
– However,
• 1. Signal scales (volumes) cannot be determined
• 2. Signal permutation cannot be determined
Sources
(latent components)
Mixtures
(observed signals)
Sources
(latent components)
Mixtures
(observed signals)
Separated signals
(estimated by ICA)
Separated signals
(estimated by ICA)
14. 14
• General audio mixture
– Convolution with room reverberation
• To deconvolute (separate) them,
– apply short-time Fourier transform (STFT) and convert
signals to TF domain
– estimate frequency-wise demixing matrix
Independence-based BSS in frequency domain
Mixture without reverb.
Mixture with reverb.
Convolutive mixture in time domain
Mixture in TF domain
: freq. index
: time index
Reverb. length
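The move to the TF domain relies on the convolution theorem: circular convolution in time equals frequency-wise multiplication (each STFT frame approximates this when the window is long relative to the reverberation). A small numpy check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 64, 8
h = rng.standard_normal(L)                  # short mixing filter (reverberation)
x = rng.standard_normal(n)                  # source signal

# circular convolution in the time domain
y_time = np.array([sum(h[k] * x[(t - k) % n] for k in range(L))
                   for t in range(n)])
# frequency-wise multiplication in the DFT domain
y_freq = np.fft.ifft(np.fft.fft(np.pad(h, (0, n - L))) * np.fft.fft(x)).real
```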
15. 15
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
– applies ICA to each of frequencies separately
– estimates frequency-wise demixing matrix
Inverse matrix
Frequency-wise
mixing matrix
Frequency-wise
demixing matrix
FDICA
: freq. index
: time index
16. 16
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
– Optimization problem in FDICA
– By assuming a circularly symmetric complex Laplace dist.,
– the minimization problem in FDICA becomes as follows
• separable w.r.t. frequency
FDICA
: Non-Gaussian complex-valued source distribution
(e.g., circularly symmetric complex Laplace distribution)
17. 17
• Permutation problem in FDICA
– The order of the separated signals is messed up
– Alignment along the frequency axis is required
*Signal scales are also messed up, but they can be easily fixed by applying projection back technique.
ICA
In all frequencies
Source 1
Source 2
Mixture 1
Mixture 2
Permutation
Solver
Separated signal 1
Separated signal 2
Time
Permutation problem
18. 18
Popular permutation solvers
• Signal correlation between frequencies
– FDICA + correlation-based clustering [Murata+, 2001], [Sawada+, 2011]
• Direction of arrival of each source (DOA)
– FDICA + DOA-based alignment [Saruwatari+, 2006]
• Co-occurrence among frequencies of each source
– Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] , [Kim, 2007]
• Low-rank TF modeling of each source
– Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
• DNN-based supervised TF modeling of each source
– Independent deeply learned matrix analysis (IDLMA) [Makishima+, 2019]
• DNN-based permutation solver
– Generalized permutation solver with training [Yamaji&Kitamura, 2020]
• Spectrogram consistency
– Consistent IVA and consistent ILRMA [Yatabe, 2020], [Kitamura+, 2020]
19. 19
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– utilizes a sourcewise frequency vector as a random variable
– Vector source model in IVA
• The spherical property of the source distribution groups
components that co-occur across
all frequencies into one source
IVA
Permutation-problem-free estimation
of the demixing matrix can be achieved!
Mixing matrix
Observed vector
Demixing matrix
Estimated vector
Multivariate
distribution
Have internal
correlations
Source vector
Frequency
Time
Co-occurrence of all
frequencies in each source
20. 20
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– How valid is IVA’s TF structure model?
• Typical audio sources have co-occurrence of all frequencies
• Can be interpreted as “group sparsity” in TF domain
IVA
Speech source
(conversation)
Vocal source
(pop music)
21. 21
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– Optimization problem in IVA
– By assuming spherical Laplace dist., [Hiroe, 2006], [Kim, 2006]
– the minimization problem in IVA becomes as follows
IVA
: Non-Gaussian multivariate and spherical complex-
valued source distribution
(e.g., spherical Laplace distribution)
22. 22
• Auxiliary-function-based IVA (AuxIVA)[Ono, 2011]
– Fast and stable optimization called iterative projection (IP)
• Auxiliary function technique (or majorization-minimization algorithm)
– Convergence-guaranteed
fast and stable optimization
without stepsize parameters
Efficient optimization for IVA
Update of auxiliary variables Update of original variables
Python code: Implemented in “Pyroomacoustics” library
https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.bss.auxiva.html
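To make the IP update concrete, here is a hypothetical numpy sketch of AuxIVA run on synthetic data generated exactly according to IVA's vector source model; all sizes and the iteration count are illustrative, and the Pyroomacoustics implementation should be preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, N = 8, 2000, 2                           # freq. bins, frames, sources (= mics)

# synthetic sources matching IVA's vector model: one time activation shared
# by all frequencies of a source, times circular Gaussian noise
act = rng.gamma(0.5, 1.0, size=(N, T))
S = np.sqrt(act)[None] * (rng.standard_normal((F, N, T))
                          + 1j * rng.standard_normal((F, N, T)))
A = rng.standard_normal((F, N, N)) + 1j * rng.standard_normal((F, N, N))
X = A @ S                                       # (F, N, T) observed mixtures

W = np.tile(np.eye(N, dtype=complex), (F, 1, 1))
for _ in range(30):
    Y = W @ X
    r = np.sqrt((np.abs(Y) ** 2).sum(axis=0))   # (N, T) frequency-vector norms
    for k in range(N):
        # weighted covariance V_k, then the IP update of the k-th demixing row
        Vk = (X / np.maximum(r[k], 1e-12)) @ X.conj().transpose(0, 2, 1) / T
        ek = np.tile(np.eye(N)[:, [k]], (F, 1, 1))
        wk = np.linalg.solve(W @ Vk, ek)        # w_k = (W V_k)^{-1} e_k
        scale = np.sqrt((wk.conj().transpose(0, 2, 1) @ Vk @ wk).real)
        W[:, k, :] = (wk / np.maximum(scale, 1e-12))[:, :, 0].conj()

Y = W @ X
env = (np.abs(Y) ** 2).sum(axis=0)              # estimated activations (N, T)
```

With the model matched to the data, the estimated activations should line up with the true ones under a single global permutation, i.e., no frequency-wise permutation problem arises.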
23. 23
Frequency
Time
TF
structure
in IVA
Frequency
Time
Frequency-uniform vector
Time activation
Frequency
Basis
Basis
Time
# of bases can arbitrarily be set
To represent more complicated TF structure,
NMF modeling can be introduced, resulting in
independent low-rank matrix analysis (ILRMA)
Extension of TF structure assumed in IVA
Frequency
Time
TF
structure
in ILRMA
24. 24
ILRMA
• Independent low-rank matrix analysis (ILRMA)
– assumes each source has a low-rank TF structure
– is a unification of
• independence-based estimation of demixing matrix (FDICA or IVA)
• low-rank TF modeling of each source (NMF)
– avoids encountering the permutation problem
• A TF structure model is introduced, as in IVA
[Kitamura+,
2016]
Observed signal
Frequency-wise
demixing matrix
Estimated signal
Time
Frequency
Frequency
Time
Update demixing matrix so that estimated signals
are 1. mutually independent (ICA)
2. have low-rank TF structures (NMF)
STFT
Low-rank approximation by NMF
Low rank Low rank
Not low rank
25. 25
• Independent low-rank matrix analysis (ILRMA)
– Optimization problem in ILRMA
– Convergence-guaranteed
update rules
• NMF’s multiplicative update
• AuxIVA (IP)
ILRMA
[Kitamura+,
2016]
Cost function in FDICA or IVA
Estimates frequency-wise
demixing matrix
Cost function in NMF
Estimates low-rank TF structure
of each source
MATLAB code: https://github.com/d-kitamura/ILRMA
Python code: Implemented in “Pyroomacoustics” library
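The NMF half of ILRMA can be sketched in isolation: the convergence-guaranteed multiplicative updates for the Itakura-Saito divergence applied to a power spectrogram. The surrogate data, sizes, and initialization below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
F, T, K = 30, 40, 4                       # freq bins, frames, basis count
P = rng.gamma(1.0, 1.0, size=(F, T))      # surrogate power spectrogram |Y|^2
Tmat = rng.random((F, K)) + 0.1           # basis matrix (initialized randomly)
V = rng.random((K, T)) + 0.1              # activation matrix

def is_div(P, R):
    # Itakura-Saito divergence between spectrogram P and model R = T V
    return np.sum(P / R - np.log(P / R) - 1)

d0 = is_div(P, Tmat @ V)
for _ in range(100):
    R = Tmat @ V
    # multiplicative updates with the 1/2 exponent (monotone for IS-NMF)
    Tmat *= np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T))
    R = Tmat @ V
    V *= np.sqrt((Tmat.T @ (P / R**2)) / (Tmat.T @ (1.0 / R)))
d1 = is_div(P, Tmat @ V)
```

In ILRMA these updates alternate with the IP update of the demixing matrices, so the cost is non-increasing at every step.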
26. 26
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
28. 28
Reformulation of BSS
• All of them come from ICA’s cost function
• Source generative model
– corresponds to TF structure model for each source
– is necessary for avoiding the permutation problem
• Better assumption of TF structures
– provides better BSS performance
Freq.
Time
Low-rank
Freq.
Time
Sparse
Freq.
Time
Group-sparse
and more
29. 29
Reformulation of BSS
• Derivation of optimization algorithm
– is problem dependent (depends on TF structure model)
– requires technical knowledge and math skills
• To try various TF structures in plug-and-play manner,
– let’s reformulate BSS problems in a more general form
– then solve it using a TF-structure-independent algorithm
BSS
algorithm
Sparse
Low-rank
Plug and play
Group-sparse
30. 30
Reformulation of BSS
• Generalized optimization problem [Yatabe&Kitamura, 2018]
–
• TF structure model for each source
• Often called “source model” in the context of BSS
• Replace this function in a plug-and-play manner
–
• Comes from ICA theory (the Jacobian between the observed and estimated signals)
• Interpreted as a “barrier function” preventing the demixing matrix from becoming rank-deficient
32. 32
Reformulation of BSS
• Generalized optimization problem [Yatabe&Kitamura, 2018]
– But, how?
• Apply convex optimization technique
– Primal-dual splitting method
– Proximity operator
• If the TF structure model is “proximable”, then we obtain the optimization algorithm!
If we change the TF structure model,
its optimization algorithm can easily be obtained!
Objective
[Condat, 2013], [Vu, 2013], [Komodakis+,
2015]
33. 33
Primal-dual splitting method
• Primal-dual splitting method [Condat, 2013], [Vu, 2013],
– considers following problem
– Iterative optimization algorithm
– Proximity operator
• If the proximity operator of a function can easily be calculated,
the function is called “proximable”
[Komodakis+,
2015]
Step size parameters
g and h: proper lower-semicontinuous
convex functions
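As a self-contained illustration of this iteration (on a toy problem, not the BSS problem itself), the sketch below solves a small total-variation denoising problem, min_w 0.5*||w - b||^2 + lam*||Lw||_1, with the Condat-Vu primal-dual splitting; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
b = np.cumsum(rng.standard_normal(n))        # noisy piecewise signal
L = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]     # first-difference operator, (n-1, n)
lam = 2.0
tau, sigma = 0.5, 0.4                        # step sizes; tau*sigma*||L||^2 <= 1

def prox_g(v, t):
    # prox of t*g with g(w) = 0.5*||w - b||^2
    return (v + t * b) / (1 + t)

def prox_h(v, t):
    # prox of t*lam*||.||_1: elementwise soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0)

x = np.zeros(n)
y = np.zeros(n - 1)
for _ in range(500):
    # primal step: prox of g at x - tau * L^T y
    x_new = prox_g(x - tau * (L.T @ y), tau)
    # dual step: prox of sigma*h^* via the Moreau identity
    z = y + sigma * (L @ (2 * x_new - x))
    y = z - sigma * prox_h(z / sigma, 1 / sigma)
    x = x_new

obj = lambda w: 0.5 * np.sum((w - b) ** 2) + lam * np.sum(np.abs(L @ w))
```

Only the two proximity operators are problem-specific; swapping h for another proximable function leaves the iteration unchanged, which is exactly the plug-and-play property exploited in the BSS reformulation.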
34. 34
BSS using Primal-dual splitting method
• Convert BSS to primal-dual-splitting-applicable form
– Vectorization of demixing matrices
– Matrixization
k-th singular value of Wi
Mat to vec, then collect all freqs.
35. 35
BSS using Primal-dual splitting method
• Convert BSS to primal-dual-splitting-applicable form
Introduce vectorized notation
( is a reshaped matrix that includes )
Ready to apply
primal-dual splitting!
C.f. problem for primal-dual splitting
36. 36
BSS using Primal-dual splitting method
• General BSS algorithm using primal-dual splitting
– The function I(w) is always proximable [Yatabe&Kitamura, 2018]
Singular value decomposition
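The proximability of the log-determinant term can be sketched as follows: take the SVD and apply the known scalar prox of -log, prox_{t(-log)}(s) = (s + sqrt(s^2 + 4t))/2, to each singular value. Names and shapes below are illustrative.

```python
import numpy as np

def prox_neg_logdet(W, t):
    # prox of t * I(W), I(W) = -sum_k log(sigma_k(W)), via the SVD of W
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    # scalar prox of -log applied to each singular value
    s_new = (s + np.sqrt(s ** 2 + 4 * t)) / 2
    return U @ np.diag(s_new) @ Vh
```

Since s_new > s > 0 always, the prox pushes every singular value away from zero, which is the "barrier against rank deficiency" interpretation of this term.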
37. 37
BSS using Primal-dual splitting method
• General BSS algorithm using primal-dual splitting
– L2,1 Group sparse BSS (IVA)
– Nuclear-norm-based low-rank BSS (ILRMA?)
Nuclear norm (sum of singular values)
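Both plug-in source models above reduce to simple proximity operators. A numpy sketch, with Y holding one source's spectrogram (frequency along rows, time along columns); names are illustrative.

```python
import numpy as np

def prox_l21(Y, t):
    # L2,1 group sparsity over frequency (IVA-like): shrink each time frame's
    # frequency vector by its L2 norm
    norms = np.linalg.norm(Y, axis=0, keepdims=True)
    scale = np.maximum(1 - t / np.maximum(norms, 1e-12), 0)
    return scale * Y

def prox_nuclear(Y, t):
    # nuclear norm (ILRMA-like low-rankness): soft-threshold singular values
    U, s, Vh = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0)) @ Vh

Y = np.array([[3.0, 0.1], [4.0, 0.1]])   # toy "spectrogram"
Z = prox_l21(Y, 1.0)                     # zeroes the weak time frame
```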
38. 38
BSS using Primal-dual splitting method
• Multiple TF structures can also be utilized
– L2,1 group-sparse + L1 sparse BSS (sparse IVA)
– Low-rank + L1 sparse BSS (sparse ILRMA?)
Proximable Proximable
Proximable Proximable
If TF structure models are proximable,
you can use them in a plug-and-play manner!
Advantage of proposed BSS
39. 39
BSS using Primal-dual splitting method
• Experiment of two-speech-source BSS
– Compare improvement of source-to-distortion ratio (SDR)
Mixture A Mixture B
Group-sparse
Group-sparse + sparse
Low-rank + sparse
Low-rank Group-sparse
Group-sparse + sparse
Low-rank + sparse
Low-rank
40. 40
Interpretation of TF masking
• Proximity operators of many sparsity-inducing
functions are obtained as thresholding operators
– L1 norm:
– L2,1 norm:
– They have the same form: TF masking applied to the variable
Proximity operator TF mask (0~1 values)
determined by TF structure model
Variable in
TF shape
Elementwise product
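The claim can be checked directly in numpy: the soft-thresholding prox of the L1 norm factors exactly into an elementwise 0-to-1 mask times the variable. A hypothetical minimal sketch:

```python
import numpy as np

def soft_threshold(Y, t):
    # prox of t*||.||_1: elementwise soft thresholding
    return np.sign(Y) * np.maximum(np.abs(Y) - t, 0)

def l1_mask(Y, t):
    # the same operator written as a TF mask (values in [0, 1])
    return np.maximum(1 - t / np.maximum(np.abs(Y), 1e-12), 0)

Y = np.array([[2.0, -0.3], [-1.5, 0.05]])  # toy TF-domain variable
t = 0.5
# soft_threshold(Y, t) == l1_mask(Y, t) * Y, elementwise
```

This identity is what lets TFMBSS replace the explicit structure-model function with a user-designed mask.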
41. 41
TFMBSS
• Time-frequency-masking-based BSS (TFMBSS)
– Skips designing the TF structure model function
– A TF mask of the intended TF structure is employed in the
optimization algorithm
[Yatabe&Kitamura, 2021]
1. Design intended TF structure model
2. Calculate proximal operator
3. Optimize the problem
BSS based on primal-dual
splitting method TFMBSS
???
1. ―
2. Design intended TF mask
3. Optimize the problem
[Yatabe&Kitamura, 2019]
42. 42
TFMBSS
• Time-frequency-masking-based BSS (TFMBSS)
– The intended TF structure model is input to TFMBSS as a TF mask
– The demixing matrix is optimized so that the estimated signals
have the intended TF structures
– Iterative update of the TF masks is also interesting
Mixture
Frequency-wise
demixing matrix
Time
Frequency
Frequency
Time
Update demixing matrix so that the estimated signals
have TF structures enhanced by the input TF masks
STFT
Enhancement by TF masking
Time
Frequency
Frequency
Time
Time
Frequency
Frequency
Time
Estimates
[Yatabe&Kitamura, 2021]
[Yatabe&Kitamura, 2019]
43. 43
Application of TFMBSS
• HPSS-based TFMBSS [Oyabu&Kitamura, 2021]
– utilizes a TF mask obtained via harmonic-percussive sound separation (HPSS) in TFMBSS
44. 44
• HPSS-based TFMBSS [Oyabu&Kitamura, 2021]
Mixture
Optimization-
based HPSS
[Ono+, 2008]
Median-based
HPSS
[FitzGerald, 2010]
Optimization-
based HPSS
+
TFMBSS
Median-
based HPSS
+
TFMBSS
Application of TFMBSS
Linear, multichannel
Estimated percussive sound
Estimated harmonic sound
Nonlinear, single-channel
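Median-based HPSS [FitzGerald, 2010] itself can be sketched with two median filters: harmonic energy is smooth along time, percussive energy along frequency, and comparing the two filtered spectrograms yields the binary TF masks. The toy spectrogram and filter width below are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filt(P, axis, width):
    # 1-D running median along the given axis, with edge padding
    pad = [(0, 0), (0, 0)]
    pad[axis] = (width // 2, width // 2)
    Pp = np.pad(P, pad, mode="edge")
    win = sliding_window_view(Pp, width, axis=axis)
    return np.median(win, axis=-1)

P = np.zeros((32, 32))        # toy power spectrogram (freq x time)
P[10, :] = 1.0                # a horizontal ridge: harmonic (sustained tone)
P[:, 20] = 1.0                # a vertical ridge: percussive (transient)

H = median_filt(P, axis=1, width=9)    # smooth along time -> harmonic part
Pc = median_filt(P, axis=0, width=9)   # smooth along frequency -> percussive
mask_h = H >= Pc                        # binary TF mask
harmonic = P * mask_h
percussive = P * (~mask_h)
```

In the HPSS-based TFMBSS above, a mask of this kind (binary or softened) is what gets plugged into the optimization in place of an explicit structure-model function.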
45. 45
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
46. 46
Application of TFMBSS
• Audio BSS with TF structure model
– TF structure model is necessary for avoiding the
permutation problem
• Conventional algorithms (IVA, ILRMA, and so on)
– Which TF structure is the best? Trial and error
– The optimization algorithm is problem-dependent
• Changing TF structure model requires derivation of the algorithm
• Proposed generalized BSS using primal-dual splitting
– Easy to replace TF structure model
• (if the function is “proximable”)
– Easy to search the best TF structure for each BSS problem
• TFMBSS
– Explicitly define TF structure as TF masking
Speaker notes
Hi everyone, thank you for coming to my overview presentation.
The title is Blind Audio Source Separation Based on Time-Frequency Structure models
First of all, let me introduce myself.
This is the contents of this talk; Background, Preliminaries, main topic, and conclusion.
The first topic is background.
This talk treats blind source separation problem, BSS, which is a separation technique of individual sources from the recorded mixture.
The word “blind” means “unsupervised”.
Thus, the BSS method does not require any prior information about the recording conditions and sources, such as locations of microphones, sources, room geometry, training dataset of sound sources, and so on.
This kind of technique is very useful for many applications.
For example, hearing aid systems, automatic speech recognition, and preprocessing for music analysis.
This is a demonstration of music BSS using the method called ILRMA.
Here we have a mixture signal of three parts, which was recorded using three microphones.
Please listen carefully to the three parts: guitar, vocal, and keyboard. OK? Let’s listen.
Then, if we apply ILRMA to this multichannel signal, we can obtain this kind of estimates.
So, we can remix them, re-edit them, or anything we want. This is a source separation.
By the way, the source code of ILRMA is available here, so please check it.
In BSS for audio signals, numbers of microphones and sources are important.
In this talk, we only consider a “determined” situation, namely, the numbers of microphones and sources are equal.
If we want to separate three sources, we have to put three microphones.
In the determined situation, the BSS problem becomes an estimation of the demixing system W, which is an inverse system of the mixing system A.
Here we show the historical overview in this slide, where only the related methods are shown here.
There are three columns, determined, underdetermined, and single-channel.
The origin of determined BSS is independent component analysis, ICA.
And the important methods in this talk are IVA, AuxIVA, and ILRMA.
In this talk, we review this column, namely, from ICA to the newest method called TFMBSS from the viewpoint of the utilized time-frequency structure models in each method.
I here explain the motivation of this talk.
Conventional determined BSS methods have several advantages.
One is a minimum distortion.
Since these algorithms separate sources by multiplying frequency-wise demixing matrices, we can avoid artificial distortion as much as possible.
Another advantage is a fast and stable optimization.
In AuxIVA, a very efficient algorithm called iterative projection was proposed, and this advantage was inherited by ILRMA.
IVA and ILRMA assume their own time-frequency structure models.
However, if this model does not fit the actual sources in the mixture, the BSS performance is degraded.
So, we want to try various TF structure models in BSS.
But we need to derive the optimization algorithms for each of TF structure models.
Motivated by this issue, we propose a new BSS algorithm that can easily replace TF structure model and can easily search the best one.
This is the main topic of this talk.
[5 min]
The next one is Preliminaries.
I’m gonna review the conventional methods from ICA to ILRMA.
ICA is a fundamental algorithm for BSS.
ICA assumes that the source distributions are mutually independent and non-Gaussian.
Also, the mixing system is modeled by a multiplication of mixing matrix A, which is invertible and time-invariant.
Based on these assumptions, ICA estimates the demixing matrix W, which is ideally an inverse matrix of A.
The estimation theory in ICA is here.
ICA minimizes the similarity between these distributions.
This is equivalent to a maximization of independence between the separated sources.
Since the separated signal y includes the demixing matrix, the optimization problem in ICA can be formulated as this problem, where p(y) is a non-Gaussian source distribution we need to assume.
So, we find W that minimizes this function.
However, ICA has two ambiguities: scales and permutation.
ICA cannot determine the scales and the order of the estimated signals.
In particular, the permutation ambiguity will be a serious problem in an audio BSS problem.
For audio mixture signals, simple ICA cannot separate the sources.
This is because the mixture of audio signals is not the multiplication of A but the convolution of mixing filters, which is due to the room reverberation.
To deconvolute the mixture, we apply short-time Fourier transform and convert signals to TF domain.
Since convolution in the time domain becomes multiplication in the TF domain, we can apply ICA and estimate the frequency-wise demixing matrix.
This method is called frequency-domain ICA, FDICA in short.
We apply ICA to each of frequencies separately.
Then, we estimate the demixing matrix Wi, where i is the index of frequencies and j is the index of time frames.
Optimization problem in FDICA is formulated like this, and p(y) is a source distribution in the TF domain.
Complex Laplace distribution, shown here, is often used for this assumption, and the minimization problem can be obtained like this.
However, FDICA encounters the serious problem, which is so-called the permutation problem.
In FDICA, simple ICA is performed in each frequency separately.
Therefore, the order of the estimated signals is messed up along the frequency axis.
Even if we completely separate the sources in each frequency, we have to align their order along the frequency axis.
Several permutation solvers have been proposed so far.
I here listed popular permutation solvers.
Before 2006, the permutation solver was a post-processing step (go back) as shown in this figure, which uses correlation between frequencies or direction of arrival.
Then, independent vector analysis, IVA, and independent low-rank matrix analysis, ILRMA, were proposed.
These methods are a unification of ICA and permutation solver.
From this slide, we review the important BSS algorithms, IVA and ILRMA, from the viewpoint of the TF structure models.
IVA is a multivariate extension of FDICA, namely, IVA utilizes sourcewise frequency vector as a random variable to unify all the frequency components in the estimation of ICA.
IVA assumes a joint distribution of all the frequency components as a source distribution p(s).
In addition, this distribution p(s) has an inner structure, a co-occurrence of all the frequency components.
This model is called the “spherical property” of a multivariate distribution; anyway, IVA assumes the co-occurrence of all the frequency components in the same source, which is depicted in this figure.
By the assumption of this TF structure for each source, Wi is estimated so that the permutation problem does not arise.
[10 min]
The question is how valid IVA’s TF structure model is.
I here showed the time-frequency powers of speech and vocal sources.
As you can see, typical audio sources have co-occurrence of all the frequencies when the source is active, and IVA’s assumption seems to be valid.
Also, this structure can be interpreted as group sparsity in the TF domain.
The optimization problem in IVA can be defined like this, and the joint distribution p enforces the previous TF structure by assuming the spherical distribution here.
For example, when we assume a spherical Laplace distribution, this model, the minimization problem in IVA becomes as shown in the bottom.
In the original IVA paper, this problem was optimized by a simple gradient descent, but
in 2011, an efficient update algorithm for IVA was proposed, which is called AuxIVA.
It provides elegant update rules called iterative projection, IP, and established convergence-guaranteed fast optimization without stepsize parameters.
This graph shows the value of cost function and the number of iterations.
AuxIVA sufficiently converges in fewer than 20 updates.
I play the sound demo of AuxIVA.
In 2016, we extended the TF structure model in IVA to a richer one.
IVA assumes the uniform co-occurrence of all the frequencies.
This can be considered as a rank-1 time-frequency structure, namely, frequency-uniform vector is activated along time axis.
As already shown, this model is valid for typical audio signals, but it may be too simple because audio sources have a harmonic frequency structure.
To represent more complicated TF structure, we proposed independent low-rank matrix analysis, ILRMA, which employs NMF modeling as a TF structure.
In ILRMA, the single uniform frequency vector in IVA is extended to multiple complicated vectors, and a more accurate spectrogram can be modeled as a low-rank matrix.
Such an accurate TF model will improve the estimation performance of the frequency-wise demixing matrices.
ILRMA assumes that each source has a low-rank TF structure, and the rank of the mixture spectrogram increases.
Thus, by enforcing the low-rankness of each estimated signal in the TF domain, the demixing matrix can avoid encountering the permutation problem, and richer TF structure model than IVA will improve the BSS performance.
[14 min]
The optimization problem in ILRMA is shown here.
We find Wi, and the NMF variables Tn and Vn that minimize this cost function.
(Click) The first and second terms of this function coincide with the cost function in NMF, (click) and the second and third terms coincide with the cost function in FDICA or IVA.
(Click) Thus, we can iterate the NMF update rules and the IP-based update of the demixing matrix.
This iteration guarantees the theoretical convergence.
This graph shows the behavior of the cost function value.
ILRMA converges in less than 100 iterations.
Let’s play the sample sounds.
This result is better than that of IVA.
[about 15 min]
Let’s move on to the main topic of this talk.
So far, we showed the cost functions of FDICA, IVA, and ILRMA, which are listed in this slide.
We can see that they have similar forms.
This is because
all of them are coming from the original ICA’s cost function, this one, and the difference is just an assumption of the source distribution p(Y), which is often called source generative model.
This generative model corresponds to the TF structure model for each source, and this model is necessary for avoiding the permutation problem.
Of course, better assumption of TF structures provides better BSS performance, but the suitable TF structure model depends on the type of sources, such as speech, music, harmonic source, percussive source, noise source, and so on.
Therefore, we have to search the best TF structure model with a try-and-error approach.
However, in the conventional methods, it is difficult to replace the TF structure model because we have to derive the optimization algorithm, which requires technical knowledge and math skills.
If we derive a general BSS algorithm, and if we can replace the TF structure model in a plug-and-play manner, it is very useful to search the best model for each problem.
So, to try various TF structure models in a “plug-and-play manner”, first, we reformulate the BSS problem in a more general form.
Then, we solve it using a TF-structure-independent algorithm.
[17 min]
This problem is our proposed generalized BSS problem, which includes FDICA, IVA, and ILRMA.
The function P(W, X) corresponds to the TF structure model we assume, which is often called the source model.
By replacing the function P, we can try various TF structure models.
The negative log-determinant term is coming from an original ICA theory.
We can interpret this function as a “barrier function” preventing Wi from becoming rank-deficient.
If Wi becomes a rank-deficient matrix, its determinant becomes zero, and this term becomes infinity.
So, we can avoid such solution in the optimization.
[18 min]
For the conventional BSS algorithm, the function P(W, X) corresponds to these functions, respectively.
FDICA corresponds to an L1-norm sparse regularizer, and IVA is an L2,1-norm group-sparse regularizer.
ILRMA is a little bit difficult, but still we can represent it using an argument minimum as shown here, where DIS is an Itakura-Saito divergence.
The objective of this reformulation is that if we change the TF structure model P, its optimization algorithm can easily be obtained.
This is because we want to establish a new BSS algorithm with plug-and-play TF structure models.
But the question is, how can we do that?
The idea is coming from a convex optimization field.
We utilize an algorithm called “primal-dual splitting method”.
In this algorithm, we need a proximity operator of the function P.
The function whose proximity operator can easily be calculated is called “proximable”.
So, if the TF structure model P is proximable, we can obtain the optimization algorithm for this generalized BSS problem.
Primal-dual splitting method considers this problem.
Minimize over the vector w the function g(w) + h(Lw), where L is just a matrix.
This minimization can be solved by this iterative optimization algorithm. This is a primal-dual splitting method.
In the first line, we calculate the proximity operator of the function g with this input.
Then, the second line calculates the new input z, and in the third line, we calculate the proximity operator of the function h with the input z.
By iterating these three steps, we can minimize this cost function.
Prox is a regularized minimization of the function f in the neighborhood of input x, which always has a unique solution.
We do not dive into the details of this algorithm in this overview, but you can refer to these papers to learn the theory of the method.
The important point is that we can use any function P, any TF structure, if the functions P are all proximable.
We just switch the proximity operator of P according to the recipe of well-known proximity operators of popular functions.
[21 min]
The goal is to convert this minimization function to the primal-dual-splitting-applicable form.
So, we convert the original function into this form.
As a first step, we rewrite the determinant of Wi in terms of the singular values sigma using this equation.
Next, we vectorize the demixing matrices Wi with this computation, where V is a linear operator converting a matrix Wi into a vector.
And we also define the inverse operation M, namely, M is a linear operator converting the vector w back into the matrices Wi.
By introducing the vectorization, we get this function. It's almost there.
Then, we define I(w) like this, and now we are ready to apply the primal-dual splitting method.
Now we have the same form as this original function.
In summary, we defined the general BSS algorithm as this minimization problem, and we can optimize this using a primal-dual splitting method.
The algorithm is shown here.
And we have a proximity operator of a new function I in this line.
I(w) is a sum of the logarithms of the singular values. The proximity operators of the logarithm function and of functions of singular values are well known.
Thus, we can easily obtain the proximity operator of I(w) as shown in the bottom of this slide.
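As a hedged sketch of that computation (function and variable names are illustrative): the prox is obtained by taking the SVD and applying the standard scalar prox of the negative logarithm to each singular value.

```python
import numpy as np

def prox_neg_logdet(W, tau):
    """Prox of -tau * sum_j log(sigma_j(W)), computed via the SVD of W.

    The scalar prox of -tau*log at y solves x**2 - y*x - tau = 0, giving
    x = (y + sqrt(y**2 + 4*tau)) / 2, applied to every singular value.
    """
    U, s, Vt = np.linalg.svd(W)
    s_new = (s + np.sqrt(s**2 + 4.0 * tau)) / 2.0
    return (U * s_new) @ Vt

# For the identity matrix and tau = 1, each singular value 1 is mapped to
# (1 + sqrt(5)) / 2, so the result is the golden ratio times the identity.
P = prox_neg_logdet(np.eye(2), 1.0)
```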
OK, let me see how IVA and ILRMA are defined in this BSS formulation.
The TF structure assumed in IVA is group sparseness, which can be defined as L2,1 norm of the estimated spectrogram Yn.
So, we replace the function P with the L2,1 norm, and we do not have to re-derive the algorithm.
The proximity operator of L2,1 norm is obtained like this, so we use this calculation in the third line of this algorithm.
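A minimal sketch of this group soft thresholding, assuming here that each row of the matrix is one group (in IVA the group would be the frequency vector of one source at one frame):

```python
import numpy as np

def prox_l21(Z, tau):
    """Prox of tau * ||Z||_{2,1}: group soft thresholding.

    Each row (group) is scaled by max(1 - tau / ||row||_2, 0), so rows
    with small L2 norm are set to exactly zero.
    """
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * Z
```

A row with norm 5 is shrunk by a factor 0.8 when tau = 1, while a row with norm below tau vanishes entirely, which is exactly the group-sparsity effect assumed in IVA.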
Next, ILRMA assumes the low-rank TF structure by applying NMF to the estimated spectrogram Yn.
Instead of NMF, we use a nuclear norm to represent the low-rank regularization.
Again, the proximity operator of the nuclear norm is well-known.
We can obtain the optimization algorithm by replacing the third line to this calculation.
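A minimal sketch of that well-known prox: soft thresholding applied to the singular values (names are illustrative):

```python
import numpy as np

def prox_nuclear(Z, tau):
    """Prox of tau * ||Z||_* (nuclear norm): singular value soft thresholding.

    Small singular values are set to zero, which lowers the rank of the
    result and thus induces the low-rank TF structure.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)
    return (U * s_thr) @ Vt
```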
From this, we can see that the proposed algorithm can handle various TF structures within a unified algorithm, which is very useful for searching for the best TF structure.
In addition, multiple TF structures can also be utilized.
For example, group sparse + sparse BSS can be defined like this function, which can be interpreted as a sparse IVA.
Since these functions are both proximable, we can obtain the optimization algorithm.
As another example, low-rank + sparse BSS can also be defined as sparse ILRMA like this problem.
As you can see, the important point is that, when you want to utilize a new TF structure model P, check whether P is proximable.
If P is proximable, you can use it in the proposed BSS algorithm in a plug-and-play manner.
This is a strong advantage of the proposed BSS.
These graphs show the BSS performance of two-speech mixtures with AuxIVA and various TF structures.
The vertical axis shows SDR improvements, which indicates the separation performance.
And the horizontal axis shows the number of iterations in each algorithm.
Since the group-sparse model is equivalent to the IVA model, it provides exactly the same performance at the converged point.
Low-rank model is similar to ILRMA, and group sparse + sparse model is a sparsity-induced IVA.
Also, low-rank + sparse is a sparse version of ILRMA.
Again, we can easily compare which TF structure model is the best for the speech source separation.
In this experiment, the low-rank + sparse model provides the best performance for both mixture samples.
Now we have extended the proposed BSS algorithm to a more explicit formulation; namely, we do not assume a function P but directly introduce a TF mask as the intended TF structure.
Let me explain this extension as a final topic of this talk.
It is known that the proximity operators of many sparsity-inducing functions are obtained as thresholding operators.
For example, the prox of the L1 norm is obtained like this, and this calculation is a soft thresholding of the input variable because the scaling term takes a value between 0 and 1.
The prox of the L2,1 norm also becomes a soft thresholding.
Since the input vector z includes the spectrograms of the estimated signals, this elementwise soft thresholding can be interpreted as time-frequency soft masking.
Namely, the calculation of the proximity operator in the third line of the algorithm is just the application of a TF soft mask defined by the intended TF model and the current optimization variable Z.
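A small sketch of this mask interpretation for the L1 norm, assuming complex-valued spectrogram entries (names are illustrative):

```python
import numpy as np

def l1_soft_mask(Z, tau):
    """TF soft mask equivalent to the prox of tau * ||.||_1.

    Each element is scaled by max(1 - tau / |z|, 0), a value in [0, 1],
    so applying the mask reproduces complex soft thresholding.
    """
    mag = np.maximum(np.abs(Z), 1e-12)
    return np.maximum(1.0 - tau / mag, 0.0)

Z = np.array([[2.0 + 0j, 0.5j], [-3.0 + 0j, 0.1 + 0j]])
mask = l1_soft_mask(Z, 1.0)
prox = mask * Z  # identical to elementwise complex soft thresholding
```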
This fact tells us that we don’t have to design a TF structure function P.
All we have to do is design a TF mask for the intended TF structure.
From this motivation, we proposed time-frequency-masking-based BSS, TFMBSS in short.
The difference between the previous general BSS and TFMBSS is shown here.
In the previous algorithm, we had to design the TF model function P, and we obtain its proximity operator.
In TFMBSS, we skip designing the function P, and we directly design the intended TF mask.
Therefore, we do not need to care about what kind of cost function is minimized in this algorithm.
This figure is a concept of TFMBSS.
We input TF masks as a TF structure model.
And the demixing matrix is optimized so that the estimated signals have the intended TF structures.
Let me introduce one application of TFMBSS.
We utilized a well-known music BSS algorithm called harmonic-percussive sound separation, HPSS, to accurately separate drum sounds and the other musical instruments.
In this method, we apply HPSS to the tentatively estimated signals Zharmonic and Zpercussive independently and produce the masks in a Wiener filtering manner.
These masks are input to TFMBSS as a TF structure model. This process is iterated until convergence, so in each iteration of TFMBSS, two HPSS runs are performed.
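The talk does not specify which HPSS implementations were used; as one hedged example, a common median-filtering HPSS (Fitzgerald-style) that produces Wiener-like masks could be sketched as follows, with the filter length k as an assumed parameter:

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_wiener_masks(S, k=17):
    """Median-filtering HPSS producing complementary Wiener-like masks.

    S: magnitude spectrogram (frequency x time). Median filtering along
    time keeps horizontal (harmonic) structure; filtering along frequency
    keeps vertical (percussive) structure.
    """
    H = median_filter(S, size=(1, k))  # smooth along time -> harmonic
    P = median_filter(S, size=(k, 1))  # smooth along frequency -> percussive
    eps = 1e-12
    Mh = H**2 / (H**2 + P**2 + eps)    # Wiener-style harmonic mask
    return Mh, 1.0 - Mh

# Synthetic check: a sustained tone (horizontal line) and a click
# (vertical line) should be assigned to the two masks respectively.
S = np.zeros((32, 64))
S[10, :] = 1.0    # sustained tone -> harmonic
S[:, 20] += 1.0   # click -> percussive
Mh, Mp = hpss_wiener_masks(S, k=17)
```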
This is a demonstration.
We utilized two types of HPSS.
Since HPSS is a single-channel nonlinear algorithm, artificial distortions may arise.
If we have a multichannel observation, we can use these HPSS methods within TFMBSS and achieve linear, distortionless separation.
The red cells are harmonic estimates, and the blue ones are the percussive estimates.
As you can see, TFMBSS provides better separation.