Daichi Kitamura, "Blind audio source separation based on time-frequency structure models," Invited Overview Session in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2021), Tokyo, Japan, December 2021.
"Exploring the Essential Functions and Design Considerations of Spillways in ...
Blind audio source separation based on time-frequency structure models
1. 13th Asia Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA ASC 2021)
Overview Session OS-1: Acoustic Signal Processing
Blind Audio Source Separation Based
on Time-Frequency Structure Models
Daichi Kitamura
National Institute of Technology, Kagawa College
Japan
2. 2
• Daichi Kitamura
• National Institute of Technology, Kagawa College
• Research interests
– Audio source separation
– Array signal processing
– Machine learning
– Music signal processing
– Biosignal processing
Self introduction
3. 3
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
4. 4
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
5. 5
• Blind source separation (BSS) for audio signals
– estimates specific audio sources in the observed mixture
– does not require prior information of recording conditions
• locations of mics and sources, room geometry, timbres, etc.
• The word “blind” means “unsupervised”.
– is available for many audio applications
• Hearing aid systems
• Automatic speech recognition (ASR)
• Preprocessing for music analysis etc.
Background: BSS for audio signals
Observed mixture
BSS
Estimated source signals
6. 6
Background: BSS for audio signals
• Music BSS using ILRMA
Guitar
Vocal
Keyboard
Guitar
Vocal
Keyboard
BSS
Please listen carefully to the three
parts in the mixture.
MATLAB code: https://github.com/d-kitamura/ILRMA
Python code: Implemented in “Pyroomacoustics” library
7. 7
• Numbers of mics and sources
• Consider only “determined” situation
– # of mics = # of sources
– BSS estimates “demixing system” (inverse of mixing)
Background: BSS for audio signals
Source signals Observed signals Estimated signals
Mixing system Demixing system
Monaural rec.
1ch
Single-channel signal Mic array
1ch
Mch
Multichannel signal
2ch
8. 8
Spectral subtraction
Time-frequency masking
Many other methods
Beamforming
Sparse coding
Time-frequency masking
DOA clustering
Many other methods
Historical overview (only the methods related to this talk)
1994
1998
2013
1999
2012
Permutation solvers
Extension of models
Generative models
Frequency-domain ICA
Itakura-Saito NMF
IVA
2016
2009
2006
2011 AuxIVA
Time-varying IVA
Multichannel NMF
2018 IDLMA
Single-channel
Spatial covariance model
Spatial covariance
model+DNN
Supervised approaches
based on deep neural
networks (DNN)
ICA
[Comon], [Bell and Sejnowski],
[Cardoso], [Amari], [Cichocki], …
[Smaragdis]
[Saruwatari], [Murata],
[Morgan], [Sawada], …
[Hiroe], [Kim]
[Ono]
[Ono]
[Kitamura]
[Nugraha]
[Ozerov, Sawada]
[Duong]
[Févotte]
[Lee]
[Virtanen], [Smaragdis],
[Kameoka], [Ozerov], …
2010
Underdetermined
Determined
[Yatabe&Kitamura]
2021
Time-freq.-masking-
based BSS (TFMBSS)
[Mogami]
NMF
ILRMA
Gray-colored methods
are “supervised”
(not fully blind)
9. 9
Motivation of determined BSS
• Conventional BSS: IVA, AuxIVA, and ILRMA
– Minimum distortion (linear demixing)
– Relatively fast and stable optimization
• Iterative projection (AuxIVA) [Ono+, 2010], [Ono, 2011]
– Time-frequency (TF) structure model affects performance
• IVA: co-occurrence along frequency axis
• ILRMA: NMF-based low-rank time-frequency structure
– Optimization algorithm depends on the TF model
• Difficult to derive update rules
• Easily replace TF model and search the best one
– Time-frequency-masking-based BSS (TFMBSS)
: frequency bins
Observed
signal
Source signals
Frequency-wise mixing matrix
: time frames
Estimated
signal
Frequency-wise demixing matrix
[Yatabe & Kitamura, 2021]
10. 10
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
11. 11
Independence-based BSS in time domain
• Independent component analysis (ICA) [Comon, 1994]
– If we assume
– then we can estimate demixing matrix
• by maximizing independence between the estimates ( and )
Mixing matrix
Sources
(latent components)
1. Mutually
independent
2. Non-Gaussian
3. Invertible and
time-invariant
Mixtures
(observed signals)
Inverse matrix
12. 12
• Independent component analysis (ICA) [Comon, 1994]
– Maximizes independence between source distributions
– Optimization problem in ICA
Independence-based BSS in time domain
Minimize
similarity
: Non-Gaussian source distribution
(e.g., Laplace distribution)
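As a concrete illustration of this minimization, here is a minimal numpy sketch (not from the slides) of natural-gradient ICA on a 2x2 instantaneous mixture of Laplace sources; the mixing matrix, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
S = rng.laplace(size=(2, T))               # two independent non-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # illustrative mixing matrix
X = A @ S                                   # observed mixtures

W = np.eye(2)                               # demixing matrix estimate
mu = 0.05                                   # step size (illustrative)
for _ in range(500):
    Y = W @ X
    phi = np.sign(Y)                        # score function of a Laplace prior
    # natural-gradient update: W <- W + mu * (I - E[phi(y) y^T]) W
    W += mu * (np.eye(2) - (phi @ Y.T) / T) @ W

Y = W @ X
# each estimate should match one source up to scale and permutation
C = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
```

Note the two ambiguities of the next slide are visible here: the correlation check ignores scale, and the recovered order may be swapped.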
13. 13
Independence-based BSS in time domain
• Independent component analysis (ICA) [Comon, 1994]
– However,
• 1. Signal scales (volumes) cannot be determined
• 2. Signal permutation cannot be determined
Sources
(latent components)
Mixtures
(observed signals)
Sources
(latent components)
Mixtures
(observed signals)
Separated signals
(estimated by ICA)
Separated signals
(estimated by ICA)
14. 14
• General audio mixture
– Convolution with room reverberation
• To deconvolute (separate) them,
– apply short-time Fourier transform (STFT) and convert
signals to TF domain
– estimate frequency-wise demixing matrix
Independence-based BSS in frequency domain
Mixture without reverb.
Mixture with reverb.
Convolutive mixture in time domain
Mixture in TF domain
: freq. index
: time index
Reverb. length
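The move to the TF domain relies on the convolution theorem: circular convolution in time equals frequency-wise multiplication (each STFT frame approximates this when the window is long relative to the reverberation). A small numpy check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 64, 8
h = rng.standard_normal(L)                  # short mixing filter (reverberation)
x = rng.standard_normal(n)                  # source signal

# circular convolution in the time domain
y_time = np.array([sum(h[k] * x[(t - k) % n] for k in range(L))
                   for t in range(n)])
# frequency-wise multiplication in the DFT domain
y_freq = np.fft.ifft(np.fft.fft(np.pad(h, (0, n - L))) * np.fft.fft(x)).real
```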
15. 15
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
– applies ICA to each of frequencies separately
– estimates frequency-wise demixing matrix
Inverse matrix
Frequency-wise
mixing matrix
Frequency-wise
demixing matrix
FDICA
: freq. index
: time index
16. 16
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
– Optimization problem in FDICA
– By assuming a circularly symmetric complex Laplace dist.,
– the minimization problem in FDICA becomes as follows
• separable w.r.t. frequency
FDICA
: Non-Gaussian complex-valued source distribution
(e.g., circularly symmetric complex Laplace distribution)
17. 17
• Permutation problem in FDICA
– The order of the separated signals is messed up
– Alignment along the frequency axis is required
*Signal scales are also messed up, but they can be easily fixed by applying projection back technique.
ICA
In all frequencies
Source 1
Source 2
Mixture 1
Mixture 2
Permutation
Solver
Separated signal 1
Separated signal 2
Time
Permutation problem
18. 18
Popular permutation solvers
• Signal correlation between frequencies
– FDICA + correlation-based clustering [Murata+, 2001], [Sawada+, 2011]
• Direction of arrival of each source (DOA)
– FDICA + DOA-based alignment [Saruwatari+, 2006]
• Co-occurrence among frequencies of each source
– Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] , [Kim, 2007]
• Low-rank TF modeling of each source
– Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
• DNN-based supervised TF modeling of each source
– Independent deeply learned matrix analysis (IDLMA) [Makishima+, 2019]
• DNN-based permutation solver
– Generalized permutation solver with training [Yamaji&Kitamura, 2020]
• Spectrogram consistency
– Consistent IVA and consistent ILRMA [Yatabe, 2020], [Kitamura+, 2020]
19. 19
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– utilizes a sourcewise frequency vector as a random variable
– Vector source model in IVA
• The spherical property of the source distribution groups
components that co-occur across
all frequencies into one source
IVA
Permutation-problem-free estimation
of the demixing matrix can be achieved!
Mixing matrix
Observed vector
Demixing matrix
Estimated vector
Multivariate
distribution
Have internal
correlations
Source vector
Frequency
Time
Co-occurrence of all
frequencies in each source
20. 20
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– How valid is IVA’s TF structure model?
• Typical audio sources have co-occurrence of all frequencies
• Can be interpreted as “group sparsity” in TF domain
IVA
Speech source
(conversation)
Vocal source
(pop music)
21. 21
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– Optimization problem in IVA
– By assuming spherical Laplace dist., [Hiroe, 2006], [Kim, 2006]
– the minimization problem in IVA becomes as follows
IVA
: Non-Gaussian multivariate and spherical complex-
valued source distribution
(e.g., spherical Laplace distribution)
22. 22
• Auxiliary-function-based IVA (AuxIVA)[Ono, 2011]
– Fast and stable optimization called iterative projection (IP)
• Auxiliary function technique (or majorization-minimization algorithm)
– Convergence-guaranteed
fast and stable optimization
without stepsize parameters
Efficient optimization for IVA
Update of auxiliary variables Update of original variables
Python code: Implemented in “Pyroomacoustics” library
https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.bss.auxiva.html
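To make the IP update concrete, here is a hypothetical numpy sketch of AuxIVA run on synthetic data generated exactly according to IVA's vector source model; all sizes and the iteration count are illustrative, and the Pyroomacoustics implementation should be preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, N = 8, 2000, 2                           # freq. bins, frames, sources (= mics)

# synthetic sources matching IVA's vector model: one time activation shared
# by all frequencies of a source, times circular Gaussian noise
act = rng.gamma(0.5, 1.0, size=(N, T))
S = np.sqrt(act)[None] * (rng.standard_normal((F, N, T))
                          + 1j * rng.standard_normal((F, N, T)))
A = rng.standard_normal((F, N, N)) + 1j * rng.standard_normal((F, N, N))
X = A @ S                                       # (F, N, T) observed mixtures

W = np.tile(np.eye(N, dtype=complex), (F, 1, 1))
for _ in range(30):
    Y = W @ X
    r = np.sqrt((np.abs(Y) ** 2).sum(axis=0))   # (N, T) frequency-vector norms
    for k in range(N):
        # weighted covariance V_k, then the IP update of the k-th demixing row
        Vk = (X / np.maximum(r[k], 1e-12)) @ X.conj().transpose(0, 2, 1) / T
        ek = np.tile(np.eye(N)[:, [k]], (F, 1, 1))
        wk = np.linalg.solve(W @ Vk, ek)        # w_k = (W V_k)^{-1} e_k
        scale = np.sqrt((wk.conj().transpose(0, 2, 1) @ Vk @ wk).real)
        W[:, k, :] = (wk / np.maximum(scale, 1e-12))[:, :, 0].conj()

Y = W @ X
env = (np.abs(Y) ** 2).sum(axis=0)              # estimated activations (N, T)
```

With the model matched to the data, the estimated activations should line up with the true ones under a single global permutation, i.e., no frequency-wise permutation problem arises.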
23. 23
Frequency
Time
TF
structure
in IVA
Frequency
Time
Frequency-uniform vector
Time activation
Frequency
Basis
Basis
Time
# of bases can arbitrarily be set
To represent more complicated TF structure,
NMF modeling can be introduced, resulting in
independent low-rank matrix analysis (ILRMA)
Extension of TF structure assumed in IVA
Frequency
Time
TF
structure
in ILRMA
24. 24
ILRMA
• Independent low-rank matrix analysis (ILRMA)
– assumes each source has a low-rank TF structure
– is a unification of
• independence-based estimation of demixing matrix (FDICA or IVA)
• low-rank TF modeling of each source (NMF)
– avoids encountering the permutation problem
• A TF structure model is introduced, as in IVA
[Kitamura+,
2016]
Observed signal
Frequency-wise
demixing matrix
Estimated signal
Time
Frequency
Frequency
Time
Update demixing matrix so that estimated signals
are 1. mutually independent (ICA)
2. have low-rank TF structures (NMF)
STFT
Low-rank approximation by NMF
Low rank Low rank
Not low rank
25. 25
• Independent low-rank matrix analysis (ILRMA)
– Optimization problem in ILRMA
– Convergence-guaranteed
update rules
• NMF’s multiplicative update
• AuxIVA (IP)
ILRMA
[Kitamura+,
2016]
Cost function in FDICA or IVA
Estimates frequency-wise
demixing matrix
Cost function in NMF
Estimates low-rank TF structure
of each source
MATLAB code: https://github.com/d-kitamura/ILRMA
Python code: Implemented in “Pyroomacoustics” library
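The NMF half of ILRMA can be sketched in isolation: the convergence-guaranteed multiplicative updates for the Itakura-Saito divergence applied to a power spectrogram. The surrogate data, sizes, and initialization below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
F, T, K = 30, 40, 4                       # freq bins, frames, basis count
P = rng.gamma(1.0, 1.0, size=(F, T))      # surrogate power spectrogram |Y|^2
Tmat = rng.random((F, K)) + 0.1           # basis matrix (initialized randomly)
V = rng.random((K, T)) + 0.1              # activation matrix

def is_div(P, R):
    # Itakura-Saito divergence between spectrogram P and model R = T V
    return np.sum(P / R - np.log(P / R) - 1)

d0 = is_div(P, Tmat @ V)
for _ in range(100):
    R = Tmat @ V
    # multiplicative updates with the 1/2 exponent (monotone for IS-NMF)
    Tmat *= np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T))
    R = Tmat @ V
    V *= np.sqrt((Tmat.T @ (P / R**2)) / (Tmat.T @ (1.0 / R)))
d1 = is_div(P, Tmat @ V)
```

In ILRMA these updates alternate with the IP update of the demixing matrices, so the cost is non-increasing at every step.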
26. 26
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
28. 28
Reformulation of BSS
• All of them come from ICA’s cost function
• Source generative model
– corresponds to TF structure model for each source
– is necessary for avoiding the permutation problem
• Better assumption of TF structures
– provides better BSS performance
Freq.
Time
Low-rank
Freq.
Time
Sparse
Freq.
Time
Group-sparse
and more
29. 29
Reformulation of BSS
• Derivation of optimization algorithm
– is problem dependent (depends on TF structure model)
– requires technical knowledge and math skills
• To try various TF structures in plug-and-play manner,
– let’s reformulate BSS problems in a more general form
– then solve it using a TF-structure-independent algorithm
BSS
algorithm
Sparse
Low-rank
Plug and play
Group-sparse
30. 30
Reformulation of BSS
• Generalized optimization problem [Yatabe&Kitamura, 2018]
–
• TF structure model for each source
• Often called “source model” in the context of BSS
• Replace this function in a plug-and-play manner
–
• Comes from ICA theory (the Jacobian between the observed and estimated signals)
• Interpreted as a “barrier function” preventing the demixing matrix from becoming rank-deficient
32. 32
Reformulation of BSS
• Generalized optimization problem [Yatabe&Kitamura, 2018]
– But, how?
• Apply convex optimization technique
– Primal-dual splitting method
– Proximity operator
• If the TF structure model is “proximable”, then we obtain the optimization algorithm!
If we change the TF structure model,
its optimization algorithm can easily be obtained!
Objective
[Condat, 2013], [Vu, 2013], [Komodakis+,
2015]
33. 33
Primal-dual splitting method
• Primal-dual splitting method [Condat, 2013], [Vu, 2013],
– considers following problem
– Iterative optimization algorithm
– Proximity operator
• If the proximity operator of a function can easily be calculated,
the function is called “proximable”
[Komodakis+,
2015]
Step size parameters
g and h: proper lower-semicontinuous
convex functions
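As a self-contained illustration of this iteration (on a toy problem, not the BSS problem itself), the sketch below solves a small total-variation denoising problem, min_w 0.5*||w - b||^2 + lam*||Lw||_1, with the Condat-Vu primal-dual splitting; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
b = np.cumsum(rng.standard_normal(n))        # noisy piecewise signal
L = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]     # first-difference operator, (n-1, n)
lam = 2.0
tau, sigma = 0.5, 0.4                        # step sizes; tau*sigma*||L||^2 <= 1

def prox_g(v, t):
    # prox of t*g with g(w) = 0.5*||w - b||^2
    return (v + t * b) / (1 + t)

def prox_h(v, t):
    # prox of t*lam*||.||_1: elementwise soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0)

x = np.zeros(n)
y = np.zeros(n - 1)
for _ in range(500):
    # primal step: prox of g at x - tau * L^T y
    x_new = prox_g(x - tau * (L.T @ y), tau)
    # dual step: prox of sigma*h^* via the Moreau identity
    z = y + sigma * (L @ (2 * x_new - x))
    y = z - sigma * prox_h(z / sigma, 1 / sigma)
    x = x_new

obj = lambda w: 0.5 * np.sum((w - b) ** 2) + lam * np.sum(np.abs(L @ w))
```

Only the two proximity operators are problem-specific; swapping h for another proximable function leaves the iteration unchanged, which is exactly the plug-and-play property exploited in the BSS reformulation.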
34. 34
BSS using Primal-dual splitting method
• Convert BSS to primal-dual-splitting-applicable form
– Vectorization of demixing matrices
– Matrixization
k-th singular value of Wi
Mat to vec, then collect all freqs.
35. 35
BSS using Primal-dual splitting method
• Convert BSS to primal-dual-splitting-applicable form
Introduce vectorized notation
( is a reshaped matrix that includes )
Ready to apply
primal-dual splitting!
C.f. problem for primal-dual splitting
36. 36
BSS using Primal-dual splitting method
• General BSS algorithm using primal-dual splitting
– The function I(w) is always proximable [Yatabe&Kitamura, 2018]
Singular value decomposition
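The proximability of the log-determinant term can be sketched as follows: take the SVD and apply the known scalar prox of -log, prox_{t(-log)}(s) = (s + sqrt(s^2 + 4t))/2, to each singular value. Names and shapes below are illustrative.

```python
import numpy as np

def prox_neg_logdet(W, t):
    # prox of t * I(W), I(W) = -sum_k log(sigma_k(W)), via the SVD of W
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    # scalar prox of -log applied to each singular value
    s_new = (s + np.sqrt(s ** 2 + 4 * t)) / 2
    return U @ np.diag(s_new) @ Vh
```

Since s_new > s > 0 always, the prox pushes every singular value away from zero, which is the "barrier against rank deficiency" interpretation of this term.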
37. 37
BSS using Primal-dual splitting method
• General BSS algorithm using primal-dual splitting
– L2,1 Group sparse BSS (IVA)
– Nuclear-norm-based low-rank BSS (ILRMA?)
Nuclear norm (sum of singular values)
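Both plug-in source models above reduce to simple proximity operators. A numpy sketch, with Y holding one source's spectrogram (frequency along rows, time along columns); names are illustrative.

```python
import numpy as np

def prox_l21(Y, t):
    # L2,1 group sparsity over frequency (IVA-like): shrink each time frame's
    # frequency vector by its L2 norm
    norms = np.linalg.norm(Y, axis=0, keepdims=True)
    scale = np.maximum(1 - t / np.maximum(norms, 1e-12), 0)
    return scale * Y

def prox_nuclear(Y, t):
    # nuclear norm (ILRMA-like low-rankness): soft-threshold singular values
    U, s, Vh = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0)) @ Vh

Y = np.array([[3.0, 0.1], [4.0, 0.1]])   # toy "spectrogram"
Z = prox_l21(Y, 1.0)                     # zeroes the weak time frame
```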
38. 38
BSS using Primal-dual splitting method
• Multiple TF structures can also be utilized
– L2,1 group-sparse + L1 sparse BSS (sparse IVA)
– Low-rank + L1 sparse BSS (sparse ILRMA?)
Proximable Proximable
Proximable Proximable
If TF structure models are proximable,
you can use them in a plug-and-play manner!
Advantage of proposed BSS
39. 39
BSS using Primal-dual splitting method
• Experiment of two-speech-source BSS
– Compare improvement of source-to-distortion ratio (SDR)
Mixture A Mixture B
Group-sparse
Group-sparse + sparse
Low-rank + sparse
Low-rank Group-sparse
Group-sparse + sparse
Low-rank + sparse
Low-rank
40. 40
Interpretation of TF masking
• Proximity operators of many sparsity-inducing
functions are obtained as thresholding operators
– L1 norm:
– L2,1 norm:
– They have the same form: TF masking applied to the variable
Proximity operator TF mask (0~1 values)
determined by TF structure model
Variable in
TF shape
Elementwise product
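The claim can be checked directly in numpy: the soft-thresholding prox of the L1 norm factors exactly into an elementwise 0-to-1 mask times the variable. A hypothetical minimal sketch:

```python
import numpy as np

def soft_threshold(Y, t):
    # prox of t*||.||_1: elementwise soft thresholding
    return np.sign(Y) * np.maximum(np.abs(Y) - t, 0)

def l1_mask(Y, t):
    # the same operator written as a TF mask (values in [0, 1])
    return np.maximum(1 - t / np.maximum(np.abs(Y), 1e-12), 0)

Y = np.array([[2.0, -0.3], [-1.5, 0.05]])  # toy TF-domain variable
t = 0.5
# soft_threshold(Y, t) == l1_mask(Y, t) * Y, elementwise
```

This identity is what lets TFMBSS replace the explicit structure-model function with a user-designed mask.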
41. 41
TFMBSS
• Time-frequency-masking-based BSS (TFMBSS)
– Skips designing the TF structure model function
– A TF mask of the intended TF structure is employed in the
optimization algorithm
[Yatabe&Kitamura, 2021]
1. Design intended TF structure model
2. Calculate proximal operator
3. Optimize the problem
BSS based on primal-dual
splitting method TFMBSS
???
1. ―
2. Design intended TF mask
3. Optimize the problem
[Yatabe&Kitamura, 2019]
42. 42
TFMBSS
• Time-frequency-masking-based BSS (TFMBSS)
– The intended TF structure model is input to TFMBSS as a TF mask
– The demixing matrix is optimized so that the estimated signals
have the intended TF structures
– Iterative update of the TF masks is also interesting
Mixture
Frequency-wise
demixing matrix
Time
Frequency
Frequency
Time
Update demixing matrix so that the estimated signals
have TF structures enhanced by the input TF masks
STFT
Enhancement by TF masking
Time
Frequency
Frequency
Time
Time
Frequency
Frequency
Time
Estimates
[Yatabe&Kitamura, 2021]
[Yatabe&Kitamura, 2019]
43. 43
Application of TFMBSS
• HPSS-based TFMBSS [Oyabu&Kitamura, 2021]
– utilizes a TF mask obtained via harmonic-percussive sound separation (HPSS) in TFMBSS
44. 44
• HPSS-based TFMBSS [Oyabu&Kitamura, 2021]
Mixture
Optimization-
based HPSS
[Ono+, 2008]
Median-based
HPSS
[FitzGerald, 2010]
Optimization-
based HPSS
+
TFMBSS
Median-
based HPSS
+
TFMBSS
Application of TFMBSS
Linear, multichannel
Estimated percussive sound
Estimated harmonic sound
Nonlinear, single-channel
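Median-based HPSS [FitzGerald, 2010] itself can be sketched with two median filters: harmonic energy is smooth along time, percussive energy along frequency, and comparing the two filtered spectrograms yields the binary TF masks. The toy spectrogram and filter width below are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filt(P, axis, width):
    # 1-D running median along the given axis, with edge padding
    pad = [(0, 0), (0, 0)]
    pad[axis] = (width // 2, width // 2)
    Pp = np.pad(P, pad, mode="edge")
    win = sliding_window_view(Pp, width, axis=axis)
    return np.median(win, axis=-1)

P = np.zeros((32, 32))        # toy power spectrogram (freq x time)
P[10, :] = 1.0                # a horizontal ridge: harmonic (sustained tone)
P[:, 20] = 1.0                # a vertical ridge: percussive (transient)

H = median_filt(P, axis=1, width=9)    # smooth along time -> harmonic part
Pc = median_filt(P, axis=0, width=9)   # smooth along frequency -> percussive
mask_h = H >= Pc                        # binary TF mask
harmonic = P * mask_h
percussive = P * (~mask_h)
```

In the HPSS-based TFMBSS above, a mask of this kind (binary or softened) is what gets plugged into the optimization in place of an explicit structure-model function.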
45. 45
Contents
• Background
– Blind source separation (BSS) for audio signals and its history
– Motivation
• Preliminaries
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Time-frequency-masking-based BSS (TFMBSS)
– Reformulation of BSS problems and its optimization
– BSS based on primal-dual splitting method
– Interpretation of TF masking and application
• Conclusion
46. 46
Application of TFMBSS
• Audio BSS with TF structure model
– TF structure model is necessary for avoiding the
permutation problem
• Conventional algorithms (IVA, ILRMA, and so on)
– Which TF structure is the best? Trial and error
– The optimization algorithm is problem-dependent
• Changing TF structure model requires derivation of the algorithm
• Proposed generalized BSS using primal-dual splitting
– Easy to replace TF structure model
• (if the function is “proximable”)
– Easy to search the best TF structure for each BSS problem
• TFMBSS
– Explicitly define TF structure as TF masking
Speaker notes
Hi everyone, thank you for coming to my overview presentation.
The title is Blind Audio Source Separation Based on Time-Frequency Structure models
First of all, let me introduce myself.
This is the contents of this talk; Background, Preliminaries, main topic, and conclusion.
The first topic is background.
This talk treats blind source separation problem, BSS, which is a separation technique of individual sources from the recorded mixture.
The word “blind” means “unsupervised”.
Thus, the BSS method does not require any prior information about the recording conditions and sources, such as locations of microphones, sources, room geometry, training dataset of sound sources, and so on.
This kind of technique is very useful for many applications.
For example, hearing aid systems, automatic speech recognition, and preprocessing for music analysis.
This is a demonstration of music BSS using the method called ILRMA.
Here we have a mixture signal of three parts, which was recorded using three microphones.
Please listen carefully to the three parts: guitar, vocal, and keyboard. OK? Let’s listen.
Then, if we apply ILRMA to this multichannel signal, we can obtain this kind of estimates.
So, we can remix them, re-edit them, or anything we want. This is a source separation.
By the way, the source code of ILRMA is available here, so please check it.
In BSS for audio signals, numbers of microphones and sources are important.
In this talk, we only consider a “determined” situation, namely, the numbers of microphones and sources are equal.
If we want to separate three sources, we have to put three microphones.
In the determined situation, the BSS problem becomes an estimation of the demixing system W, which is an inverse system of the mixing system A.
Here we show the historical overview in this slide, where only the related methods are shown here.
There are three columns, determined, underdetermined, and single-channel.
The origin of determined BSS is independent component analysis, ICA.
And the important methods in this talk are IVA, AuxIVA, and ILRMA.
In this talk, we review this column, namely, from ICA to the newest method called TFMBSS from the viewpoint of the utilized time-frequency structure models in each method.
I here explain the motivation of this talk.
Conventional determined BSS methods have several advantages.
One is a minimum distortion.
Since these algorithms separate sources by multiplying frequency-wise demixing matrices, we can avoid artificial distortion as much as possible.
Another advantage is a fast and stable optimization.
In AuxIVA, a very efficient algorithm called iterative projection was proposed, and this advantage was inherited by ILRMA.
IVA and ILRMA assume their own time-frequency structure models.
However, if this model does not fit the actual sources in the mixture, the BSS performance is degraded.
So, we want to try various TF structure models in BSS.
But we need to derive the optimization algorithms for each of TF structure models.
Motivated by this issue, we propose a new BSS algorithm that can easily replace TF structure model and can easily search the best one.
This is the main topic of this talk.
[5 min]
The next one is Preliminaries.
I’m gonna review the conventional methods from ICA to ILRMA.
ICA is a fundamental algorithm for BSS.
ICA assumes that the source distributions are mutually independent and non-Gaussian.
Also, the mixing system is modeled by a multiplication of mixing matrix A, which is invertible and time-invariant.
Based on these assumptions, ICA estimates the demixing matrix W, which is ideally an inverse matrix of A.
The estimation theory in ICA is here.
ICA minimizes the similarity between these distributions.
This is equivalent to a maximization of independence between the separated sources.
Since the separated signal y includes the demixing matrix, the optimization problem in ICA can be formulated as this problem, where p(y) is a non-Gaussian source distribution we need to assume.
So, we find W that minimizes this function.
However, ICA has two ambiguities: scales and permutation.
ICA cannot determine the scales and the order of the estimated signals.
In particular, the permutation ambiguity will be a serious problem in an audio BSS problem.
For audio mixture signals, simple ICA cannot separate the sources.
This is because the mixture of audio signals is not the multiplication of A but the convolution of mixing filters, which is due to the room reverberation.
To deconvolute the mixture, we apply short-time Fourier transform and convert signals to TF domain.
Since convolution in the time domain becomes multiplication in the TF domain, we can apply ICA and estimate the frequency-wise demixing matrix.
This method is called frequency-domain ICA, FDICA in short.
We apply ICA to each of frequencies separately.
Then, we estimate the demixing matrix Wi, where i is the index of frequencies and j is the index of time frames.
Optimization problem in FDICA is formulated like this, and p(y) is a source distribution in the TF domain.
Complex Laplace distribution, shown here, is often used for this assumption, and the minimization problem can be obtained like this.
However, FDICA encounters the serious problem, which is so-called the permutation problem.
In FDICA, simple ICA is performed in each frequency separately.
Therefore, the order of the estimated signals is messed up along the frequency axis.
Even if we completely separate the sources in each frequency, we have to align their order along the frequency axis.
Several permutation solvers have been proposed so far.
I here listed popular permutation solvers.
Before 2006, the permutation solver was a post-processing step (go back) as shown in this figure, which uses correlation between frequencies or direction of arrival.
Then, independent vector analysis, IVA, and independent low-rank matrix analysis, ILRMA, were proposed.
These methods are a unification of ICA and permutation solver.
From this slide, we review the important BSS algorithms, IVA and ILRMA, from the viewpoint of the TF structure models.
IVA is a multivariate extension of FDICA, namely, IVA utilizes sourcewise frequency vector as a random variable to unify all the frequency components in the estimation of ICA.
IVA assumes a joint distribution of all the frequency components as a source distribution p(s).
In addition, this distribution p(s) has an inner structure, a co-occurrence of all the frequency components.
This model is called the “spherical property” of a multivariate distribution; anyway, IVA assumes the co-occurrence of all the frequency components in the same source, which is depicted in this figure.
By the assumption of this TF structure for each source, Wi is estimated so that the permutation problem does not arise.
[10 min]
The question is how valid IVA’s TF structure model is.
I here showed the time-frequency powers of speech and vocal sources.
As you can see, typical audio sources have co-occurrence of all the frequencies when the source is active, and IVA’s assumption seems to be valid.
Also, this structure can be interpreted as group sparsity in the TF domain.
The optimization problem in IVA can be defined like this, and the joint distribution p enforces the previous TF structure by assuming the spherical distribution here.
For example, when we assume a spherical Laplace distribution, this model, the minimization problem in IVA becomes as shown in the bottom.
In the original IVA paper, this problem was optimized by a simple gradient descent, but
in 2011, an efficient update algorithm for IVA was proposed, which is called AuxIVA.
It provides elegant update rules called iterative projection, IP, and established convergence-guaranteed fast optimization without stepsize parameters.
This graph shows the value of cost function and the number of iterations.
AuxIVA sufficiently converges in fewer than 20 updates.
I play the sound demo of AuxIVA.
In 2016, we extended the TF structure model in IVA to a richer one.
IVA assumes the uniform co-occurrence of all the frequencies.
This can be considered as a rank-1 time-frequency structure, namely, frequency-uniform vector is activated along time axis.
As already shown, this model is valid for typical audio signals, but it may be too simple because audio sources have a harmonic frequency structure.
To represent more complicated TF structure, we proposed independent low-rank matrix analysis, ILRMA, which employs NMF modeling as a TF structure.
In ILRMA, the single uniform frequency vector in IVA is extended to multiple complicated vectors, and a more accurate spectrogram can be modeled as a low-rank matrix.
Such an accurate TF model will improve the estimation performance of the frequency-wise demixing matrices.
ILRMA assumes that each source has a low-rank TF structure, and the rank of the mixture spectrogram increases.
Thus, by enforcing the low-rankness of each estimated signal in the TF domain, the demixing matrix can avoid encountering the permutation problem, and richer TF structure model than IVA will improve the BSS performance.
[14 min]
The optimization problem in ILRMA is shown here.
We find Wi, and the NMF variables Tn and Vn that minimize this cost function.
(Click) The first and second terms of this function coincide with the cost function in NMF, (click) and the second and third terms coincide with the cost function in FDICA or IVA.
(Click) Thus, we can iterate the NMF update rules and the IP-based update of the demixing matrix.
This iteration guarantees the theoretical convergence.
This graph shows the behavior of the cost function value.
ILRMA converges in less than 100 iterations.
Let’s play the sample sounds.
This result is better than that of IVA.
[about 15 min]
Let’s move on to the main topic of this talk.
So far, we showed the cost functions of FDICA, IVA, and ILRMA, which are listed in this slide.
We can see that they have similar forms.
This is because
all of them are coming from the original ICA’s cost function, this one, and the difference is just an assumption of the source distribution p(Y), which is often called source generative model.
This generative model corresponds to the TF structure model for each source, and this model is necessary for avoiding the permutation problem.
Of course, better assumption of TF structures provides better BSS performance, but the suitable TF structure model depends on the type of sources, such as speech, music, harmonic source, percussive source, noise source, and so on.
Therefore, we have to search the best TF structure model with a try-and-error approach.
However, in the conventional methods, it is difficult to replace the TF structure model because we have to derive the optimization algorithm, which requires technical knowledge and math skills.
If we derive a general BSS algorithm, and if we can replace the TF structure model in a plug-and-play manner, it is very useful to search the best model for each problem.
So, to try various TF structure models in a “plug-and-play manner”, first, we reformulate the BSS problem in a more general form.
Then, we solve it using a TF-structure-independent algorithm.
[17 min]
This problem is our proposed generalized BSS problem, which includes FDICA, IVA, and ILRMA.
The function P(W, X) corresponds to the TF structure model we assume, which is often called the source model.
By replacing the function P, we can try various TF structure models.
The negative log-determinant term is coming from an original ICA theory.
We can interpret this function as a “barrier function” preventing Wi from becoming rank-deficient.
If Wi becomes a rank-deficient matrix, its determinant becomes zero, and this term becomes infinity.
So, we can avoid such solution in the optimization.
[18 min]
For the conventional BSS algorithm, the function P(W, X) corresponds to these functions, respectively.
FDICA corresponds to an L1-norm sparse regularizer, and IVA is an L2,1-norm group-sparse regularizer.
ILRMA is a little bit difficult, but still we can represent it using an argument minimum as shown here, where DIS is an Itakura-Saito divergence.
The objective of this reformulation is that if we change the TF structure model P, its optimization algorithm can easily be obtained.
This is because we want to establish a new BSS algorithm with plug-and-play TF structure models.
But the question is, how can we do that?
The idea is coming from a convex optimization field.
We utilize an algorithm called “primal-dual splitting method”.
In this algorithm, we need a proximity operator of the function P.
The function whose proximity operator can easily be calculated is called “proximable”.
So, if the TF structure model P is proximable, we can obtain the optimization algorithm for this generalized BSS problem.
Primal-dual splitting method considers this problem.
Minimize over the vector w the function g(w) + h(Lw), where L is just a matrix.
This minimization can be solved by this iterative optimization algorithm. This is a primal-dual splitting method.
In the first line, we calculate the proximity operator of the function g with this input.
Then, the second line calculates the new input z, and in the third line, we calculate the proximity operator of the function h with the input z.
By iterating these three steps, we can minimize this cost function.
Prox is a regularized minimization of the function f in the neighborhood of input x, which always has a unique solution.
We do not dive into the details of this algorithm in this overview, but you can refer to these papers to learn the theory of the method.
The important point is that we can use any function P, any TF structure, if the functions P are all proximable.
We just switch the proximity operator of P according to the recipe of well-known proximity operators of popular functions.
[21 min]
The goal is to convert this minimization function to the primal-dual-splitting-applicable form.
So, we convert the original function into this form.
As a first step, we rewrite the determinant of Wi in terms of the singular values sigma using this equation.
Next, we vectorize the demixing matrices Wi with this computation, where V is a linear operator converting a matrix Wi into a vector.
And we also define the inverse operation M, namely, M is a linear operator converting the vector w back into the matrices Wi.
By introducing the vectorization, we get this function. It's almost there.
Then, we define I(w) like this, and now we are ready to apply the primal-dual splitting method.
Now we have the same form as this original function.
In summary, we defined the general BSS algorithm as this minimization problem, and we can optimize this using a primal-dual splitting method.
The algorithm is shown here.
And we have a proximity operator of a new function I in this line.
I(w) is a sum of the logarithms of the singular values. The proximity operators of the logarithm function and of functions of singular values are well known.
Thus, we can easily obtain the proximity operator of I(w) as shown in the bottom of this slide.
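As a hedged sketch of that computation (function and variable names are illustrative): the prox is obtained by taking the SVD and applying the standard scalar prox of the negative logarithm to each singular value.

```python
import numpy as np

def prox_neg_logdet(W, tau):
    """Prox of -tau * sum_j log(sigma_j(W)), computed via the SVD of W.

    The scalar prox of -tau*log at y solves x**2 - y*x - tau = 0, giving
    x = (y + sqrt(y**2 + 4*tau)) / 2, applied to every singular value.
    """
    U, s, Vt = np.linalg.svd(W)
    s_new = (s + np.sqrt(s**2 + 4.0 * tau)) / 2.0
    return (U * s_new) @ Vt

# For the identity matrix and tau = 1, each singular value 1 is mapped to
# (1 + sqrt(5)) / 2, so the result is the golden ratio times the identity.
P = prox_neg_logdet(np.eye(2), 1.0)
```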
OK, let me see how IVA and ILRMA are defined in this BSS formulation.
The TF structure assumed in IVA is group sparseness, which can be defined as L2,1 norm of the estimated spectrogram Yn.
So, we replace the function P with the L2,1 norm, and we do not have to re-derive the algorithm.
The proximity operator of L2,1 norm is obtained like this, so we use this calculation in the third line of this algorithm.
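A minimal sketch of this group soft thresholding, assuming here that each row of the matrix is one group (in IVA the group would be the frequency vector of one source at one frame):

```python
import numpy as np

def prox_l21(Z, tau):
    """Prox of tau * ||Z||_{2,1}: group soft thresholding.

    Each row (group) is scaled by max(1 - tau / ||row||_2, 0), so rows
    with small L2 norm are set to exactly zero.
    """
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * Z
```

A row with norm 5 is shrunk by a factor 0.8 when tau = 1, while a row with norm below tau vanishes entirely, which is exactly the group-sparsity effect assumed in IVA.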
Next, ILRMA assumes the low-rank TF structure by applying NMF to the estimated spectrogram Yn.
Instead of NMF, we use a nuclear norm to represent the low-rank regularization.
Again, the proximity operator of the nuclear norm is well-known.
We can obtain the optimization algorithm by replacing the third line to this calculation.
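A minimal sketch of that well-known prox: soft thresholding applied to the singular values (names are illustrative):

```python
import numpy as np

def prox_nuclear(Z, tau):
    """Prox of tau * ||Z||_* (nuclear norm): singular value soft thresholding.

    Small singular values are set to zero, which lowers the rank of the
    result and thus induces the low-rank TF structure.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)
    return (U * s_thr) @ Vt
```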
From this, we can see that the proposed algorithm can handle various TF structures within a unified algorithm, which is very useful for searching for the best TF structure.
In addition, multiple TF structures can also be utilized.
For example, group sparse + sparse BSS can be defined like this function, which can be interpreted as a sparse IVA.
Since these functions are both proximable, we can obtain the optimization algorithm.
As another example, low-rank + sparse BSS can also be defined as sparse ILRMA like this problem.
As you can see, the important point is that, when you want to utilize a new TF structure model P, check whether P is proximable.
If P is proximable, you can use it in the proposed BSS algorithm in a plug-and-play manner.
This is a strong advantage of the proposed BSS.
These graphs show the BSS performance of two-speech mixtures with AuxIVA and various TF structures.
The vertical axis shows SDR improvements, which indicates the separation performance.
And the horizontal axis shows the number of iterations in each algorithm.
Since the group-sparse model is equivalent to the IVA model, it provides exactly the same performance at the converged point.
Low-rank model is similar to ILRMA, and group sparse + sparse model is a sparsity-induced IVA.
Also, low-rank + sparse is a sparse version of ILRMA.
Again, we can easily compare which TF structure model is the best for the speech source separation.
In this experiment, the low-rank + sparse model provides the best performance for both mixture samples.
Now we have extended the proposed BSS algorithm to a more explicit formulation; namely, we do not assume a function P but directly introduce a TF mask as the intended TF structure.
Let me explain this extension as a final topic of this talk.
It is known that the proximity operators of many sparsity-inducing functions are obtained as thresholding operators.
For example, the prox of the L1 norm is obtained like this, and this calculation is a soft thresholding of the input variable because the scaling term takes a value between 0 and 1.
The prox of the L2,1 norm also becomes a soft thresholding.
Since the input vector z includes the spectrograms of the estimated signals, this elementwise soft thresholding can be interpreted as time-frequency soft masking.
Namely, the calculation of the proximity operator in the third line of the algorithm is just the application of a TF soft mask defined by the intended TF model and the current optimization variable Z.
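A small sketch of this mask interpretation for the L1 norm, assuming complex-valued spectrogram entries (names are illustrative):

```python
import numpy as np

def l1_soft_mask(Z, tau):
    """TF soft mask equivalent to the prox of tau * ||.||_1.

    Each element is scaled by max(1 - tau / |z|, 0), a value in [0, 1],
    so applying the mask reproduces complex soft thresholding.
    """
    mag = np.maximum(np.abs(Z), 1e-12)
    return np.maximum(1.0 - tau / mag, 0.0)

Z = np.array([[2.0 + 0j, 0.5j], [-3.0 + 0j, 0.1 + 0j]])
mask = l1_soft_mask(Z, 1.0)
prox = mask * Z  # identical to elementwise complex soft thresholding
```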
This fact tells us that we don’t have to design a TF structure function P.
All we have to do is design a TF mask for the intended TF structure.
From this motivation, we proposed time-frequency-masking-based BSS, TFMBSS in short.
The difference between the previous general BSS and TFMBSS is shown here.
In the previous algorithm, we had to design the TF model function P, and we obtain its proximity operator.
In TFMBSS, we skip designing the function P, and we directly design the intended TF mask.
Therefore, we do not need to care about what kind of cost function is minimized in this algorithm.
This figure is a concept of TFMBSS.
We input TF masks as a TF structure model.
And the demixing matrix is optimized so that the estimated signals have the intended TF structures.
Let me introduce one application of TFMBSS.
We utilized a well-known music BSS algorithm called harmonic-percussive sound separation, HPSS, to accurately separate drum sounds and the other musical instruments.
In this method, we apply HPSS to the tentatively estimated signals Zharmonic and Zpercussive independently and produce the masks in a Wiener filtering manner.
These masks are input to TFMBSS as a TF structure model. This process is iterated until convergence, so in each iteration of TFMBSS, two HPSS runs are performed.
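The talk does not specify which HPSS implementations were used; as one hedged example, a common median-filtering HPSS (Fitzgerald-style) that produces Wiener-like masks could be sketched as follows, with the filter length k as an assumed parameter:

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_wiener_masks(S, k=17):
    """Median-filtering HPSS producing complementary Wiener-like masks.

    S: magnitude spectrogram (frequency x time). Median filtering along
    time keeps horizontal (harmonic) structure; filtering along frequency
    keeps vertical (percussive) structure.
    """
    H = median_filter(S, size=(1, k))  # smooth along time -> harmonic
    P = median_filter(S, size=(k, 1))  # smooth along frequency -> percussive
    eps = 1e-12
    Mh = H**2 / (H**2 + P**2 + eps)    # Wiener-style harmonic mask
    return Mh, 1.0 - Mh

# Synthetic check: a sustained tone (horizontal line) and a click
# (vertical line) should be assigned to the two masks respectively.
S = np.zeros((32, 64))
S[10, :] = 1.0    # sustained tone -> harmonic
S[:, 20] += 1.0   # click -> percussive
Mh, Mp = hpss_wiener_masks(S, k=17)
```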
This is a demonstration.
We utilized two types of HPSS.
Since HPSS is a single-channel nonlinear algorithm, artificial distortions may arise.
If we have a multichannel observation, we can use these HPSS methods within TFMBSS and achieve linear, distortionless separation.
The red cells are harmonic estimates, and the blue ones are the percussive estimates.
As you can see, TFMBSS provides better separation.