Audio Source Separation Based on Low-Rank Structure and Statistical Independence
1. Audio Source Separation Based on Low-Rank
Structure and Statistical Independence
The University of Tokyo
Research Associate
Daichi Kitamura
Nagoya University, Lecture
May 30, 2017
2. Introduction
• Daichi Kitamura (北村大地)
• Research Associate of The University of Tokyo
• Academic background
– Kagawa National College of Technology (2005 ~ 2012)
• B.S. in Engineering (March 2012)
– Nara Institute of Science and Technology (2012 ~ 2014)
• M.S. in Engineering (March 2014)
– SOKENDAI (2014 ~ 2017)
• Ph.D. in Informatics (March 2017)
• Research topics
– Media signal processing
– Audio source separation
2
3. Contents
• Research background
– Audio source separation and its applications
– Demonstration
• Structural modeling of audio sources
– Time-frequency representation
– Low-rank modeling of audio spectrogram
– Supervised audio source separation
• Statistical modeling between sources
– Blind audio source separation
– Audio distribution and central limit theorem
– Maximization of independence
• Conclusion and future works
3
5. • Audio source separation
– Signal processing
– Separation of speech, music sounds, background noise, …
– The cocktail party effect realized by a computer
Research background
5
7. Research background
7
• Application of audio source separation
– Hearing aid
• Easier conversation in loud environments
– Speech recognition systems
• Siri, Google search, Cortana, Amazon Echo, …
– Automatic music transcription
• Musical part separation (Vo., Gt., Ba., …)
– Remix of live-recorded music
• Professional use (improving quality), personal use (DJ remixing), …
9. Demonstration: music source separation
• Music source separation
9
[Audio: a mixture of guitar, vocal, and keyboard is separated into the three parts by source separation]
Listen carefully to the three parts in the mixture.
10. Contents
• Research background
– Audio source separation and its applications
– Demonstration
• Structural modeling of audio sources
– Time-frequency representation
– Low-rank modeling of audio spectrogram
– Supervised audio source separation
• Statistical modeling between sources
– Blind audio source separation
– Audio distribution and central limit theorem
– Maximization of independence
• Conclusion and future works
10
For monaural
signals
For stereo or
multichannel
signals
16. • Sparse (for both speech and music)
– Strong (yellow) components are fewer
– Weak (darker) components are dominant
• Continuous contour (only in speech)
– Spectrum continuously and dynamically changes
• Low rank (especially in music)
– Including similar patterns (similar timbres) many times
Structural properties
16
[Figure: spectrograms of speech and music]
18. • Low-rankness (simplicity of a matrix)
– can be measured by the cumulative singular value (CSV)
– Drums and guitar are quite low-rank
• Vocals and speech are also low-rank to some extent
– A music spectrogram can be modeled by few patterns
Comparison of low-rankness
18
[Figure: CSV curves with the 95% line; the number of bases when the CSV reaches 95% is 7, 29, and around 90 (spectrogram size: 1025 x 1883)]
19. Modeling technique of low-rank structures
• Nonnegative matrix factorization (NMF) [Lee, 1999]
– is a low-rank approximation using a limited number of bases
• Bases and their coefficients must be nonnegative
– can be applied to a power spectrogram
• Spectral patterns (typical timbres) and their time-varying gains
19
[Figure: nonnegative matrix (power spectrogram, frequency x time) ≈ basis matrix (spectral patterns) x activation matrix (time-varying gains); dimensions: # of frequency bins, # of time frames, # of bases]
20. • Parameter optimization in NMF
– Minimize a “similarity measure” between the observed matrix and the model
– An arbitrary similarity measure can be used
• Squared Euclidean distance, etc.
– No closed-form solution is known (still an open problem)
– Iterative calculation can minimize the measure
• Multiplicative update rules [Lee, 2000]
Modeling technique of low-rank structures
20
(for the case of the squared Euclidean distance)
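The multiplicative updates above can be sketched in a few lines of NumPy. This is a minimal illustration for the squared Euclidean distance only, not the exact implementation used in the lecture; the function name `nmf_euclidean` and all parameter choices are assumptions of this sketch.

```python
import numpy as np

def nmf_euclidean(X, K, n_iter=300, seed=0):
    """Low-rank approximation X ~= W @ H with nonnegative factors,
    using the multiplicative update rules of [Lee, 2000] for the
    squared Euclidean distance ||X - WH||_F^2."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, K)) + 1e-3   # spectral bases (F x K)
    H = rng.random((K, T)) + 1e-3   # activations    (K x T)
    eps = 1e-12                     # avoids division by zero
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H
        # stay nonnegative and the Euclidean cost never increases.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Applied to a power spectrogram, each column of `W` plays the role of a typical timbre and each row of `H` its time-varying gain.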
21. Modeling technique of low-rank structures
• Example
21
[Figure: spectrogram of Pf. and Cl. expressed as a superposition of rank-1 spectrograms]
22. Modeling technique of low-rank structures
• Example
– Pf. and Cl. are separated!
– Source separation based on NMF
• is a clustering problem of the obtained spectral bases in the basis matrix
– But how?
22
[Figure: bases grouped into Pf. and Cl.]
23. • If sourcewise training data is available:
• Supervised NMF [Smaragdis, 2007], [Kitamura, 2014]
Supervised audio source separation with NMF
23
Training stage: a spectral dictionary of Pf. is trained in advance (given)
Separation stage: the mixture is modeled by the given dictionary plus other bases; only the dictionary activations, the other bases, and their activations are optimized
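The separation stage can be sketched as an NMF in which the pre-trained dictionary stays fixed. This is a simplified illustration with the squared Euclidean distance; the cited supervised-NMF papers use other divergences and additional terms, and the function `supervised_nmf` and its variable names are assumptions of this example.

```python
import numpy as np

def supervised_nmf(X, F, K_other, n_iter=300, seed=0):
    """Separation-stage sketch: model the mixture as X ~= F @ G + H @ U,
    where the pre-trained dictionary F is given and kept fixed, and only
    the activations G, the other bases H, and their activations U are
    updated (multiplicative updates, squared Euclidean distance).
    Returns the target estimate F @ G and the residual estimate H @ U."""
    rng = np.random.default_rng(seed)
    Fdim, T = X.shape
    Kf = F.shape[1]
    G = rng.random((Kf, T)) + 1e-3       # activations of the dictionary
    H = rng.random((Fdim, K_other)) + 1e-3  # other (unsupervised) bases
    U = rng.random((K_other, T)) + 1e-3  # their activations
    eps = 1e-12
    for _ in range(n_iter):
        V = F @ G + H @ U
        G *= (F.T @ X) / (F.T @ V + eps)
        V = F @ G + H @ U
        U *= (H.T @ X) / (H.T @ V + eps)
        V = F @ G + H @ U
        H *= (X @ U.T) / (V @ U.T + eps)
    return F @ G, H @ U
```

Note that `F` itself is never touched, which is exactly what makes the method supervised: the target timbres are pinned to the training data.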
24. • Demonstration
– Stereo music separation with supervised NMF [Kitamura, 2015]
Supervised audio source separation with NMF
24
Original song
Training
sound of Pf.
Separated
sound (Pf.)
Training
sound of Ba.
Separated
sound (Ba.)
25. • Performance will be limited
– when the timbre of the training data differs greatly from the
target source in the mixture
Problem of supervised approach
25
[Figure: the training data is a slightly different Pf. than the target; spectra (amplitude in dB vs. frequency in kHz) of a real sound and an artificial MIDI sound show the difference of timbres]
[Audio: mixture (actual Pf. & Tb.) and the signal separated by supervised NMF using the artificial Pf. as training data]
26. • Supervised NMF with basis deformation [Kitamura, 2013]
– employs a deformation term to adaptively deform the pre-trained bases
Adaptive supervised audio source separation
26
Training stage: the dictionary is trained from a slightly different Pf. sound (given)
Separation stage: the deformation term (with positive and negative components) adapts the trained bases to the target
27. • Constraint on the deformation term
– The range of deformation is restricted (e.g., to ±30%)
– To avoid excess deformation of the trained bases
Adaptive supervised audio source separation
27
[Audio: mixture (actual Pf. & Tb.); signal separated by supervised NMF; signal separated by supervised NMF with basis deformation. The training data is the same artificial Pf. sound.]
29. Contents
• Research background
– Audio source separation and its applications
– Demonstration
• Structural modeling of audio sources
– Time-frequency representation
– Low-rank modeling of audio spectrogram
– Supervised audio source separation
• Statistical modeling between sources
– Blind audio source separation
– Audio distribution and central limit theorem
– Maximization of independence
• Conclusion and future works
29
For monaural
signals
For stereo or
multichannel
signals
30. Multichannel recording using a microphone array
• Number of microphones and sources
– Overdetermined situation (# of sources ≤ # of mics.)
– Underdetermined situation (# of sources > # of mics.)
• A priori information
– Training data of the sources, positions of the sources, room
geometry, music scores, etc.
– Blind source separation (BSS): without any a priori info. 30
[Figure: sources → mixing system → observed signals (microphone array) → demixing system → estimated signals; e.g., a stereo signal (2-ch) from a CD, or a monaural signal (1-ch) from one mic.]
31. BSS and independent component analysis
• Blind source separation (BSS)
– Estimate demixing system without any prior information
about the mixing system
• Typical BSS is based on statistical independence
• Independent component analysis (ICA) [Comon, 1994]
– How can statistical independence be measured?
– Define a “distribution of audio signals”
– Find the demixing system that maximizes independence
31
[Figure: mixing system and demixing system]
32. What is the distribution of audio signals?
• Distribution of a speech waveform
32
– Spikier and heavier-tailed than the Gaussian (normal) distribution
[Figure: speech waveform and its amplitude histogram, compared with a Gaussian distribution]
33. What is the distribution of audio signals?
• Distribution of a piano waveform
33
– Spikier and heavier-tailed than the Gaussian distribution
[Figure: piano waveform and its amplitude histogram, compared with a Laplace distribution]
34. What is the distribution of audio signals?
• Distribution of a drums waveform
34
– Spikier and heavier-tailed than the Gaussian distribution
[Figure: drums waveform and its amplitude histogram, compared with a Cauchy distribution]
35. Central limit theorem
35
• Audio source distributions are basically non-Gaussian
– But we still don’t know the true source distribution
• How can we model them for source separation?
• Central limit theorem
– “A sum of (almost) any kind of random variables approaches
a Gaussian distribution.”*
• Can’t believe it? Let’s see
[Figure: random variables generated from a Laplace distribution and a uniform distribution; the distribution of their sum approaches a Gaussian distribution]
* Several r.v.s do not obey this, e.g., the Cauchy r.v.
36. Central limit theorem
36
• x1 is the pips of the first die, and x2 is the pips of the second die
– x1, x2 ∈ {1, 2, …, 6}
– The probability of each value is always 1/6
• Results of 1 million trials for each die
– What about x1 + x2?
[Figure: histograms of x1 and x2 (both uniform)]
37. Central limit theorem
37
• x1 is the pips of the first die, and x2 is the pips of the second die
– x1, x2 ∈ {1, 2, …, 6}
– The probability of each value is always 1/6
• Results of 1 million trials for each die
– What about x1 + x2?
[Figure: histogram of x1 + x2]
Not a uniform distribution any more
38. Central limit theorem
38
• x1 is the pips of the first die, and x2 is the pips of the second die
– x1, x2 ∈ {1, 2, …, 6}
– The probability of each value is always 1/6
• Results of 1 million trials for each die
[Figure: histograms of the sums of more dice]
39. Central limit theorem
39
• x1 is the pips of the first die, and x2 is the pips of the second die
– x1, x2 ∈ {1, 2, …, 6}
– The probability of each value is always 1/6
• Results of 1 million trials for each die
– Approaches a Gaussian distribution (central limit theorem)
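The dice experiment above is easy to reproduce. A minimal NumPy simulation (100,000 trials here for speed, rather than the 1 million used on the slides):

```python
import numpy as np

# x1 and x2 are the pips of two fair dice, each uniform on {1, ..., 6}.
# Their sum is no longer uniform (it peaks at 7), and sums of more and
# more dice approach a Gaussian distribution (central limit theorem).
rng = np.random.default_rng(0)
n_trials = 100_000

x1 = rng.integers(1, 7, size=n_trials)   # first die
x2 = rng.integers(1, 7, size=n_trials)   # second die
s2 = x1 + x2                             # triangular histogram, mode 7

# Sum of 30 dice: the histogram is already very close to Gaussian.
s30 = rng.integers(1, 7, size=(30, n_trials)).sum(axis=0)
```

Plotting histograms of `x1`, `s2`, and `s30` reproduces the progression shown on the slides: uniform, triangular, then bell-shaped.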
40. Central limit theorem in audio signals
40
• x_n is the n-th speaker’s signal
– around 3.3 s long
[Figure: waveforms of two speakers’ signals and their amplitude histograms]
41. Central limit theorem in audio signals
41
• x_n is the n-th speaker’s signal
– around 3.3 s long
[Figure: waveform of the sum of the speakers’ signals and its amplitude histogram]
42. Central limit theorem in audio signals
42
• x_n is the n-th speaker’s signal
– around 3.3 s long
[Figure: waveforms of further speakers’ signals and their amplitude histograms]
43. • x_n is the n-th speaker’s signal
– around 3.3 s long
Central limit theorem in audio signals
43
[Figure: waveform of the sum and its amplitude histogram]
44. • x_n is the n-th speaker’s signal
– around 3.3 s long
Central limit theorem in audio signals
44
[Figure: waveform of the sum of many speakers’ signals and its amplitude histogram]
Almost a Gaussian dist. (central limit theorem)
45. Principle of ICA
45
• What we can say from the central limit theorem
– A Gaussian distribution is the limit of a mixture of sources
– If we maximize the non-Gaussianity of all signals,
the signals will return to the original sources before mixing
Basic principle of ICA:
maximizing non-Gaussianity
More generally:
maximizing independence between components
[Figure: mixing approaches a Gaussian (central limit theorem); separation departs from a Gaussian (ICA)]
46. Principle of ICA
• Assumptions in ICA
– 1. Sources are mutually independent
– 2. Each source distribution is non-Gaussian
– 3. The mixing system is invertible and time-invariant
46
[Figure: sources (latent components; mutually independent and non-Gaussian) → mixing matrix (invertible and time-invariant) → mixtures (observed signals); demixing uses the inverse matrix]
47. Principle of ICA
• Uncertainty in ICA
– 1. The signal scale (volume) cannot be determined
– 2. The signal permutation cannot be determined
47
[Figure: sources (latent components) → mixtures (observed signals) → separated signals (estimated by ICA); the scales and the order of the separated signals may differ from those of the sources]
48. • Estimation in ICA
– Maximize independence between the source distributions
– by maximizing a log-likelihood function of the demixing matrix
Principle of ICA
48
[Figure: minimize the distance between the distribution of the separated signals and a non-Gaussian source distribution]
Generally, the source model is set to an appropriate non-Gaussian distribution
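The estimation principle can be sketched with a compact FastICA-style iteration, which maximizes non-Gaussianity with a tanh nonlinearity after whitening. This is an illustrative stand-in for the likelihood maximization on the slide, not the lecture's exact algorithm; `fastica` and all its parameters are assumptions of this sketch.

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Separate N instantaneously mixed signals X (N x T) by maximizing
    non-Gaussianity (symmetric FastICA with the tanh nonlinearity).
    The sources are only assumed to be independent and non-Gaussian;
    scale and order of the outputs are arbitrary (the ICA ambiguities)."""
    rng = np.random.default_rng(seed)
    N, T = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    # Whitening: rotate and rescale so the mixtures are uncorrelated
    d, E = np.linalg.eigh(Xc @ Xc.T / T)
    Z = E @ np.diag(d ** -0.5) @ E.T @ Xc
    W = rng.standard_normal((N, N))
    for _ in range(n_iter):
        Y = W @ Z
        g = np.tanh(Y)
        g_prime = 1.0 - g ** 2
        # Fixed-point update: w <- E[z g(w^T z)] - E[g'(w^T z)] w
        W = g @ Z.T / T - np.diag(g_prime.mean(axis=1)) @ W
        # Symmetric decorrelation keeps the rows orthonormal
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt
    return W @ Z
```

Feeding it two mixed deterministic test signals (e.g., a sinusoid and a sawtooth) recovers them up to scale and permutation, exactly the two uncertainties listed on slide 47.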
49. • Audio mixtures in an actual environment
– Convolutive mixture with reverberation
• E.g., an office room has a reverberation time of about 300 ms; a concert hall, more than 2000 ms
– Each mixing coefficient becomes a mixing filter
• How to deconvolve them?
– 1. Estimate the deconvolution filter in the time domain
• At 16 kHz sampling, a 300 ms filter includes 4800 taps
• The number of parameters to estimate explodes
– 2. Estimate the demixing coefficients in the frequency domain
• A frequency-wise demixing matrix is estimated by ICA
• but this encounters the permutation problem
ICA-based separation of reverberant mixtures
49
[Figure: instantaneous mixture vs. convolutive mixture; the reverberation length equals the length of the convolution filter]
50. ICA-based separation of reverberant mixtures
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
– Apply simple ICA to each frequency bin
50
[Figure: the spectrogram (frequency bins x time frames) is separated by applying ICA independently in each frequency bin, estimating a frequency-wise demixing matrix (the inverse of the frequency-wise mixing matrix)]
51. ICA-based separation of reverberant mixtures
51
• Permutation problem in frequency-domain ICA
– The order of the separated signals in each frequency is messed up*
– An alignment has to be taken across frequencies
* The scales are also messed up, but they can be easily fixed.
[Figure: ICA in all frequencies, followed by a permutation solver, turns mixtures 1 and 2 into separated signals 1 and 2]
52. ICA-based separation of reverberant mixture
• Popular permutation solvers
– Based on direction of arrival (DOA)
• Frequency-domain ICA + DOA alignment [Saruwatari, 2006]
– Based on a relative correlation among frequencies
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
– Based on a low-rank modeling of each source
• Independent low-rank matrix analysis (ILRMA) [Kitamura, 2016]
• Demonstration of BSS using ILRMA
– http://d-kitamura.net/en/demo_rank1_en.htm
52
53. Contents
• Research background
– Audio source separation and its applications
– Demonstration
• Structural modeling of audio sources
– Time-frequency representation
– Low-rank modeling of audio spectrogram
– Supervised audio source separation
• Statistical modeling between sources
– Blind audio source separation
– Audio distribution and central limit theorem
– Maximization of independence
• Conclusion and future works
53
54. Conclusions and future works
• Audio source separation based on
– Low-rank structure
• Nonnegative matrix factorization
– Statistical independence
• Blind source separation
• For further improvement
– Separation based on training with huge datasets
• Deep learning, denoising autoencoders, etc.
• But every recording condition is just one of a kind
– Informed source separation
• Music scores could be powerful information
• The user can guide the system, leading to more accurate separation
• Performance is still insufficient
– Almost there? Not at all! Make our life better. That’s engineering.
54