1. RAMIN ANUSHIRAVANI
ECE 551
FALL 2014
Sound Source Localization
with Microphone Arrays
2. Outline
Background
Application
Human Sound Localization
Time Delay Model
Beamforming
Signal Model
Criteria
Microphone Arrays
Uniform Linear Array (ULA)
Beampattern
Spatial Aliasing
Sound Source Localization
Conventional Beamforming
MUSIC algorithm
Results
3. Application
Why is localizing a sound source useful?
Improving Speech
recognition
Speech Enhancement
Hearing aids
Audio Surveillance
Teleconferencing
Spatial Audio
Background
4. How do we localize sound?
Interaural time
difference - ITD
Interaural level
difference - ILD
Spectral information
5. A Time Delay Model
Far Field Assumption
τ = d sin(θ) / c
where
τ : time delay between the two sensors
d : distance between the two sensors
c : speed of sound
θ : angle of arrival
A delay by τ in the time domain, x(t − τ), corresponds to multiplication by exp(−jωτ) in the frequency domain.
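As a quick sanity check, the far-field delay formula can be evaluated directly; the spacing and angle below are illustrative, not from the slides:

```python
import numpy as np

# Far-field time delay between two sensors: tau = d * sin(theta) / c.
c = 343.0                  # speed of sound in air (m/s)
d = 0.22                   # sensor spacing (m), like two ears ~22 cm apart
theta = np.deg2rad(30.0)   # angle of arrival

tau = d * np.sin(theta) / c
print(f"delay: {tau * 1e6:.1f} microseconds")
```

At broadside (θ = 0) the delay vanishes; it is largest for endfire arrival (θ = ±90°).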
6. Delay the reference signal until the energy of the sum of the two signals is at its maximum (or undo the delay from the delayed signal).
Example
For x(n) = ref(n) and y(n) = delayed(n):
true delay = arg max_m || ref(n + m) + delayed(n) || = arg max_m C_xy[m] = arg max_m Σ_n delayed*(n) ref(n + m)
m (samples) -> τ (seconds) = d sin(θ) / c
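The arg-max of the cross-correlation C_xy[m] can be sketched in a few lines of NumPy; the signal, sample rate, and true delay below are made up for illustration:

```python
import numpy as np

# Estimate the integer-sample delay between a reference signal and a
# delayed copy by locating the peak of the cross-correlation C_xy[m].
fs = 16000                         # sample rate (Hz), illustrative
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
true_delay = 25                    # samples
delayed = np.roll(ref, true_delay) # circularly delayed copy

cxy = np.correlate(delayed, ref, mode="full")   # lags -(N-1) .. (N-1)
lags = np.arange(-len(ref) + 1, len(ref))
m = lags[np.argmax(cxy)]           # peak lag = estimated delay in samples

tau = m / fs                       # convert samples -> seconds
print("estimated delay:", m, "samples =", tau, "s")
```

The estimated τ then maps back to an angle via τ = d sin(θ)/c.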
8. Beamforming
Spatial Filtering
Detect and estimate signals from the output of a sensor array.
Types
Fixed vs Adaptive Beamformer
Delay and Sum (Filter and Sum)
MVDR (Capon)
Narrowband vs Broadband Beamformer
Z(k) = W^H Y(k)
(Block diagram: a source recorded at the mics as Y1(k) and Y2(k), filtered and summed into the output Z(k).)
Beamforming [S P. Boyd]
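The narrowband relation Z(k) = W^H Y(k) can be sketched for a delay-and-sum beamformer; the geometry, frequency, and look direction below are assumptions for illustration:

```python
import numpy as np

# Frequency-domain beamformer output Z(k) = w^H Y(k).
# For delay-and-sum, w is the (normalized) steering vector itself.
c, d, M = 343.0, 0.05, 4          # speed of sound, spacing (m), mics
f = 1000.0                        # one narrowband frequency bin (Hz)
theta = np.deg2rad(20.0)          # look direction

taus = np.arange(M) * d * np.sin(theta) / c      # per-mic delays
w = np.exp(-1j * 2 * np.pi * f * taus) / M       # delay-and-sum weights

# A unit source arriving exactly from the look direction:
Y = np.exp(-1j * 2 * np.pi * f * taus)           # mic spectra at bin f
Z = w.conj().T @ Y                               # beamformer output
print(abs(Z))                                    # unit (distortionless) gain
```

Signals from other directions are attenuated because their phase terms no longer cancel against the weights.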
9. Signal Model
y_n(t) = g_n(t) ∗ s(t) + v_n(t) = x_n(t) + v_n(t)
y : received signal at each microphone
g : spatial response corresponding to the source location
s : source signal
v : noise
In the frequency domain,
Y_n(f) = G_n(f) S(f) + V_n(f) = X_n(f) + V_n(f) = d(f) X_1(f) + V(f)
where
d : steering (direction) vector
X_1 : recorded signal at the first (reference) microphone
For simplicity we assume x and v are uncorrelated.
11. Beamforming Criteria
Signal to Noise Ratio (SNR)
Array Gain
Output SNR over the input SNR.
Noise Rejection
Amount of noise rejected by the beamformer.
Beampattern
Represents the response of the beamformer to an arbitrary input signal as a function of the steering vectors (microphone array impulse response).
Beamforming [Benesty et al.]
14. ULA
Collecting signal from a source with a microphone array where the spacing between each element is Δ.
Signal received at the m-th microphone:
x_m(t) = Σ_{i=1}^{d} s_i(t) e^{j(m−1)μ_i} + n_m(t)
where μ_i = −(2π/λ) Δ sin(θ_i) is the spatial frequency.
In matrix form: x(t) = A s(t) + n(t)
The columns of A are steering vectors based on a time delay model for one frequency (narrowband).
Microphone Array [Bhuiy et al.]
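A minimal sketch of building the steering matrix A from the spatial frequencies μ_i; the spacing, wavelength, and source angles are chosen for illustration:

```python
import numpy as np

# Build the ULA steering matrix A: column i is the array response
# e^{j(m-1)mu_i} for source angle theta_i, with
# mu_i = -(2*pi/lambda) * Delta * sin(theta_i).
M = 4                                # microphones
Delta = 0.04                         # element spacing (m)
lam = 343.0 / 2000.0                 # wavelength at 2 kHz (m)
thetas = np.deg2rad([-25.0, 15.0])   # source angles

m = np.arange(M)[:, None]            # sensor index 0..M-1 as a column
mu = -(2 * np.pi / lam) * Delta * np.sin(thetas)   # spatial frequencies
A = np.exp(1j * m * mu[None, :])     # shape (M, num_sources)
print(A.shape)                       # (4, 2)
```

The first row is all ones because the first sensor is the phase reference (m − 1 = 0).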
15. Beampattern
e^{−iωt} = e^{−i2πft} = e^{−i2π (kS/p) d sin(θ)/c}
where
k : discrete frequency bin
S : sampling rate
p : number of DFT samples
We can visualize the steering vectors by plotting them over all angles, for an arbitrary input and any number of microphones.
Steering vectors: [e^{−iω·0} ⋯ e^{−iω·0}] for the reference, [e^{−iωτ_1} ⋯ e^{−iωτ_n}] delayed by τ_i, spanning all angles and frequencies.
Add and normalize by the number of microphones for some arbitrary input.
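One way to visualize this: evaluate |w^H a(θ)| over all angles for delay-and-sum weights steered to broadside. The array geometry and frequency here are illustrative assumptions:

```python
import numpy as np

# Beampattern of a delay-and-sum beamformer steered to broadside,
# evaluated over all arrival angles at a single frequency.
c, d, M, f = 343.0, 0.05, 4, 2000.0
look = 0.0                               # steering angle (rad), broadside

def a(theta):                            # steering vector at angle theta
    taus = np.arange(M) * d * np.sin(theta) / c
    return np.exp(-1j * 2 * np.pi * f * taus)

w = a(look) / M                          # normalized delay-and-sum weights
thetas = np.deg2rad(np.linspace(-90, 90, 181))
B = np.array([abs(w.conj() @ a(t)) for t in thetas])
print(B.max())                           # 1.0, at the look direction
```

Plotting B against angle shows the main lobe at the look direction and sidelobes elsewhere; repeating this at several frequencies exposes grating lobes.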
16. ITD Polar Pattern
(Polar plots at 1000 Hz and 4000 Hz for two mics 22 cm apart and two mics 2 cm apart, showing the main lobe and the grating lobes caused by spatial aliasing.)
18. Spatial Aliasing
Aliasing
“If the bandwidth of the signal exceeds half of the
sampling frequency, the spectral replicas overlap,
leading to a distortion in the observed spectrum.”
Spatial Aliasing
“The spacing between adjacent microphone elements should
be less than half of the wavelength corresponding to the
highest temporal frequency of interest.”
Microphone Arrays [J. P. Dmochowski et al.]
19. Spatial aliasing occurs when the distance between adjacent microphones exceeds λ/2, where λ = speed of sound / frequency.
Spatial aliasing frequency > 1600 Hz.
(Beampattern: omni response at very low frequencies, a main lobe at mid frequencies, and grating lobes once aliasing sets in.)
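The spatial-Nyquist threshold c/(2d) can be computed for the two spacings used in the polar-pattern examples; c is taken as 343 m/s here, and the exact threshold depends on the assumed spacing and speed of sound:

```python
import numpy as np

# Spatial Nyquist condition: spacing d must be < lambda/2, so grating
# lobes (spatial aliasing) appear above f = c / (2 * d).
c = 343.0
for d in (0.22, 0.02):                 # 22 cm and 2 cm mic spacings
    f_alias = c / (2 * d)
    print(f"d = {d * 100:.0f} cm -> grating lobes above {f_alias:.0f} Hz")
```

The widely spaced pair aliases well inside the speech band, while the 2 cm pair stays alias-free up to several kHz.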
20. Sound Source Localization
Using beamforming and subspace methods to
localize a sound source.
Delay and Sum - Classical Beamformer
Capon – Minimum Variance Distortionless Response (MVDR)
beamformer
Multiple signal classification (MUSIC) - A Subspace Algorithm
22. Minimum Variance Distortionless Response (MVDR)
A delay and sum beamformer with an additional constraint on the output power:
w_MVDR = arg min_w (w^H R_xx w)  s.t.  w^H A(θ) = 1
Constrain the look-direction gain to be g(φ_i) = 1 and minimize the output power of the beamformer.
Source Localization [J. Capon]
23. Minimum Variance Distortionless Response
w_MVDR = arg min_w (w^H R_xx w)  s.t.  w^H A(θ) = 1
This leads to the Lagrangian
J(w, λ) = w^H R_xx w + λ (w^H A(θ) − 1)(A(θ)^H w − 1)
After having lots of fun it turns out that
w_MVDR(θ) = R_xx^{−1} A(θ) / (A(θ)^H R_xx^{−1} A(θ))
P_MVDR(θ) = 1 / (A(θ)^H R_xx^{−1} A(θ))
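A minimal sketch of scanning P_MVDR(θ) = 1 / (A(θ)^H R_xx^{−1} A(θ)) over a grid of angles, using synthetic narrowband snapshots with one source at 15 degrees; all parameters here are illustrative:

```python
import numpy as np

# MVDR spatial spectrum scanned over candidate angles.
rng = np.random.default_rng(1)
c, d, M, f = 343.0, 0.04, 6, 2000.0

def steer(theta):                       # narrowband steering vector
    taus = np.arange(M) * d * np.sin(theta) / c
    return np.exp(-1j * 2 * np.pi * f * taus)

snaps = 200                             # number of snapshots
s = rng.standard_normal(snaps) + 1j * rng.standard_normal(snaps)
X = np.outer(steer(np.deg2rad(15.0)), s)          # source at 15 degrees
X += 0.1 * (rng.standard_normal((M, snaps))
            + 1j * rng.standard_normal((M, snaps)))  # sensor noise
R = X @ X.conj().T / snaps              # sample covariance matrix

grid = np.deg2rad(np.arange(-90, 91))
Rinv = np.linalg.inv(R)
P = np.array([1.0 / np.real(steer(t).conj() @ Rinv @ steer(t))
              for t in grid])
print("peak at", np.degrees(grid[np.argmax(P)]), "deg")  # near 15
```

Note that R must be invertible, which is why MVDR needs at least as many snapshots as sensors.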
25. MUSIC
R_xx = [U_s U_n] diag(λ_1, …, λ_M) [U_s U_n]^H
where
U_s : signal subspace
U_n : noise subspace
and λ_1 > λ_2 > ⋯ > λ_M.
span(U_s) = span(A(θ))
MUSIC uses the orthogonality between the noise subspace and the steering vectors:
U_n ⊥ A(θ)  =>  U_n^H A(θ) = 0.
You need to know how many sources you have.
26. MUSIC
The MUSIC pseudo-spectrum is defined as
P_MUSIC(θ) = 1 / ||U_n^H A(θ)||² = 1 / (A(θ)^H U_n U_n^H A(θ))
The MUSIC spatial spectrum is defined as
P_MUSIC(θ) = (A(θ)^H A(θ)) / (A(θ)^H U_n U_n^H A(θ))
=> MUSIC measures the orthogonality between the steering vectors of the array and the noise subspace. The poles of this expression point to the direction of the signal source.
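A sketch of the MUSIC pseudo-spectrum on synthetic data matching the experiment's setup (two sources at −25 and 15 degrees, four microphones); the spacing, frequency, and noise level are assumptions:

```python
import numpy as np

# MUSIC pseudo-spectrum P(theta) = 1 / ||U_n^H a(theta)||^2, with the
# noise subspace U_n taken from the eigendecomposition of R_xx.
rng = np.random.default_rng(2)
c, d, M, f = 343.0, 0.04, 4, 2000.0

def steer(theta):                       # narrowband steering vector
    taus = np.arange(M) * d * np.sin(theta) / c
    return np.exp(-1j * 2 * np.pi * f * taus)

angles_true = np.deg2rad([-25.0, 15.0])
snaps = 400
A = np.column_stack([steer(t) for t in angles_true])
S = rng.standard_normal((2, snaps)) + 1j * rng.standard_normal((2, snaps))
X = A @ S + 0.1 * (rng.standard_normal((M, snaps))
                   + 1j * rng.standard_normal((M, snaps)))
R = X @ X.conj().T / snaps              # sample covariance

w, V = np.linalg.eigh(R)                # eigenvalues in ascending order
Un = V[:, : M - 2]                      # noise subspace: M - (num sources)
grid = np.deg2rad(np.arange(-90, 91))
P = np.array([1.0 / np.linalg.norm(Un.conj().T @ steer(t)) ** 2
              for t in grid])
# The two sharp peaks of P should sit near -25 and 15 degrees.
```

Note that taking M − 2 noise eigenvectors hard-codes the known number of sources, which is exactly the prior MUSIC requires.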
29. Results
2 sources at 15 and −25 degrees
4 microphones
(Figure: 2 sources, 4 microphones)
30. Results
Localization accuracy [Bhuiya et al.]:

RMSE                 Delay and Sum   MVDR     MUSIC
Accuracy, 2 sources  0.7035          0.1012   0.0851
Accuracy, 4 sources  0.4992          0.4990   0.1903

Which one is more robust to noise?
Which one is more robust to reverberation?
Which one gives a higher SNR for enhancing speech?
…etc.

RMSE = sqrt( (1/K) Σ_{k=1}^{K} (Θ_est,k − Θ_true,k)² )
K : number of audio blocks (groups of frames)
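The RMSE formula can be evaluated directly; the per-block estimates below are made up for illustration:

```python
import numpy as np

# RMSE over K audio blocks: sqrt(mean((theta_est - theta_true)^2)).
theta_true = np.array([15.0, 15.0, -25.0, -25.0])   # degrees, per block
theta_est  = np.array([14.8, 15.3, -24.6, -25.2])   # hypothetical estimates
rmse = np.sqrt(np.mean((theta_est - theta_true) ** 2))
print(f"RMSE = {rmse:.3f} deg")
```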
31. Citations
J. Benesty and J. P. Dmochowski, Microphone Arrays: Fundamental Concepts, Springer.
F. Bhuiya and M. Islam, "Analysis of Direction of Arrival Techniques Using Uniform Linear Array," International Journal of Computer Theory and Engineering.
J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proc. IEEE, 57(8), 1408–1418 (1969).
R. Kawitkar, "Performance of Different Types of Array Structures Based on Multiple Signal Classification (MUSIC) Algorithm," International Conference on MEMS, NANO, and Smart Systems.
I. Richter, "Spatial Filtering and DoA Estimation: MVDR Beamformer and MUSIC Algorithm," Sensor Array Signal Processing.
S. P. Boyd et al., "Robust Minimum Variance Beamforming."
Speaker notes
Need at least two microphones to localize sounds; two ears, 22 cm apart.
Simulate your head with two microphones and no head!
appendix
Narrowband beamformer.
Fixed. Brain and detection. Remove.
We're trying to find filters that would increase this SNR, say in a speech enhancement application. A measure of beamformer goodness.
A series of microphones whose geometry is given. Derive the spatial responses for the microphones and do fun applications like speech enhancement, noise reduction, dereverberation, signal estimation, source localization, etc. Beamforming is a strong tool in array signal processing.
The idea is to "steer" the array in one direction at a time and measure the output power. The steering direction which coincides with the DOA of a signal and results in a maximum output power yields the DOA estimate.
You can also calibrate a microphone array by playing an MLS sequence.
Formally, the beampattern is defined as the ratio of the variance of the beamformer output when the source impinges with a steering vector d(f) to the variance of the desired signal x1(t).
Too many; octave frequencies 1, 2, 4, 8 kHz.
Produced by delay-and-sum beamforming.
With two mics you can't distinguish back from front, since the time delays are the same.
J. P. Dmochowski and J. Benesty
In order to reconstruct a spatial sinusoid from a set of uniformly-spaced discrete spatial samples, the spatial sampling period must be less than half of the sinusoid’s wavelength.
Super low frequencies: omni response.
Narrow beam at zero degrees => good.
But high energy at all these other angles; cannot distinguish the difference.
Of course we almost never hear one frequency, but a wide range of frequencies.
So the conclusion here is that there is more to sound localization besides ITD.
Also the Bartlett method. A(θ) is defined as the steering vector with a scanning angle θ.
The idea is to scan across the angular region of interest. In speech enhancement you can fix the angle, form a beam toward it, and capture the desired signals; in sound source localization, you scan over all angles and look for where the power is maximum.
Take the gradient of J with respect to w and λ and use the constraint on power. MVDR requires a good estimate of the covariance matrix. There have to be at least as many observations as sensors in the array.
Eigenvalue decomposition.
For the true covariance matrix this is approximately zero; it corresponds to the smallest eigenvalues.
We need at least one noise-subspace dimension, which requires one more sensor than the number of sources: we can resolve M−1 sources with M sensors.
For this first experiment I used the far most right and left channels; I recorded at about 75 degrees.
Some problems with using all 4 mics.
Speech enhancement and reverberant conditions would be better with MUSIC: a much narrower beam.
I used 4 microphones and two sources, one speech and one a loud fan. Only MUSIC was able to identify both sources; the others smear the two sources into one. One is at −25 and the other at 15 degrees.
I'm only looking at the angle location and how narrow the beam is; one should also look at noise and reverberation when localizing sound sources.