SlideShare una empresa de Scribd logo
1 de 269
Descargar para leer sin conexión
Introduction to Estimation Theory
Bayesian (Random) Parameter
Estimation Nonrandom Parameter Estimation
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.1/149
In estimation problem we assign a cost to all pairs [a, â(r)]
over the range of interest.
In many cases of interest it is realistic to assume that the
cost depends only on the error:
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.2/149
aǫ(r) = (â(r) − a)2
aǫ(r) = |â(r) − a|
aǫ(r) =

0, |aǫ| 6 ∆
2
0, |aǫ|  ∆
2
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.3/149
Our goal is to find an estimate that minimizes the expected
value of the cost
R = E {c [a, â(r)]} =
Z ∞
−∞
Z ∞
−∞
c [a, â(r)] p(a, r) drda
R is the risk involved in doing the estimation of a out of
observation(s) r.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.4/149
RRMS =
Z ∞
−∞
Z ∞
−∞
(â(r) − a)2
p(a, r) drda
=
Z ∞
−∞
drp(r)
Z ∞
−∞
(â(r) − a)2
p(a|r)da
Because p(r)  0 we can minimize the inner integral:
d
dâ
Z ∞
−∞
(â(r) − a)2
p(a|r) da

= −2
Z ∞
−∞
ap(a|r) da
+2â(r)
Z ∞
−∞
p(a|r) da = 0 ⇒
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.5/149
Then, the mean square estimate is represented as:
âRMS(r) =
Z ∞
−∞
a p(a|r)da
We have seen this before as the conditional mean.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.6/149
d
dâ

−2
Z ∞
−∞
ap(a|r)da + 2â(r)
Z ∞
−∞
p(a|r) da

= 2  0
Because the second derivative is positive âRMS is the mini-
mum.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.7/149
The Bayes estimate for the absolute value criterion:
Rabs =
Z ∞
−∞
Z ∞
−∞
|â(r) − a| p(a, r) drda
=
Z ∞
−∞
drp(r)
Z ∞
−∞
|â(r) − a| p(a|r)da
The inner integral:
I(r) =
Z â(r)
−∞
(â(r) − a) p(a|r)da +
Z ∞
â(r)
(a − â(r)) p(a|r)da ⇒
d
dâ(r)
I(r) = 0 ⇒
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.8/149
Z â(r)
−∞
(â(r) − a) p(a|r)da =
Z ∞
â(r)
(a − â(r)) p(a|r)da ⇒
This is the definition for the median. The absolute error
criterion leads to the determination of estimate of a out of
the median of the observation(s) r.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.9/149
Runif =
Z ∞
−∞
drp(r)

1 −
Z âunif(r)+ ∆
2
âunif(r)− ∆
2
p(a|r)da
#
Minimizing Runif amounts to maximizing
Z âunif(r)+ ∆
2
âunif(r)− ∆
2
p(a|r)da ⇒
âunif(r) occurs where p(a|r)|â(r)=a is maximum.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.10/149
This is MAP. A necessary, but not sufficient condition for a
Max. is
d
da
{ln p(a|r)}
âMAP=a
= 0
p(a|r) =
p(r|a)p(a)
p(r)
⇒
ln p(a|r) = ln p(r|a) + ln p(a) − ln p(r)
| {z }
not a
function of a
⇒
max {ln p(a|r)} ≡ max {ln p(r|a) + ln p(a)}
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.11/149
Example
ri = a+ni, i = 1, 2, · · · , N, a ∼ N(0, σa), ni ∼ N(0, σn) ⇒
p(r|a) = ΠN
i=1
1
σn
√
2π
exp

−
(ri − a)2
2σ2
n

p(a) =
1
σa
√
2π
exp

−
a2
2σ2
a

We need to compute
R ∞
−∞
ap(a|r)da, one approach could be
p(a|r) = p(a)p(r|a)/p(r) but this is tedious. However, one
can observe that p(a|r) is PDF then:
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.12/149
p(a|r) =
1
p(r)
(
1
(2π)N/2σN
n
1
√
2πσa
exp

−
PN
i=1(ri − a)2
2σ2
n
#
exp

−
a2
2σ2
a
)
p(a|r) = k(r) exp



−
1
2σ2
p

a −
σ2
a
σ2
a + σ2
n/N
1
N
N
X
i=1
ri
!#2



σ2
p =

1
σ2
a
+
N
σ2
n
−1
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.13/149
We see that p(a|r) is Gaussian, then: âMS(r) is the
conditional mean:
âMS(r) =
σ2
a
σ2
a + σ2
n/N
1
N
N
X
i=1
ri
!
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.14/149
if σ2
a ≫ σ2
n
N
⇒ a priori knowledge is much better
than the observed data.
if σ2
a ≪ σ2
n
N
⇒ a priori knowledge is not enough and
the estimate uses the received data.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.15/149
For MAP:
The location p(a|r) is maximum is the mean of Gaussian⇒
âMAP(r) = âMS(r) =
σ2
a
σ2
a + σ2
n/N
1
N
N
X
i=1
ri
!
Also because the median of Gaussian occurs at the mean
then for this problem:
âMAP(r) = âMS(r) = âabs(r)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.16/149
This invariance to choice of cost function is obviously be-
cause of the subjective judgements that are frequently in-
volved in choosing the cost function.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.17/149
An example of a nonlinear problem:
rm = a3
+ νm, m = 1, 2, · · · , M, νk ∼ N(0, σν), a ∼ N(0, σa)
p(a|r) = k(r) exp
(
−
1
2
PM
m=1 (rm − a3
)
2
σ2
n
+
a2
σ2
a
#)
âMAP(r) =
(PM
m=1 [rm − a3
] (3a2
)
σ2
n
+
a
σ2
a
)
a=âMAP
= 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.18/149
Example:
Pr(n even|a) =
an
n!
e−a
, n = 0, 1, 2, · · · , ∞
p(a) = λe−λa
, a  0 ⇒
Pr(a|n) =
Pr(n|a)p(a)
Pr(n)
= k(n)

an
n!
e−a
λe−λa

because
R ∞
0
p(a|n) da = 1 ⇒
k(n) =
(1 + λ)(n+1)
λ
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.19/149
âMS(n) =
Z ∞
0
ap(a|n) da =
n + 1
λ + 1
âMAP(n) = max {ln p(r|a) + ln p(a)} =
n
λ + 1
Z âabs
0
p(a|n) da =
Z ∞
âabs
p(a|n) da = polynomial solution, no closed form
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.20/149
The first measure of quality is:
E{â(r)} =
Z ∞
−∞
â(r) p(r|a) dr
1. if E{â(r)} = a, unbiased estimate.
2. if E{â(r)} = a + b, biased, but known.
3. if E{â(r)} = a + b(a), biased, but unknown.
Even an unbiased estimate could yield bad results.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.21/149
Usually the PDF of the estimate is centered around a.
Therefore, the second measure of quality is the variance of
the estimate.
var[â(r) − a] = E

[â(r) − a]2
	
− B2
(a)
General strategy:
We shall try to find an unbiased estimate with small vari-
ance.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.22/149
Maximum Likelihood Estimation (MLE):
r = a + n, p(r|a) = N(a, σn)
We choose the value of a that most likely caused a given
value of a.
The likelihood function (LF) of the observation given the a
is p(r|a), or the log-likelihood function (LLF) ln p(r|a)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.23/149
We maximize LF or LLF with respect to the unknown
parameter
âML(r) is the value of a at which p(r|a) is maximum.
If âML(r) is interior to a, and ln p(r|a) has a maximum then
a = âML(r) is that value.
The ML estimate is the limiting value of MAP as the a priori
knowledge → 0:
MAP:





∂
∂a
ln p(r|a) +
∂
∂a
ln p(a)
| {z }
a priori knowledge





a=âMAP
= 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.24/149
If a(r) is any unbiased estimate of a ⇒
var[a(r) − a]  E
(
∂
∂a
ln p(r|a)
2
)!−1
var[a(r) − a] 

−E

∂2
∂a2
ln p(r|a)
−1
These are called CRLB.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.25/149
Any estimate that satisfies CRLB with equality is called an
efficient estimate.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.26/149
Because â(r) is unbiased:
E[a(r) − a] =
Z ∞
−∞
p(r|a)[a(r) − a] dr = 0
∂
∂a
{E[a(r) − a]} =
Z ∞
−∞
∂p(r|a)
∂a
[a(r) − a] dr − 1 = 0
∂p(r|a)
∂a
=
∂ ln p(r|a)
∂a
p(r|a)
Z ∞
−∞
∂ ln p(r|a)
∂a
p(r|a)[a(r) − a] dr = 1
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.27/149
Z ∞
−∞

∂ ln p(r|a)
∂a
p
p(r|a)
 p
p(r|a)[a(r) − a]

dr = 1
Using Schwartz inequality:
Z ∞
−∞

∂ ln p(r|a)
∂a
2
p(r|a) dr
Z ∞
−∞
p(r|a)[a(r) − a]2
dr
| {z }
var[a(r)−a]
 1 ⇒
var[a(r) − a] 

E
n ∂
∂a
ln p(r|a)
2
o−1
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.28/149
Equality holds iff:
∂ ln p(r|a)
∂a
= k(a)[a(r) − a]
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.29/149
For the 2nd representation:
Z ∞
−∞
p(r|a) dr = 1 ⇒
Z ∞
−∞
∂p(r|a)
∂a
dr =
Z ∞
−∞
∂ ln p(r|a)
∂a
p(r|a) dr = 0
differentiating again:
Z ∞
−∞
∂2
ln p(r|a)
∂a2
p(r|a) dr +
Z ∞
−∞

∂ ln p(r|a)
∂a
2
p(r|a) dr = 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.30/149
E

∂2
ln p(r|a)
∂a2

= −E
(
∂ ln p(r|a)
∂a
2
)
Then, the 2nd representation results.
1. From CRLB, any unbiased estimate must have a
variance greater than a certain limit.
2. if ∂ ln p(r|a)/∂a = k(a)[a(r) − a], âML(r) will satisfy the
CRLB with equality
∂ ln p(r|a)
∂a
a=âML(r)
= 0 = k(a)[a(r) − a] ⇒
a(r) = âML(r) or k(âML(r)) = 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.31/149
Example:
ri = a + ni, i = 1, 2, · · · , N, ni ∼ N(0, σn)
∂ ln p(r|a)
∂a
=
N
σ2
n

1
N
N
X
i=1
ri − a
#
= 0 ⇒ âML(r) =
1
N
N
X
i=1
ri
E [âML(r)] =
1
N
N
X
i=1
E(ri) =
1
N
N
X
i=1
a = a ⇒
âML(r) is the unbiased estimator.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.32/149
The variance of the estimator:
∂2
ln p(r|a)
∂a2
= −
N
σ2
n
⇒
var[âML(r) − a] =
σ2
n
N
→ 0 N → ∞
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.33/149
Example:
Pr(n event |a) =
an
n!
e−a
, n = 0, 1, 2, · · · , N
∂ ln p(n = N|a)
∂a
=
∂
∂a
[N ln a − a − ln N!] =
N
a
−1 =
1
a
[N−a] ⇒ âML(N) = N
∂2
ln p(n = N|a)
∂a2
= −
N
a2
⇒
var [âML(N) − a] = a
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.34/149
Example:
ri = s(a) + ni, i = 1, 2, · · · , N, ni ∼ N(0, σn)
p(r|a) =
1
√
2πσn
N
exp

−
PN
i=1 (ri − s(a))2
2σ2
n
#
∂ ln p(r|a)
∂a
=
1
σ2
n
N
X
i=1
(ri − s(a))
∂s(a)
∂a
In general cannot be written in the form required by:
∂ ln p(r|a)
∂a
= k(a)[a(r) − a]
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.35/149
Therefore, an unbiased efficient estimate does not exist.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.36/149
1
σ2
n
∂s(a)
∂a
 N
X
i=1
ri − Ns(a)
!
a=âML(r)
⇒
âML(r) = s−1 1
N
N
X
i=1
ri
!
∂2
ln p(r|a)
∂a2
=
1
σ2
n
N
X
i=1
[ri − s(a)]
∂2
s(a)
∂a2
−
N
σ2
n

∂s(a)
∂a
2
⇒
E

∂2
ln p(r|a)
∂a2

= −
N
σ2
n

∂s(a)
∂a
2
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.37/149
Because E{
P
ri − s(a)} = 0 ⇒
var[âML(r) − a] 
σ2
n
N
h
∂s(a)
∂a
i2
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.38/149
Example of Bayesian Estimation
Suppose we collect n Poisson distributed data points with
mean θ
Yi ∼ Poiss(θ)
p(yi|θ) = e−θ θyi
yi!
, yi ∈ 0, 1, 2...
Likelihood is:
p(y|θ) =
n
Y
i=1
e−θ θyi
yi!
Suppose prior is exponentially distributed with mean 1/b
p(θ) = b exp(−bθ)u(θ)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.39/149
Posterior is:
p(θ|y) =
p(y|θ)p(θ)
p(y)
=
 n
Q
i=1
e−θ θyi
yi!

b exp(−bθ)
∞
R
0
 n
Q
i=1
e−θ θyi
yi!

b exp(−bθ)dθ
E[Θ|Y = y] =
∞
Z
−∞
θp(θ|y)dθ =
∞
Z
0
θe−(n+b)θ
θT
∞
R
0
e−(n+b)θ̃θ̃T dθ̃
dθ
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.40/149
∞
Z
0
θe−(n+b)θ
θT
dθ =
∞
Z
0
e−(n+b)θ
θT+1
dθ =
Γ(T + 2)
(n + b)T+2
, T ≡
n
X
i=1
yi
E[Θ|Y = y] =
Γ(T+2)
(n+b)T +2
Γ(T+1)
(n+b)T +1
=
(T+1)!
(n+b)T +2
T!
(n+b)T +1
E[Θ|Y = y] =
1 +
n
P
i=1
yi
n + b
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.41/149
MAP Estimate:
p(y|θ)p(θ) = e−nθ θT
n
Q
i=1
yi!
b exp(−bθ)
ln p(θ|y) = −nθ + T ln θ − bθ
d
dθ
ln p(θ|y) = −n +
T
θ
− b = 0
θ̂MAP (y) =
T
n + b
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.42/149
Minimum Absolute Error(MAE)
Must solve:
θ̂MAE(y)
Z
−∞
p(θ|y)dθ = 1/2
θ̂MAE(y)
Z
0
e−(n+b)θ
θT (n + b)T+1
T!
dθ = 1/2
θ̂MAE(y)
Z
−∞
e−(n+b)θ
θT
dθ
| {z }
incomplete Gamma function
=
T!
2(n + b)T+1
Because: AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.43/149
∞
Z
0
e−(n+b)θ
θT
dθ =
Γ(T + 1)
(n + b)T+1
=
T!
(n + b)T+1
The solution for MAE is based on expressing as an “incom-
plete Gamma function” (gammainc in MATLAB); this will
have an inverse, so you could solve for θ̂MAE(y).
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.44/149
Suppose we instead want to estimate
γ =
√
θ
Can we just say
γ̂CME(y) =
q
θ̂CME(y), CME = Conditional Mean Square
γ̂MAP (y) =
q
θ̂MAP (y)
We’ll see the answer is NO.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.45/149
Note that knowing γ is equivalent to knowing θ
pY |Γ(y|γ) = pY |Θ(y|θ)
θ=γ2
=
n
Y
i=1
e−θ θyi
yi!
θ=γ2
=
n
Y
i=1
e−γ2 (γ2
)
yi
yi!
transformations of random variables:
γ = g(θ) =
√
θ, θ = g−1
(γ) = γ2
pΘ(θ) = b exp(−bθ), pΓ(γ) =
dg−1
(γ)
dγ
pΘ(g−1
(γ)) = 2γb exp(−bγ2
)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.46/149
Computing the New MAP Estimate
pY |Γ(y|γ) =
n
Y
i=1
e−γ2 (γ2
)
yi
yi!
, pΓ(γ) = 2γb exp(−bγ2
)
The new logposterior:
H = ln pΓ|Y (γ|y) = −Nγ2
+ (2 ln γ)T + ln γ − bγ2
dH
dγ
= −2Nγ +
2T
γ
+
1
γ
− 2bγ = 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.47/149
γ̂2
MAP (y) =
2T + 1
2N + 2b
=
T + 1/2
N + b
6=
T
N + b
= θ̂MAP (y)
γ̂MAP =
r
T + 0.5
N + b
As an aside, recall:
θ̂CME(y) =
T + 1
N + b
6= γ̂CME(y)
If we went through the same exercise for the MMSE esti-
mate, would probably come to similar conclusions!?
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.48/149
In general, for Bayesian estimates f(θ̂(y)) 6= 
f(θ(y)),
whether MAP, MMSE, MAE, or whatever
For the special case of affine transformations,
γ = f(θ) = aθ + b
γ̂ = aθ̂ + b
γ = g(θ) = aθ + b, θ = g−1
(γ) =
γ − b
a
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.49/149
pY |Γ(y|γ) = pY |Θ

y
γ − b
a

, pΓ(γ) =
1
a
pΘ

γ − b
a

MAP estimation:
ln pΓ|Y (γ|y) = ln pY |Θ

y
γ − b
a

+ ln
1
a
+ ln pΘ

γ − b
a

Similar arguments work for MMSE, MAE, etc.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.50/149
1. For ML estimates as N → ∞ âML(r) → a in probability
sense, this is called a consistent estimate.
2. ML estimate is asymptotically efficient.
lim
N→∞
var [âML(r) − a]

−E
n
∂2 ln p(r|a)
∂a2
o−1 = 1
3. ML estimate is asymptotically Gaussian, N(a, σaǫ )
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.51/149
Method of Moments
To find the method of moments estimator of θ1, · · · , θp we
set up and solve the equations:
µ1 (θ1, · · · , θp) = m1
µ2 (θ1, · · · , θp) = m2
µp (θ1, · · · , θp) = mp
The kth sample moment is defined to be:
mk =
1
n
n
X
i=1
xk
i
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.52/149
Example
Let x1, · · · , xn denote a sample from the uniform
distribution from θ1 to θ2.
f (x; θ1, θ2) =
 1
θ2−θ1
θ1 ≤ x ≤ θ2
0 elsewhere
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.53/149
The joint density of x1, · · · , xn is:
L (θ1, θ2) = f (x1, . . . , xn; θ1, θ2) =
n
Y
i=1
f (xi; θ1, θ2)
=
(
1
(θ2−θ1)n θ1 ≤ x1 ≤ θ2, . . . , θ1 ≤ xn ≤ θ2
0 elsewhere
=
(
1
(θ2−θ1)n θ1 ≤ min
i
(xi) , max
i
(xi) ≤ θ2
0 elsewhere
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.54/149
To find the maximum likelihood estimates of θ1 and θ2 we
determine;
∂L (θ1, θ2)
∂θ1
and
∂L (θ1, θ2)
∂θ2
∂L
∂θ1
=
(
n (θ2 − θ1)−n+1
θ1 ≤ min
i
(xi) , max
i
(xi) ≤ θ2
0 elsewhere
∂L
∂θ2
=
(
−n (θ2 − θ1)−n+1
θ1 ≤ min
i
(xi) , max
i
(xi) ≤ θ2
0 elsewhere
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.55/149
Note ∂L(θ1,θ2)
∂θ1
and ∂L(θ1,θ2)
∂θ2
are never equal to zero.
∂L (θ1, θ2)
∂θ1
is always positive and
∂L (θ1, θ2)
∂θ2
is always negative.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.56/149
hence the maximum likelihood estimates of θ1 and θ2 are
θ̂1 = min
i
(xi) , θ̂2 = max
i
(xi)
This compares with the Method of moments estimators:
θ̃1 = x̄ −
v
u
u
t3
n
n
X
i=1
(xi − x̄)2
!
θ̃2 = x̄ +
v
u
u
t3
n
n
X
i=1
(xi − x̄)2
!
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.57/149
The sampling distribution:
θ̂1 = min
i
(xi) , θ̂2 = max
i
(xi)
solution
We use the distribution function method:
θ̂1 = min
i
(xi) = m, θ̂2 = max
i
(xi) = M
G1 (u) = P [m ≤ u] = P
h
min
i
(xi) ≤ u
i
= 1 − P
h
min
i
(xi) ≥ u
i
= 1 − P [x1 ≥ u, · · · , xn ≥ u]
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.58/149
= 1 − P [x1 ≥ u] · · · P [xn ≥ u] = 1 −

θ2 − u
θ2 − θ1
n
Thus the density of
m = θ̂1 = min
i
(xi)
is
g1 (u) = G′
1 (u) = −n

θ2 − u
θ2 − θ1
n−1 
−1
θ2 − θ1

=
n (θ2 − u)n−1
(θ2 − θ1)n
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.59/149
Is m = θ̂1 = min
i
(xi) unbiased?
E [m] = E
h
θ̂1
i
= E
h
min
i
(xi)
i
=
θ2
Z
θ1
ug1 (u) du =
θ2
Z
θ1
u
n (θ2 − u)n−1
(θ2 − θ1)n du
Put v = θ2u then the above integral becomes
0
Z
θ2−θ1
(θ2 − v)
nvn−1
(θ2 − θ1)n − dv
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.60/149
=
n
(θ2 − θ1)n

θ2
θ2−θ1
Z
0
vn−1
dv −
θ2−θ1
Z
0
vn
dv


E

θ̂1

=
n
(θ2 − θ1)n

θ2
θ2−θ1
Z
0
vn−1
dv −
θ2−θ1
Z
0
vn
dv


=
n
(θ2 − θ1)n

θ2
(θ2 − θ1)n
n
−
(θ2 − θ1)n+1
n + 1
#
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.61/149
= θ2 −
n (θ2 − θ1)
n + 1
=
n
n + 1
θ1 +
1
n + 1
θ2
= θ1 +
1
n + 1
(θ2 − θ1)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.62/149
Is M = θ̂2 = max
i
(xi) unbiased?
E [M] = E
h
θ̂2
i
= E
h
max
i
(xi)
i
=
θ2
Z
θ1
vg2 (v) dv =
θ2
Z
θ1
v
n (v − θ1)n−1
(θ2 − θ1)n dv
Put w = v − θ1 then the above integral becomes
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.63/149
θ2−θ1
Z
0
(w + θ1)
nwn−1
(θ2 − θ1)n dw
=
n
(θ2 − θ1)n


θ2−θ1
Z
0
wn
dw + θ1
θ2−θ1
Z
0
wn−1
dw


E

θ̂2

=
n
(θ2 − θ1)n


θ2−θ1
Z
0
wn
dw + θ1
θ2−θ1
Z
0
wn−1
dw


=
n
(θ2 − θ1)n

(θ2 − θ1)n+1
n + 1
+ θ1
(θ2 − θ1)n
n
#
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.64/149
=
n (θ2 − θ1)
n + 1
+ θ1 =
1
n + 1
θ1 +
n
n + 1
θ2
= θ2 −
1
n + 1
(θ2 − θ1)
E

θ̂2 − θ̂1

= E

θ̂2

− E

θ̂1

=

θ2 − 1
n+1
(θ2 − θ1)

−

θ1 + 1
n+1
(θ2 − θ1)

= 1 − 2
n+1

(θ2 − θ1)
= n−1
n+1

(θ2 − θ1)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.65/149
Hence
E

n + 1
n − 1
h
θ̂2 − θ̂1
i
= θ2 − θ1
We can use this to get rid of the bias of θ̂1 and θ̂2
T1 = θ̂1 −
1
n + 1
n + 1
n − 1
h
θ̂2 − θ̂1
i
= θ̂1 −
1
n − 1
h
θ̂2 − θ̂1
i
= m −
M − m
n − 1
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.66/149
and
T2 = θ̂2 +
1
n + 1
n + 1
n − 1
h
θ̂2 − θ̂1
i
= θ̂2 +
1
n − 1
h
θ̂2 − θ̂1
i
= M +
M − m
n − 1
Then T1 and T2 are unbiased estimators of θ1 and θ2.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.67/149
Uniformly Better
Let x = (x1, x2, · · · , xn) denote the vector of observations
having joint density f(x|θ) where the unknown parameter
vector θ ∈ Ω. Let T(x) and T∗
(x) be estimators of the
parameter φ(θ). Then T(x) is said to be uniformly better
than T∗
(x) if:
MSET(x) (θ) 6 MSET∗(x) (θ) , θ ∈ Ω
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.68/149
Uniformly Minimum Variance
Unbiased Estimator
Let x = (x1, x2, · · · , xn) denote the vector of observations
having joint density f(x|θ) where the unknown parameter
vector θ ∈ Ω. Then T∗
(x) is said to be the UMVU
(Uniformly minimum variance unbiased) estimator of φ(θ)
if:
E[T∗
(x)] = φ(θ), θ ∈ Ω
Var[T∗
(x)] 6 Var[T(x)]
where E[T(x)] = φ(θ).
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.69/149
Multiple Parameter Estimation
âǫ(r) =





a1(r) − a1
a2(r) − a2
.
.
.
aK(r) − aK





= ~
aǫ(r) − ~
a
Cost function for MSE criterion:
C (âǫ(r)) =
K
X
i=1
â2
ǫi
(r) = âT
ǫ (r)âǫ(r)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.70/149
Risk:
RMSE =
Z ∞
−∞
Z ∞
−∞
C (âǫ(r)) p (~
r,~
a) d~
rd~
a
=
Z ∞
−∞
p(~
r) d~
r
| {z }
0
Z ∞
−∞
 K
X
i=1
(âi(r) − ai)2
#
p(a|r) d~
a ⇒
âMSEi
(r) =
Z ∞
−∞
ai p(~
a|~
r) d~
a or ˆ
~
aMSEi
(r) =
Z ∞
−∞
~
a p(~
a|~
r) d~
a
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.71/149
The above estimates hold true over linear transformation.
~
b = DL×K
~
a, E

bT
ǫ (r)bǫ(r)

= E
 L
X
i=1
b2
ǫi
(r)
#
~
bMSE(~
r) = D~
aMSE(~
r)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.72/149
For MAP we find ~
a that max {p(~
a|~
r)}:
∂ ln p(~
a|~
r)
∂ai
a=aMAP(r)
= 0, i = 1, 2, · · · , K
∇a [ln p(~
a|~
r)]|a=aMAP(r) = 0, ∇a =



∂
∂a1
.
.
.
∂
∂aK



For ML:
∇a [ln p(~
r|~
a)]|a=aML(r) = 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.73/149
Bias:
E {~
aǫ(~
r)} = E {~
a(~
r) − ~
a} = ~
B(~
r)
If ~
B(~
r) equal to zero then ~
a(~
r) is an unbiased estimate.
For vector variables the quantity analogous to the variance
is the covariance matrix.
E
n
(~
aǫ − E (~
aǫ))T
(~
aǫ − E (~
aǫ))
o
= Λǫ
E (~
aǫ) = ~
B(~
a)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.74/149
Let’s consider any unbiased estimator of ~
a
σ2
ǫi
= var [ai(~
r) − ai]  Jii
Jii is the ith element in the K × K square matrix J−1
Jij = E

∂ ln p(~
r|~
a)
∂ai
·
∂ ln p(~
r|~
a)
∂aj

This is called Fisher’s information matrix (FIM).
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.75/149
For any estimator we are interested in:
1. Bias:
E {~
a(r)}
2. Error cross-covariance:
E
n
(~
aǫ − E (~
aǫ))T
(~
aǫ − E (~
aǫ))
o
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.76/149
Example:
Consider the random variable Y with
E[Y ] = g(U1, U2, · · · , Uk) =
p
X
i=1
βiφi(U1, U2, · · · , Uk)
and
var(Y ) = σ2
where βi, i = 1, 2, · · · , p are unknown parameters.
where φi, i = 1, 2, · · · , p are known functions of the
nonrandom variables Ui, i = 1, 2, · · · , k
assume further that Y is normally distributed.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.77/149
Thus the PDF of Y is:
f(Y |β1, · · · , βp, σ2
) = f(Y |β, σ2
)
=
1
√
2πσ2
exp

−
1
2s2
[Y − g(U1, U2, · · · , Uk)]2

=
1
√
2πσ2
exp



−
1
2σ2

Y −
p
X
i=1
βiφi (U1, U2, · · · , Uk)
#2



=
1
√
2πσ2
exp

−
1
2σ2
[Y − β1X1 − β2X2 − · · · − βpXp]2

whereXi = φi (U1, U2, · · · , Uk) , i = 1, 2, · · · , p.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.78/149
Now suppose that n independent observations of Y
(y1, y2, · · · , yn)
corresponding to n sets of values of





(u11, · · · , u1k)
(u21, · · · , u2k)
.
.
.
(un1, · · · , unk)





Let xij = φj(ui1, · · · , uik), j = 1, 2, · · · , p, i = 1, 2, · · · , n.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.79/149
Then the joint density of y = (y1, y2, · · · , yn) is:
f(y1, · · · , yn|β1, · · · , βp, σ2
) = f(y|β, σ2
)
=
1
(2πσ2)n/2
exp
(
−
1
2σ2
n
X
i=1
[yi − g(u1i, u2i, ..., uki)]2
)
=
1
(2πσ2)n/2
exp



−
1
2σ2
n
X
i=1

yi −
p
X
j=1
βjφj(u1i, u2i, ..., uki)
#2



AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.80/149
=
1
(2πσ2)n/2
exp



−
1
2σ2
n
X
i=1

yi −
p
X
j=1
βjxij
#2



=
1
(2πσ2)n/2
exp

−
1
2σ2
[y − Xβ]′
[y − Xβ]

=
1
(2πσ2)n/2
exp

−
1
2σ2
[y′
y − 2y′
Xβ + β′
X′
Xβ]

=
1
(2πσ2)n/2
exp

−
1
2σ2
[β′
X′
Xβ]

exp

−
1
2σ2
[y′
y − 2y′
Xβ]

AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.81/149
= h (y) g β, σ2

exp

−
1
2σ2
[y′
y − 2y′
Xβ]

Thus f(y|β, σ2
) is a member of the exponential family of
distributions and
S = (y′
y, X′
y)
is a Minimal Complete set of Sufficient Statistics.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.82/149
The Maximum Likelihood estimates of β and σ2
are the
values
β̂ and σ̂2
that maximize
Ly σ2
, β

=
1
(2πσ2)n/2
exp

−
1
2σ2
[y − Xβ]′
[y − Xβ]

or equivalently
ly σ2
, β

= ln Ly σ2
, β

AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.83/149
= −
n
2
ln (2π) −
n
2
ln σ2

−
1
2σ2
[y − Xβ]′
[y − Xβ]
= −
n
2
ln (2π) −
n
2
ln σ2

−
1
2σ2
[y′
y − 2y′
Xβ + β′
X′
Xβ]
∂ly (σ2
, β)
∂β
= 0
yields the system of linear equations (The Normal
Equations)
X′
Xβ̂ = X′
y
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.84/149
while
∂ly (σ2
, β)
∂σ2
= 0
yields the equation:
σ̂2
=
1
n
h
y − Xβ̂
i′ h
y − Xβ̂
i
If [X′
X]−1
exists then the normal equations have solution:
β̂ = (X′
X)
−1
X′
y
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.85/149
and
σ̂2
=
1
n
h
y − Xβ̂
i′ h
y − Xβ̂
i
=
1
n
h
y − X (X′
X)
−1
X′
y
i′ h
y − X (X′
X)
−1
X′
y
i
=
1
n
h
y′
y − y′
X (X′
X)
−1
X′
y
i
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.86/149
Almost all problems in statistics can be formulated as a
problem of making a decision . That is given some data
observed from some phenomena a decision will have to be
made about the phenomena. Decisions are generally
broken into two types :
Estimation decisions
Hypothesis Testing decisions.
Probability Theory plays a very important role in these de-
cisions.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.87/149
Besides the Normal distribution the following distributions
play an important role in estimation and hypothesis testing:
Chi-squared distribution with ν degrees of freedom
f(x) =
1
Γ(ν/2)2ν/2
x(ν−2)/2
e−x/2
, x  0
Comment: If z1, z2, · · · , zν are independent random
variables each having a standard normal distribution
then U =
Pν
k=1 z2
k has a chi-squared distribution with ν
degrees of freedom.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.88/149
F distribution with ν1 degrees of freedom in the
numerator and ν2 degrees of freedom in the
denominator
f(x) = Kx(ν1−2)/2

1 +
ν1
ν2
x
−(ν1+ν2)/2
, x  0, K =
Γ(ν1+ν2
2
)

ν1
ν2
(ν1/2)
Γ(ν1/2)Γ(ν2/2)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.89/149
Comment: If U1 and U2 are independent random variables
each having Chi-squared distribution with ν1 and ν2
degrees of freedom respectively then
F =
U1
U2
ν1
ν2
has a F distribution with ν1 degrees of freedom in the nu-
merator and ν2 degrees of freedom in the denominator.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.90/149
The t distribution with ν degrees of freedom
f(x) = K

1 +
x2
ν
−(ν+1)/2
, K =
Γ((ν + 1)/2)
Γ(ν/2)
√
πν
Comment: If Z and U are independent random variables,
and Z has a standard Normal distribution while U has a
Chi-squared distribution with ν degrees of freedom then
t =
Z
p
U/ν
has a t distribution with ν degrees of freedom.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.91/149
Goal: Extract useful information out of messy data
Strategy: Formulate probabilistic model of data y, which
depends on underlying parameter(s) θ
Terminology depends on parameter space:
Detection (simple hypothesis testing):
θ ∈ {0, 1}, 0 = target absent, 1 = target present
Classification (multihypothesis testing):
θ ∈ {0, 1, · · · , M}, , i.e.θ ∈ {DC-9, 747, F-15, MiG-31}
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.92/149
Termonology
Suppose θ = (θ1, θ2)
If we are only interested in θ1, then θ2 are called
nuisance parameters
If θ1 = {0, 1}, and θ2 are nuisance parameters, we call
it a composite hypothesis testing problem
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.93/149
Ex: Positron Emission Tomography
Simple, traditional linear DSP-based approach Filtered
Back Projection (FBP)
Advanced, estimation-theoretic approach
Model Poisson “likelihood” of collected data
Markov Random Field (MRF) “prior”on image
Find estimate using expectation-maximization
algorithm (or similar technique)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.94/149
Tasks of Statistical Signal
Processing: Estimation, Detection,
. . .
1. Create statistical model for measured data
2. Find fundamental limitations on our ability to perform
inference on the data
(a) Cramér-Rao bounds, Chernov bounds, etc.
3. Develop an optimal (or suboptimal) estimator
4. Asymptotic analysis (i.e., assume we have lots and lots
of data) of estimator performance to see if it
approaches bounds derived in (2)
5. Do simulations and experiments comparing algorithm
performance to lower bounds and competing
algorithms
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.95/149
A Bayesian analysis treats θ as a random variable with
a “prior” density p(θ)
Data generating machinery is specified by a
conditional density p(y|θ)
Gives the “likelihood” that the data y resulted from
the parameters ?
Inference usually revolves around the posterior density,
derived from Bayes’ theorem:
p(θ|y) =
p(y|θ)p(θ)
p(y)
=
p(y|θ)p(θ)
R
p(y|θ)p(θ)dθ
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.96/149
Classical detection problem:
Design of optimum procedures for deciding
between possible statistical situations given a
random observation:
H0 : Yk ∼ P ∈ P0, k = 1, · · · , n
H1 : Yk ∼ P ∈ P1, k = 1, · · · , n
The model has the following components:
Parameter Space (for parametric detection
problems)
Probabilistic Mapping from Parameter Space to
Observation Space
Observation Space AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.97/149
Parameter Space:
Completely characterizes the output given the
mapping.
Each hypothesis corresponds to a point in the
parameter space. This mapping is one-to-one.
Probabilistic Mapping from Parameter Space to
Observation Space:
The probability law that governs the effect of a
parameter on the observation.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.98/149
Example:
Yk =



Nk
Nk
Nk
, p = 1/2,
, p = 1/4,
, p = 1/4,
Nk ∼ N(0, σ2
)
Nk ∼ N(−1, σ2
)
Nk ∼ N(1, σ2
)
µ =

−1 0 1
T
| {z }
parameter space
p = 1/2, 1/4, 1/4 is the probabilistic mapping.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.99/149
Observation Space:
Finite dimensional, i.e. k = 1, 2, · · · , n where n is
finite.
Detection Rule
Mapping of the observation space into its
parameters in the parameter space is called a
detection rule.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.100/149
Classical estimation problem:
Interested in not making a choice among several
discrete situations, but rather making a choice
among a continuum of possible states.
Think of a family of distributions on the observation
space, indexed by a set of parameters.
Given the observation, determine as accurately as
possible the actual value of the parameter.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.101/149
Example:
Yk = Nk, Nk(µ, σ2
)
In this example, given the observations, parameter µ is
being estimated. Its value is not chosen among a set of
discrete values, but rather is estimated as accurately as
possible.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.102/149
Estimation problem also has the same components as
the detection problem.
Parameter Space Probabilistic Mapping from
Parameter Space to Observation Space
Observation Space
Estimation Rule
Detection problem can be thought of as a special case
of the estimation problem. There are a variety of
estimation procedures differing basically in the amount
of prior information about the parameter and in the
performance criteria applied. Estimation theory is less
structured than detection theory. Detection is science,
estimation is art(I have seen it in a book “Array signal
processing” by Johnson, Dudgeon).
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.103/149
Based on the a priori information about the parameter,
there are two basic approaches to parameter
estimation:
Bayesian Parameter Estimation
Nonrandom Parameter Estimation
Bayesian Parameter Estimation:
Parameter is assumed to be a random quantity
related statistically to the observation.
Nonrandom Parameter Estimation:
Parameter is a constant without any probabilistic
structure.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.104/149
Estimation theory relies on jargon to characterize the
properties of estimators.
The following definitions are used: The set of n
observations are represented by the n-dimensional
vector y ∈ Γ (observation space).
The values of the parameters are denoted by the
vector θ ∈ Λ (parameter space).
The estimate of this parameter vector is denoted by
: Γ → Λ.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.105/149
Definitions (continued): The estimation error ε(y) (ε in
short) is defined by the difference between the
estimate and the actual parameter:
ε(y) = θ̂(y) − θ
The function C(a, θ) is the cost of estimating a true
value of θ as a.
Given such a cost function C, the Bayes risk
(average risk) of the estimator is defined by the
following:
r(θ̂) = E
n
E
n
C[θ̂(Y), Θ]
y
oo
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.106/149
Example
Suppose we would like to minimize the Bayes risk defined
by
r(θ̂) = E
n
E
n
C[θ̂(Y), Θ]
y
oo
for a given cost function C. By inspection, one can see that
the Bayes estimate of θ can be found (if it exists) by
minimizing, for each y ∈ Γ, the posterior cost given Y = y:
E
n
C[θ̂(Y), Θ]
y
o
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.107/149
An estimate is said to be unbiased if the expected value of
the estimate equals the true value of the parameter
E
n
θ̂|θ
o
= θ
Otherwise the estimate is said to be biased. The bias b(θ) is
usually considered to be additive, so that:
b(θ) = E
n
θ̂|θ
o
− θ
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.108/149
An estimate is said to be asymptotically unbiased if the
bias tends to zero as the number of observations tend
to infinity.
An estimate is said to be consistent if the
mean-squared estimation error tends to zero as the
number of observations becomes large.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.109/149
An efficient estimate has a mean-squared error that
equals a particular lower bound: the Cramer-Rao
bound. If an efficient estimate exists, it is optimum in
the mean-squared sense: No other estimate has a
smaller mean-squared error.
Following shorthand notations will also be used for
brevity:
pθ(y) = py|θ(y|θ) = Probability density(y given θ)
Eθ{y} = E{y|θ}
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.110/149
Following definitions and theorems will be useful later
in the presentation:
Definition: Sufficiency
Suppose that Λ is an arbitrary set. A function T :
Γ → Λ is said to be a sufficient statistic for the
parameter set θ ∈ Λ if the distribution of y
conditioned on T(y) does not depend on θ for
θ ∈ Λ.
If knowing T(y) removes any further dependence
on θ of the distribution of y, one can conclude
that T(y) contains all the information in y that is
useful for estimating θ. Hence, it is sufficient.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.111/149
Definition: Minimal Sufficiency
A function T on Γ is said to be minimal sufficient
for the parameter set θ ∈ Λ if it is a function of
every other sufficient statistic for θ.
A minimal sufficient statistic represents the
furthest reduction in the observation without
destroying information about θ.
Minimal sufficient statistic does not necessarily
exist for every problem. Even if it exists, it is
usually very difficult to identify it.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.112/149
Let {x1, x2, · · · , xn} denote a set of observations with joint
density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) then the set of statis-
tics: S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn)
called a set of sufficient statistics if the conditional distri-
bution of {x1, x2, · · · , xn} given S1, S2, · · · , Sq is functionally
independent of the parameters {θ1, θ2, · · · , θp}.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.113/149
Example
Suppose that we observe a success-failure experiment
(Bernoulli trial) n = 3 times.
Let π denote the probability of success. Let x1, x2, x3
denote the observations where
xi =

1 if the ith trial is a success
0 if the ith trial is a failure
Let S = x1 + x2 + x3 = the total number of successes.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.114/149
Joint distribution of x1, x2, x3 ; sampling distribution of S,
conditional distribution of x1, x2, x3 given S.
x1, x2, x3 f(x1, x2, x3; π) S g(S; π) x1, x2, x3|S
0, 0, 0 (1 − π)3
0 (1 − π)3
1
1, 0, 0 π(1 − π)2
1/3
0, 1, 0 π(1 − π)2
1 3π(1 − π)2
1/3
0, 0, 1 π(1 − π)2
1/3
1, 1, 0 π2
(1 − π) 1/3
1, 0, 1 π2
(1 − π) 2 3π2
(1 − π) 1/3
0, 1, 1 π2
(1 − π) 1/3
1, 1, 1 π3
3 π3
1
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.115/149
The data x1, x2, x3 can be thought to be generated in two
ways
1. Generate the data x1, x2, x3 directly from the joint
density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) or
2. Generate the sufficient statistics S1, S2, · · · , Sq from
their joint sampling distribution then generate the
observations x1, x2, x3 from the conditional distribution
of x1, x2, x3 given S1, S2, · · · , Sq. Since the second step
is independent of the parameters θ1, θ2, · · · , θp all of the
information about the parameters will be determined by
the results of the first step.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.116/149
Principle of sufficiency
Any decision about the parameters θ1, θ2, · · · , θp should
be made using the values of the sufficient statistics
S1, S2, · · · , Sq and not otherwise on the data x1, x2, x3.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.117/149
The Likelihood Principle
Any decision about the parameters θ1, θ2, · · · , θp should
be made using the values of the Likelihood function
L(θ1, θ2, · · · , θp) and not otherwise on the data x1, x2, x3.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.118/149
x1, x2, x3 f(x1, x2, x3; π) S g(S; π) L(π)
0, 0, 0 (1 − π)3
0 (1 − π)3
(1 − π)3
1, 0, 0 π(1 − π)2
0, 1, 0 π(1 − π)2
1 3π(1 − π)2
π(1 − π)2
0, 0, 1 π(1 − π)2
1, 1, 0 π2
(1 − π)
1, 0, 1 π2
(1 − π) 2 3π2
(1 − π) π2
(1 − π)
0, 1, 1 π2
(1 − π)
1, 1, 1 π3
3 π3
π3
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.119/149
S L(π)
0
1
2
3
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.120/149
Let x1, x2, x3 denote a set of observations with joint density
f(x1, x2, · · · , xn; θ1, θ2, · · · , θp). Then S1 = S1(x1, · · · , xn),
. . . , Sq = Sq(x1, · · · , xn) are a set of sufficient statistics if
the joint density satisfies:
f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) = g(S1, S2, · · · , Sq; θ1, θ2, · · · , θp)h(x1, · · · , xn)
i. e.dependence on the parameters factors out with the
Sufficient statistics.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.121/149
Example
Let x1, x2, · · · , xn denote a sample from the normal
distribution with mean µ and variance σ2
. The density of xi
is:
f (xi) =
1
√
2πσ
e− 1
2σ2 (xi−µ)2
And the joint density of (x1, x2, · · · , xn) is:
f x1, . . . , xn; µ, σ2

=
n
Y
i=1
1
√
2πσ
e− 1
2σ2 (xi−µ)2
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.122/149
f x1, . . . , xn; µ, σ2

=
1
(2πσ2)
n/
2
e
− 1
2σ2
n
P
i=1
(xi−µ)2
=
1
(2πσ2)
n/
2
e
− 1
2σ2
 n
P
i=1
x2
i −2µ
n
P
i=1
xi+nµ2

n
X
i=1
x2
i =
n
X
i=1
(xi − x̄)2
+ nx̄2
= (n − 1) s2
+ nx̄2
n
X
i=1
xi = nx̄
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.123/149
f x1, . . . , xn; µ, σ2

=
1
(2πσ2)
n/
2
e− 1
2σ2 ((n−1)s2+nx̄2−2nµx̄+nµ2
)
= h (x1, . . . , xn) g x̄, s; µ, σ2

where
g x̄, s; µ, σ2

=
1
(2πσ2)
n/
2
e− 1
2σ2 ((n−1)s2+nx̄2−2nµx̄+nµ2
)
h (x1, . . . , xn) = 1
Thus, x̄ and s are sufficient statistics.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.124/149
The Factorization Theorem:
Suppose that the parameter set θ ∈ Λ has a
corresponding families of densities pθ. A statistic
T is sufficient for θ iff there are functions gθ and h
such that
pθ = gθ[T(y)]h(y)
∀y ∈ Γ and θ ∈ Λ.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.125/149
Example
Consider the hypothesis-testing problem Λ = {0, 1} with
densities p0 and p1. Noting that
pθ(y) =
(
p0(y) ifθ = 0
p1(y)
p0(y)
p0(y) ifθ = 1,
the factorization pθ = gθ[T(y)]h(y) is possible with
h(y) = p0(y)
T(y) = p1(y)/p0(y) ≡ L(y)
gθ(y) =

1 ifθ = 0
t ifθ = 1.
Thus the likelihood ratio L is a sufficient statistic for the bi-
nary hypothesis-testing problem. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.126/149
Rao-Blackwell Theorem
Suppose that g(y) is an unbiased estimate of g(θ) and that
T is sufficient for θ. Define
g̃[T(y)] = Eθ{ĝ(Y)|T(Y) = T(y)}
Then g̃[T(y)] is also an unbiased estimate of g(θ).
Furthermore
Varθ(g̃[T(Y)]) ≤ Varθ(ĝ(Y)),
with equality iff
Pθ(ĝ(Y) = g̃[T(Y)]) = 1.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.127/149
Let (x1, x2, · · · , xn) denote a set of observations with joint
density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp). Let
S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn) denote
a set of sufficient statistics. Let t(x1, x2, · · · , xn) be any
unbiased estimator of the parameter φ = g(θ1, θ2, · · · , θp)
then there exists an unbiased estimator, T(S1, · · · , Sq) of φ
such that
Var(T) 6 Var(t)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.128/149
Proof
Let
T (S1, . . . , Sk) = E (t (x1, . . . , xn) |S1, . . . , Sk )
is the Conditional Expectation of tgivenS1, · · · , Sk
T (S1, . . . , Sk) =
Z
. . .
Z
t (x1, . . . , xn)g (x1, . . . , xn |S1, . . . , Sk ) dx1 . . . dxn
Now t is an unbiased estimator of φ = g(θ1, · · · , θp) Hence
E [t] = φ
Also
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.129/149
E [t] = ES1,...,Sk
[E [t |S1, . . . , Sk ]]
= ES1,...,Sk
[T (S1, . . . , Sk)] = φ
Thus T is also an unbiased estimator of
φ = g(θ1, θ2, · · · , θp)
Finally
V ar [t] = V arS1,...,Sk
[E [t |S1, . . . , Sk ]]
+ES1,...,Sk
[V ar [t |S1, . . . , Sk ]] ≥ V arS1,...,Sk
[T (S1, . . . , Sk)]
Since
ES1,...,Sk
[V ar [t |S1, . . . , Sk ]] ≥ 0
QED.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.130/149
The Rao-Blackwell theorem states that if you have any unbi-
ased estimator t of a parameter (that depends arbitrarily on
the observations) then you can find a better unbiased es-
timator (smaller variance) that is a function of the solely of
sufficient statistics. Thus the best unbiased estimator (min-
imum variance) has to be a function of the sufficient statis-
tics
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.131/149
Thus the search for the UMVU estimator (uniformly min-
imum variance unbiased estimator) is amongst functions
that depend solely on the sufficient statistics.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.132/149
Example
Suppose that Γ = {0, 1, · · · , n}, Λ = {0, 1}, and
pθ(y) = n!
y!(n−y)!
θy
(1 − θ)n−y
, y = 0, . . . , n, 0  θ  1
For any function f on Γ, we have
Eθ{f(Y )} =
n
P
y=0
n!
y!(n−y)!
f(y)θy
(1 − θ)n−y
= (1 − θ)n
n
P
y=0
ayxy
The condition Eθf(Y ) = 0 ∀θ ∈ Γ implies that
n
P
y=0
ayxy
= 0, ∀x  0.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.133/149
However, an nth order polynomial has at most n zeros un-
less all of its coefficients are zero. Hence, θ ∈ Γ is com-
plete.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.134/149
Let (x1, x2, · · · , xn) denote a set of observations with joint
density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) then the set of
statistics:
Let S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn)
denote a set of sufficient statistics
Then S1, · · · , Sq are called a set of complete sufficient
statistics if whenever
E(h(S1, · · · , Sq)) = 0 ⇒ h(S1, · · · , Sq) = 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.135/149
If S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn)
denote a set of sufficient statistics The S1, · · · , Sq are
called a set of complete sufficient statistics if whenever
E(h(S1, · · · , Sq)) = 0 ⇒ h(S1, · · · , Sq) = 0
i.e.,
Z
· · ·
Z
h (S1, . . . , Sk)g (S1, . . . , Sk |θ1, . . . , θp ) dS1 . . . dSk = 0
implies
h (S1, . . . , Sk) = 0
Completeness is sometimes difficult to prove.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.136/149
Example
Suppose that we observe a success-failure experiment
(Bernoulli trial) n = 3 times.
Let π denote the probability of success. Let x1, x2, x3
denote the observations where
xi =

1 if the ith trial is a success
0 if the ith trial is a failure
Let S = x1 + x2 + x3 = the total number of successes.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.137/149
Joint distribution of x1, x2, x3 ; sampling distribution of S,
conditional distribution of x1, x2, x3 given S.
x1, x2, x3 f(x1, x2, x3; π) S g(S; π) x1, x2, x3|S
0, 0, 0 (1 − π)3
0 (1 − π)3
1
1, 0, 0 π(1 − π)2
1/3
0, 1, 0 π(1 − π)2
1 3π(1 − π)2
1/3
0, 0, 1 π(1 − π)2
1/3
1, 1, 0 π2
(1 − π) 1/3
1, 0, 1 π2
(1 − π) 2 3π2
(1 − π) 1/3
0, 1, 1 π2
(1 − π) 1/3
1, 1, 1 π3
3 π3
1
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.138/149
S is a sufficient statistic.
Is it a complete sufficient statistic?
sampling distribution of S
S g(S; π)
0 (1 − π)3
1 3π(1 − π)2
2 3π2
(1 − π)
3 π3
E [h (S)] = h (0) (1 − π)3
+ h (1) 3π (1 − π)2
+ h (2) 3π2
(1 − π) + h (3) π3
=
= h (0) + 3 [h (1) − h (0)] π + 3 [h (0) − 2h (1) + h (2)] π2
+ [h (3) − 3h (2) + 3h (1) + h (0)] π3
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.139/149
E [h (S)] = 0, for all values of π
i.e.,
p (π) = h (0) + 3 [h (1) − h (0)] π + 3 [h (0) − 2h (1) + h (2)] π2
+ [h (3) − 3h (2) + 3h (1) + h (0)] π3
= 0, ⇒
h (0) = 0,
3 [h (1) − h (0)] = 0,
h (0) − 2h (1) + h (2) = 0,
h (3) − 3h (2) + 3h (1) + h (0) = 0
Thus , h (0) = h (1) = h (2) = h (3) = 0
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.140/149
S is a complete sufficient statistic.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.141/149
If S1, · · · , Sq are called a set of complete sufficient statistics
and T1 = h1(S1, · · · , Sq) and T2 = h2(S1, · · · , Sq) Are
unbiased estimators of φ.
Then T1 = T2 E(T1) = E(T2) = φ hence E(T1 − T2) = 0.
and h2(S1, · · · , Sq) − h1(S1, · · · , Sq) and T2 = T1
Thus there is only one unbiased estimator of φ that is a
function of complete sufficient statistics
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.142/149
Lehmann-Scheffe Theorem
Let x1, x2, · · · , xn denote a set of observations
with joint density f(x1, · · · , xn; θ1, · · · , θq). Let
S1 = S1(x1, · · · , xn),. . . , Sq = Sq(x1, · · · , xn) denote a
set of complete sufficient statistics.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.143/149
Let T(S1, · · · , Sq) be an unbiased estimator of the
parameter φ = g(θ1, · · · , θq) then T is the uniform minimum
variance unbiased (UMVU) estimator of φ. That is if
t(x1, x2, · · · , xn) is an unbiased estimator of φ then
Var(T) 6 Var(t)
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.144/149
Example
We observe a success-failure experiment (Bernoulli trial)
n = 3 times. the probability of success is π.
Let x1, x2, x3 denote the observations where
xi =

1 if the ith trial is a success
0 if the ith trial is a failure
Let S = x1 + x2 + x3 = the total number of successes.
E
1
3
S

= E
1
3
(x1 + x2 + x3)

= 1
3
(E [x1] + E [x2] + E [x3])
= 1
3
(π + π + π) = π
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.145/149
S
3
is an unbiased estimator of π.
S
3
is the uniform minimum variance unbiased (UMVU) estimator of π.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.146/149
The strategy to find the UMVU estimator
1. Find a set of complete sufficient statistics
S1, S2, · · · , Sk.
2. To find an unbiased estimator that depends only on the
set of complete sufficient statistics
T(S1, S2, · · · , Sk).
3. Lehman Scheffe theorem.
Maximum Likelihood estimators are functions of a set of
complete sufficient statistics S1, S2, · · · , Sk.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.147/149
Factorization criterion
L(θ1, · · · , θp) = f(x1, · · · , xn; θ1, · · · , θp)
= g(S1, · · · , Sk; θ1, · · · , θp)h(x1, · · · , xn)
θ1, · · · , θp will maximize L(θ1, · · · , θp) if
θ1, · · · , θp will maximize g(S1, · · · , Sk; θ1, · · · , θp)
These will depend on S1, · · · , Sk.
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.148/149
1. Finding Maximum Likelihood estimators.
2. Checking if there is a set of complete sufficient
statistics -S1, · · · , Sk.
3. Checking if the maximum Likelihood estimators they
are unbiased.
4. Making adjustments to these estimators if they are not
unbiased.
This is the standard way of finding UMVU estimators
AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.149/149
C
a
r
e
f
u
l
Being Careful
• Suppose Y has the density
p(y; µ) =
n
Y
i=1
1
aσc
exp

−

yi − µ
aσ
4
#
where a ≈ 1.4464 and c is a constant which
makes the density normalize to 1.
l(µ) = −
n
X
i=1

yi − µ
aσ
4
• Try usual “set derivative equal to zero”
dl(µ)
dµ
= 4
n
X
i=1

yi − µ
aσ
3
= 0
• This will be a 3rd order polynomial in µ,
which will in general have 3 solutions
• Would have to just compute the
loglikelihood for each solution and see
which one gives the biggest result
• Suppose Y = Y1, · · · , Yn, is i.i.d. with a
Cauchy-like density (but not really
1
C
a
r
e
f
u
l
Cauchy!) (Maximum Penalized Likelihood
Estimation)
p(y; c, γ) =
n
Y
i=1
γ
2 (γ + |yi − c|)
2
• Loglikelihood:
l(c, γ) = n ln γ − 2
n
X
i=1
ln (γ + |yi − c|)
l(y; c, γ) = n ln γ − 2
X
i∈{yi≥c}
ln[γ + (yi − c)]
−2
X
i∈{yic}
ln[γ + (c − yi)]
• Suppose γ is given and we want to estimate
c taking derivative:
dl(y; c)
dc
= −2
X
i∈{yi≥c}
−1
γ + yi − c
−2
X
i∈{yi≥c}
1
γ + c − yi
= 2
n
X
i=1
sign(yi − c)
γ + |yi − c|
• We would be tempted to say that the ML
2
C
a
r
e
f
u
l
estimator of c is just the solution of
dl(y; c)
dc
= 2
n
X
i=1
sign(yi − c)
γ + |yi − c|
= 0
• Well. . . theres actually more than one
solution to this, so how about picking the
solution which gives the greatest
likelihood?
• but that is really a trap!!!
• Let’s check out the second derivative:
d2
l(y; c)
dc2
= 2
n
X
i=1
[sign(yi − c)]2
(γ + |yi − c|)
2  0
• Those critical points were really
local minima!!!
• Where is the real maxima?
• Notice that |x| is not differentiable at
x = 0, hence
p(y; c, γ) = n ln γ − 2
n
X
i=1
ln (γ + |yi − c|)
3
C
a
r
e
f
u
l
is not differentiable at c=any of the data
points!
• To get the real ML estimate, try each yi for
c, and see which one gives the biggest
likelihood
• “It turns out” the ML estimate is one of
the median points
4
E
M
Expectation-Maximization Algorithm
• The EM procedure is a way of making
iterative algorithms for maximizing
loglikelihoods or Bayesian logposteriors
when no closed-form solution is available
• There’s a more powerful and more general
EM formulation by Csiszar based on
Information Theory EM Algorithm
• Incomplete data Y, that we actually
measure
• Goal: maximize the
incomplete data loglikelihood(function of
specific collected data)
lid(θ) = log pY (y; θ)
• Complete data Z, a hypothetical data set
• Tool: complete data loglikelihood (function
of complete data as a random variable)
lcd(θ) = log pZ(z; θ)|z=Z = log pZ(Z; θ)
• Complete data space must be “larger” and
1
E
M
determine the incomplete data, i.e. there
must be a many-to-one mapping y = h(z)
The EM Recipe
• Step 1: Decide on a complete data space
• Step 2: The expectation step
Q(θ; θ̂old
) = E[lcd(θ)|Y = y; θ̂old
]
• Step 3: The maximization step
θ̂new
= arg max
θ
Q(θ; θ̂old
)
• Start with a feasible initial guess θ̂old
then
iterate steps 2 and 3 (which can usually be
combined)
What is that Expectation?
E[lcd|Y = y; θ̂old
] =
Z
pZ|Y (z|y; θ̂old
) log pZ(z; θ)dz
pZ|Y (z|y; θ̂old
) =





pZ (z;θ̂old
)
R
Z(y)
pZ (z̃;θ̂old)dz̃
z ∈ Z(y)
0 z /
∈ Z(y)
2
E
M
Z
Z(y)
pZ(z; θ̂old
)dz = pY (y; θ̂old
)
Aspects of EM Algorithms
• Incomplete data loglikelihood is guaranteed
to increase with each EM iteration
• Must be careful; might converge to a local
maxima which depends on the starting
point
• Often, the estimates naturally stay in the
feasible space (i.e., nonnegativity
constraints
• In many problems, a candidate complete
data space naturally suggests itself
Ex: Poisson Signal in Additive Poisson
Noise
Y = S + N
S ∼ Poisson(θ), N ∼ Poisson(λN ),
• Incomplete-data loglikelihood is
lid(θ) = −(θ + λN ) + y ln(θ + λN ),
3
E
M
• ML estimator can be found in closed form
θ̂(y) = max(0, y − λN )
Choose the Complete Data
• Can often choose the complete data in
several different ways; try to choose to
make remaining steps easy
• Different choices lead to different
algorithms; some will converge “faster”
than others.
• Here, take complete data to be Z = (S, N);
suppose we could magically measure the
signal and noise counts separately!
• Complete data loglikelihood is:
lcd(θ) = [−θ + S ln(θ)] + [−λN + N ln(λN )]
The E-Step
Q(θ; θ̂old
) = E[lcd(θ)|Y = y; θ̂old
]
= E[−(θ + λN ) + S ln(θ) + N ln(λN )|y; θ̂old
]
= −(θ + λN ) + E[S|y; θ̂old
] ln(θ)
4
E
M
+E[N|y; θ̂old
] ln(λN )
• Often convenient to leave explicit
computation of conditional expectation
until the last minute
• As with loglikelihoods, we sometimes drop
terms which are constants w.r.t. θ
The M-Step
θ̂new
= arg max
θ≥0
Q(θ; θ̂old
)
• Take derivative as usual
d
dθ
Q(θ; θ̂old
) = −1 +
E[S|y; θ̂old
]
θ
• Setting equal to zero yields
θ̂new
= E[S|y; θ̂old
]
• Now we just have to compute that
expectation. (That’s usually the hardest
part.)
That Conditional Expectation
E[S|y; θ̂old
] =
Z
spS(s|y; θ̂old
)ds
5
E
M
• Let’s look at the conditional density:
pS(s|y; θ̂old
) =
pY |S(y|s; θ̂old
)pS(s; θ̂old
)
pY (y; θ̂old)
=
exp[−λN ]λy−s
N
(y−s)! I(y ≥ s)exp[−θ̂old
](θ̂old
)s
s!
exp[−(θ̂old + λN )](θ̂old + λN )y/y!
=
y!
s!(y − s)!
λy−s
N
(θ̂old + λN )y−s
(θ̂old
)s
(θ̂old + λN )s
I(s ≤ y)
• We observe that the conditional density is
just binomial. For 0 6 s 6 y,
pS (s|y; θ̂
old
) =


y
s




θ̂old
θ̂old + λN


s
λN
θ̂old + λN
!y−s
E[S|y; θ̂old
] = y
θ̂old
θ̂old + λN
• So this particular EM algorithm is:
θ̂new
= E[S|y; θ̂old
] = y
θ̂old
θ̂old + λN
• Let’s see if our analytic formula for the
maximizer, θ̂ = max(0, y − λN ), is a fixed
point for the EM iteration
6
E
M
• For y  λN ,
θ̂new
= y
θ̂old
θ̂old + λN
y − λN = y
y − λN
y − λN + λN
y − λN = y − λN
• For y  λN , immediately get 0=0
• So everything is good
Back in Bayesian land
• EM also good for MAP estimation; just
add the logprior to the Q-function
θ̂new
= arg max
θ≥0
QP (θ; θ̂old
)
QP (θ; θ̂old
) = E[lcd|Y = y; θ̂old
] + log p(θ)
• Consider previous example, with an
exponential prior with mean 1/a
QP (θ; θ̂old
) = −θ + E[S|y; θ̂old
] ln(θ) − aθ
QP (θ; θ̂old
)
dθ
= −1 +
E[S|y; θ̂old
]
θ
− a
7
E
M
θ̂old
=
E[S|y; θ̂old
]
1 + a
=
θ̂old
θ̂old + λN
!
y
1 − a
Expectation-Maximization Algorithm
(Theory) Convergence of the EM
Algorithm
• We’d like to prove that the likelihood goes
up with each iteration:
Lid(θ̂new
) ≥ Lid(θ̂old
)
• Recall from last pages:
pZ|Y (z|y; θ) =
pZ(z; θ)
pY (y; θ)
z ∈ Z(y) = {z : h(z) = y}
ln pY (y; θ) = ln pZ(z; θ) − ln pZ|Y (z|y; θ)
• Multiply both sides by the same thing and
integrate with respect to z :
R
pZ|Y (z|y; θ̂old
) ln pY (y; θ)dz
=
R
pZ|Y (z|y; θ̂old
) ln pZ(z; θ)dz
−
R
pZ|Y (z|y; θ̂old
) ln pZ|Y (z|y; θ)dz
8
E
M
• simplifies to:
Lid(θ) = Q(θ; θ̂old
)−
Z
pZ|Y (z|y; θ̂old
) ln pZ|Y (z|y; θ)dz z
• Evaluate z at θ = θ̂new
and θ = θ̂old
Lid(θ̂new
) = Q(θ̂new
; θ̂old
)−
Z
pZ|Y (z|y; θ̂old
) ln pZ|Y (z|y; θ̂new
)dz, ♠
Lid(θ̂old
) = Q(θ̂old
; θ̂old
)−
Z
pZ|Y (z|y; θ̂old
) ln pZ|Y (z|y; θ̂old
)dz, ♣
• Subtract ♣ from ♠
Lid(θ̂new
) − Lid(θ̂old
) = Q(θ̂new
; θ̂old
) − Q(θ̂old
; θ̂old
)
−
R
pZ|Y (z|y; θ̂old
) ln
pZ|Y (z|y;θ̂new
)
pZ|Y (z|y;θ̂old)
dz
Lid(θ̂new
) − Lid(θ̂old
) = Q(θ̂new
; θ̂old
) − Q(θ̂old
; θ̂old
)
−
R
pZ|Y (z|y; θ̂old
) ln
pZ|Y (z|y;θ̂new
)
pZ|Y (z|y;θ̂old)
dz
• A really helpful inequality: ln x 6 x − 1
Lid(θ̂new
) − Lid(θ̂old
) ≥ Q(θ̂new
; θ̂old
) − Q(θ̂old
; θ̂old
)
−
Z
pZ|Y (z|y; θ̂old
)

pZ|Y (z|y; θ̂new
)
pZ|Y (z|y; θ̂old
)
− 1
#
dz
| {z }
focus on this term
9
E
M
Z
pZ|Y (z|y; θ̂old
)

pZ|Y (z|y; θ̂new
)
pZ|Y (z|y; θ̂old)
− 1
#
dz
=
Z
pZ|Y (z|y; θ̂new
) − pZ|Y (z|y; θ̂old
)dz
=
Z
pZ|Y (z|y; θ̂new
)dz−
Z
pZ|Y (z|y; θ̂old
)dz = 0
• Now we have:
Lid(θ̂new
)−Lid(θ̂old
) ≥ Q(θ̂new
; θ̂old
)−Q(θ̂old
; θ̂old
)
• Recall the definition of the M-step:
θ̂new
= arg max
θ
Q(θ; θ̂old
)
• So, by definition
Q(θ̂new
; θ̂old
) ≥ Q(θ̂old
; θ̂old
)
⇒ Lid(θ̂new
) ≥ Lid(θ̂old
)
• Notice we showed that the likelihood was
nondecreasing; that doesn’t automatically
10
E
M
imply that the parameter estimates
converge
• Parameter estimate could slide along a
contour of constant loglikelihood
• Can prove some things about parameter
convergence in special cases Ex: EM
Algorithm for Imaging from Poisson Data
(i.e. Emission Tomography)
Generalized EM Algorithms
• Recall this line:
Lid(θ̂new
)−Lid(θ̂old
) ≥ Q(θ̂new
; θ̂old
)−Q(θ̂old
; θ̂old
)
• What if the M-step is too hard? Try a
“generalized” EM algorithm:
θ̂new
= some easy to compute θ
such thatQ(θ; θ̂old
) ≥ Q(θ̂old
; θ̂old
)
• Problem: EM algorithms tend to be slow
• Observation: “Bigger” complete data
spaces result in slower algorithms than
“smaller” complete data spaces
11
E
M
• SAGE (Space-Alternating Generalized
Expectation-Maximization)
– Split big complete data space into
several smaller “hidden” data spaces
– Designed to yield faster convergence
• Generalization of “ordered subsets” EM
algorithm
12
W
i
e
n
e
r
Wiener Filtering
• Context: Bayesian linear MMSE
estimation for random sequences
• Parameter sequence {Θk, k ∈ Z}
• Data sequence {Yk, k ∈ I ⊂ Z}
• Goal: Estimate {θk} as a linear function of
the observations:
θ̂k(y) =
X
j∈I
h(k, j)yj
• Find h to minimize mean square error
• By the orthogonality principle
E[(θ̂k(Y ) − Θk)Y ∗
i ] = 0fori ∈ I
E[(
X
j∈I
h(k, j)Yj − Θk)Y ∗
i ] = 0
X
j∈I
h(k, j)E[YjY ∗
i ] = E[ΘkY ∗
i ]
X
j∈I
h(k, j)rY (j, i) = rΘY (k, i)
| {z }
This is Wiener-Hopf equation
1
W
i
e
n
e
r
• If processes are stationary, we can write
X
j∈I
h(k, j)rY (j − i) = rΘY (k − i)
• If I = Z it turns out the filter is LTI:
X
j∈I
h(k − j)rY (j − i) = rΘY (k − i)
• consider i = 0
X
j∈I
h(k − j)rY (j) = rΘY (k)
• Can solve W-H in the Z-transform domain:
H(z)SY (z) = SΘY (z)
H(z) =
SΘY (z)
SY (z)
• MSE:
MSE = E[|(Θk − θ̂(Y ))|2
]
= E[(Θk − θ̂(Y ))(Θ∗
k − θ̂∗
(Y ))]
= E[(Θk − θ̂(Y ))Θ∗
k] +E[(Θk − θ̂(Y ))θ̂∗
(Y )]
= E[ΘkΘ∗
k] − E[θ̂k(Y )Θ∗
k]]
2
W
i
e
n
e
r
MSE = E[ΘkΘ∗
k] − E[θ̂k(Y )Θ∗
k]]
= E[ΘkΘ∗
k] − E[
X
j∈I
h(k − j)YjΘ∗
k]]
= E[ΘkΘ∗
k] −
X
j∈I
h(k − j)E[YjΘ∗
k]
= rΘ(0) −
X
j∈I
h(k − j)rY Θ(j − k)
• Since everything is stationary, can just take
k = 0
MSE = rΘ(0) − (h ∗ rY Θ)(0)
MSE = rΘ(0) − (h ∗ rY Θ)(0)
=
π
Z
−π
SΘ(ω) − H(ω)SY Θ(ω)dω
=
π
Z
−π
SΘ(ω) −
SΘY (ω)
SY (ω)
SY Θ(ω)dω
=
π
Z
−π
SΘ(ω) −
|SΘY (ω)|2
SY (ω)
dω
3
W
i
e
n
e
r
Deblurring
• Suppose object is observed through a
blurring point spread function f and
additive noise W
Yk = (f ∗ Θ)k + Wk
• Suppose Θ and W are uncorrelated
zero-mean
SY = |F|
2
Sθ + SW , and SΘY = F∗
SY
• So the Wiener filter is
H(z) =
SΘY (z)
SY (z)
=
F∗
(z)SΘ(z)
|F(z)|
2
Sθ(z) + SW (z)
Interpretation of the Deblurring Filter
If noise is negligible, i.e. SW (ω) ≈ 0
H(ω) =
F∗
Sθ
|F|
2
Sθ + SW
≈
F∗
Sθ
FF∗Sθ
=
1
F
• Even if there is no noise, in
implementation, straight division by F(ω)
is often ill-posed and not a good idea
4
W
i
e
n
e
r
(round off errors, etc.)
Deblurring Error
MSE =
π
Z
−π
SΘ(ω) −
|SΘY (ω)|2
SY (ω)
dω
=
π
Z
−π
SΘ(ω) −
|F(ω)|
2
|SΘ(ω)|2
|F(ω)|
2
SΘ(ω) + SW (ω)
dω
=
π
Z
−π
SΘ[|F|
2
SΘ + SW ] − |F|
2
|SΘ|2
|F|
2
SΘ + SW
dω
=
π
Z
−π
SΘSW
|F|
2
SΘ + SW
dω
Competing Approaches
• Competing approaches include iterative
methods such as the “Richardson-Lucy”
algorithm (an EM-style procedure)
– Computationally intensive
– Can naturally incorporate nonnegativity
– Sometimes better match to real
5
W
i
e
n
e
r
statistics
Discussion
• Advantage of Wiener approach:
– LTI filtering implementation
• Disadvantages of Wiener approach:
– No natural way to incorporate
nonnegativity constraints (in image
processing, for instance)
– Only truly optimal for Gaussian
statistics
Real-Time Wiener Filtering
• What if we don’t have “future”
measurements?
• Must restrict h to be causal
• Solution:
H(z) =
1
S−
Y (z)

SΘY (z)
S+
Y (z)

+
where the meaning of the plus and minus
superscripts and subscripts will be defined
6
W
i
e
n
e
r
on later slides Spectral Factorization
• If Y has a spectrum satisfying the
Paley-Wiener criterion:
π
Z
−π
log SY (ω)dω  −∞
then the spectrum can be factored as
SY (ω) = S+
Y (ω)S−
Y (ω)
where F−1
{S+
Y } is causal
and F−1
{S−
Y } is anticausal
Factoring Rational Spectra
• If the spectrum is a ratio of polynomials,
we can factor as
SY (z) = S+
Y (z)
| {z }
Poles and zeros
inside unit circle
S−
Y (z)
| {z }
Poles and zeros
outside unit circle
= S+
Y (z)S+
Y (z−1
)
7
W
i
e
n
e
r
• Aside: spectral factorization into causal
and anticausal factors is analogous to
Cholesky decomposition of a covariance
matrix into lower and upper triangular
factors
Causal Part Extraction
• We can split f into its causal and
anticausal parts:
f(k) = {f(k)}+
| {z }
causal
+ {f(k)}−
| {z }
anticausal
f(k)+ = f(k)u(k), {f(k)}− = f(k)u(−k−1)
• Use similar notation for Z-transform
domain
F(z) = {F(z)}+ + {F(z)}−
{F}+ = Z{Z−1
{F}u(k)}
{F}− = Z{Z−1
{F}u(−k − 1)}
How to Extract Causal Parts
• If F is a ratio of polynomials can usually
8
W
i
e
n
e
r
do a partial fraction expansion:
F(z) = {F(z)}+
| {z }
?

Poles and zeros
inside unit circle
+ {F(z)}−
| {z }
?
Poles and zeros
outside unit circle
• Can also do polynomial long division
9
C
h
e
r
n
o
ff
Chernoff Bounds (Theory)
• General purpose likelihood ratio test
p(y; H1)
p(y; H0)
or
p(y|H1)
p(y|H0)
H1


H0
λ
• Consider the loglikelihood ratio test
L ≡ ln Λ = ln
p(y|H1)
p(y|H0)
H1


H0
ln λ ≡ γ
• Conditional error probabilities:
PD =
∞
Z
γ
pL|H1
(ℓ|H1)dℓ, PF A =
∞
Z
γ
pL|H0
(ℓ|H0)dℓ
• it is often difficult, if not impossible, to
find simple formulas for
pL|H1
(ℓ|H1), pL|H0
(ℓ|H0)
• Makes computing probabilities of detection
and false alarm difficult
– We could use Monte Carlo simulations,
but those are cumbersome
1
C
h
e
r
n
o
ff
– Alternative: find easy to compute,
analytic bounds on the error
probabilities
• Discussion based on Van Trees
A Moment Generating Function
ΦL|H0
(s) = E[esL
|H0] =
Z ∞
−∞
esℓ
pL(ℓ|H0)dℓ
=
Z
Y
exp[sL(y)]pY (y|H0)dy
=
Z
Y
exp

s ln
pY (y|H1)
pY (y|H0)

pY (y|H0)dy
=
Z
Y

pY (y|H1)
pY (y|H0)
s
pY (y|H0)dy
=
Z
Y
[pY (y|H1)]
s
[pY (y|H0)]
1−s
dy
• Define a new random variable Xs (for
various values of s) with density
pXs
(x) ≡
esx
pL|H0
(x|H0)
∞
R
−∞
esℓpL|H0
(ℓ|H0)dℓ
2
C
h
e
r
n
o
ff
µ(s) ≡ ln ΦL|H0
(s) = ln
∞
Z
−∞
esL
p(ℓ|H0)dℓ
µ̇(s) =
R ∞
−∞
ℓesℓ
p(ℓ|H0)dℓ
R ∞
−∞
esℓp(ℓ|H0)dℓ
= E[Xs]
µ̈(s) = var[Xs]
µ̇(0) =
R ∞
−∞
ℓe0ℓ
p(ℓ|H0)dℓ
R ∞
−∞
e0ℓp(ℓ|H0)dℓ
= E[L|H0]
µ̇(1) =
R ∞
−∞
ℓp(ℓ|H1)
p(ℓ|H0) p(ℓ|H0)dℓ
R ∞
−∞
p(ℓ|H1)
p(ℓ|H0) p(ℓ|H0)dℓ
= E[L|H1]
• with
µ(s) = ln ΦL|H0
(s) = ln
∞
Z
−∞
esL
p(l|H0)dl
• Then
∞
Z
γ
exp[µ(s) − sx]pXs (x)dx
3
C
h
e
r
n
o
ff
=
∞
Z
γ
exp[µ(s)]e−sx esx
pL|H0
(x|H0)
∞
R
−∞
esℓpL|H0
(ℓ|H0)dℓ
dx
=
∞
Z
γ
pL|H0
(x|H0)dx = PF A
PF A =
∞
Z
γ
exp[µ(s) − sx]pXs
(x)dx
= eµ(s)
∞
Z
γ
e−sx
pXs
(x)dx ≤ eµ(s)
∞
Z
γ
e−sγ
pXs
(x)dx
= exp[µ(s) − sγ]
∞
Z
γ
pXs
(x)dx
≤ exp[µ(s) − sγ]
PF A ≤ exp[µ(s) − sγ]
• We want the s ≥ 0 which makes the RHS
as small as possible
d
ds
[µ(s) − sγ] = µ̇(s) − γ,
4
C
h
e
r
n
o
ff
µ̇(s) = γ
• Assuming everything worked (things exist,
equation for maximizing s solvable, etc.):
PF A ≤ exp[µ(s) − sµ̇(s)]
PM ≤ exp[µ(s) + (1 − s)γ]
• We want the s ≤ 1 which makes the RHS
as small as possible
d
ds
[µ(s) + (1 − s)γ] = µ̇(s) − γ
µ̇(s) = γ
• Assuming everything worked (things exist,
equation for maximizing s solvable, etc.):
PM ≤ exp[µ(s) + (1 − s)µ̇(s)]
PF A ≤ exp[µ(s) − sµ̇(s)], 0 6 s 6 1
PM ≤ exp[µ(s)+(1−s)µ̇(s)], where γ = µ̇(s)
µ̇(0) ≤ γ ≤ µ̇(1)
E[L|H0] ≤ γ ≤ E[L|H1]
Why is this useful? L can often be easily
5
C
h
e
r
n
o
ff
described by its moment generating
function.
• Let s = sm satisfy µ̇(sm) = γ = 0
Pe =
1
2
PF A +
1
2
PM
≤
1
2
exp[µ(s)]
Z ∞
0
pXs
(x)dx+
1
2
exp[µ(s)]
Z 0
−∞
pXs
(x)dx
Pe ≤
1
2
exp[µ(sm)]
PF A = eµ(s)
∞
Z
µ̇(s)
e−sx
pXs
(x)dx
= exp[µ(s)−sµ̇(s)]
∞
Z
µ̇(s)
exp[+s(µ̇(s) − x)]pXs
(x)dx
= exp[µ(s)−sµ̇(s)]
∞
Z
0
exp[−s
p
µ̈(s)z]pZ(z)dz
where
Z =
Xs − E[Xs]
p
var[Xs]
=
Xs − µ̇(s)
p
µ̈(s)
6
C
h
e
r
n
o
ff
exp[µ(s) − sµ̇(s)]
∞
Z
0
exp[−s
p
µ̈(s)z]pZ(z)dz
| {z }
Original Chernoff
inequality was formed
by replacing this with 1.
We can get a tighter
constant in some
asymptotic cases.
Asymptotic Gaussian Approximation
• In some cases, Z approaches a Gaussian
random variable as the number of samples
n grows large (ex: data points iid with
finite means and variances)
∞
Z
0
exp[−s
p
µ̈(s)z]
1
√
2π
exp

z2
2

dz
= exp

s2
µ̈(s)
2

Q(s
p
µ̈(s))
7
C
h
e
r
n
o
ff
PF A = exp[µ(s)−sµ̇(s)]
∞
Z
0
exp[−s
p
µ̈(s)z]pZ(z)dz
≈ exp[µ(s)−sµ̇(s)] exp

s2
µ̈(s)
2

Q(s
p
µ̈(s))
• If s
p
µ̈(s)  3 we can approximate Q(·)
using an upper bound
Q(a) ≤
1
a
√
2π
exp

−
a2
2

PF A ≈
1
p
2πs2µ̈(s)
exp[µ(s) − sµ̇(s)]
Similar Analysis Works for PM
PM ≈ eµ(s)+(1−s)µ̇(s)
exp

(s − 1)2
µ̈(s)
2

Q((1−s)
p
µ̈(s))
• If (1 − s)
p
µ̈(s)  3 we can approximate
Q(·) using the upper bound
PM ≈
1
p
2π(1 − s)2µ̈(s)
exp[µ(s) + (1 − s)µ̇(s)]
Asymptotic Analysis for Pe
8
C
h
e
r
n
o
ff
• For the case of equal priors and equal
costs, if the conditions for the
approximation for Q(·) to be valid on the
previous to slides holds, we have
Pe ≈
1
2sm(1 − sm)
p
2πµ̈(sm)
exp[µ(sm)]
9
C
h
e
r
n
o
ff
Chernoff Bounds (Gaussian Examples)
Consider the loglikelihood ratio test
L ≡ ln Λ = ln
p(y|H1)
p(y|H0)
H1


H0
ln λ ≡ γ
Main object of interest: µ(s) ≡ ln ΦL|H0
(s)
ΦL|H0
(s) = E[esL
|H0] =
Z ∞
−∞
esl
pL(l|H0)dl
=
Z
Y
[p(y|H1)]
s
[p(y|H0)]
1−s
dy
Both representations will be useful
10
C
h
e
r
n
o
ff
Gaussian, Equal Variances
H1 ∼ N(m, σ2
), H0 ∼ N(0, σ2
)
µ(s) = ln
Z
Y
[p(y|H1)]
s
[p(y|H0)]
1−s
dy
= ln
∞
Z
−∞
· · ·
∞
Z
−∞
( n
Y
i=1
1
√
2πσ2
exp

−
(yi − m)2
2σ2
)s
×
( n
Y
i=1
1
√
2πσ2
exp

−
y2
i
2σ2
)1−s
dy1 · · · dyn
= n ln
∞
Z
−∞
1
√
2πσ2
exp

−
(y − m)2
s + y2
(1 − s)
2σ2

dy
11
C
h
e
r
n
o
ff
Completing the Square
∞
Z
−∞
1
√
2πσ2
exp

−
(y − m)2
s + y2
(1 − s)
2σ2

dy
=
∞
Z
−∞
1
√
2πσ2
exp

−
(y2
− 2my + m2
)s + y2
(1 − s)
2σ2

dy
=
∞
Z
−∞
1
√
2πσ2
exp

−
y2
− 2msy + m2
s
2σ2

dy
=
∞
Z
−∞
1
√
2πσ2
exp

−
(y2
− 2msy + m2
s2
) − m2
s2
+ m2
s
2σ2

12
C
h
e
r
n
o
ff
Finish Computing the µ
=
∞
Z
−∞
1
√
2πσ2
exp

−
(y2
− 2msy + m2
s2
) − m2
s2
+ m2
s
2σ2

=
∞
Z
−∞
1
√
2πσ2
exp

−
(y − ms)2
2σ2

exp

−
m2
s(1 − s)
2σ2

dy
= exp

m2
s(s − 1)
2σ2

= tmp
µ(s) = n ln {tmp} =
s(s − 1)
2
nm2
σ2
≡
s(s − 1)
2
d2
13
C
h
e
r
n
o
ff
Basic Bound on PFA
µ(s) =
s(s − 1)
2
d2
, µ̇(s) =
2s − 1
2
d2
PF A ≤ exp[µ(s) − sµ̇(s)], for 0 ≤ s ≤ 1
= exp

s(s − 1)
2
d2
− s
(2s − 1)
2
d2

= exp

−
s2
2
d2

where γ = µ̇(s), γ =
2s − 1
2
d2
s =
γ
d2
+
1
2
14
C
h
e
r
n
o
ff
Basic Bound on PM
Pm ≤ exp[µ(s) + (1 − s)µ̇(s)]
= exp

s(s − 1)
2
d2
+ (1 − s)
(2s − 1)
2
d2

= exp

s2
− s
2
d2
+
2s − 1 − 2s2
+ s
2
d2

= exp

2s − 1 − s2
2
d2

= exp

−
(1 − s)2
2
d2

15
C
h
e
r
n
o
ff
Where are the Bounds Meaningful? Recall we
need
E[L|H0] ≤ γ ≤ E[L|H1]
µ̇(0) ≤ γ ≤ µ̇(1)
2 · 0 − 1
2
d2
≤ γ ≤
2 · 1 − 1
2
d2
−
d2
2
≤ γ ≤
d2
2
16
C
h
e
r
n
o
ff
The Refined Bound for PFA Recall the refined
asymptotic bound:
PF A ≈ exp[µ(s) − sµ̇(s)] exp


s2µ̈(s)
2

 Q(s
q
µ̈(s))
µ̇(s) =
2s − 1
2
d2
, µ̈(s) = d2
In this case, since L is a sum of Gaussian
random variables, the expression is exact:
PF A = exp

−
s2
2
d2

exp

s2
d2
2

Q(sd) = Q(sd)
17
C
h
e
r
n
o
ff
The Refined Bound for PM
PM ≈ eµ(s)+(1−s)µ̇(s)
exp

(s − 1)2
µ̈(s)
2

Q((1−s)
p
µ̈(s))
= exp

−
(1 − s)2
2
d2

exp

(s − 1)2
d2
2

Q((1−s)d)
Again, since L is Gaussian, the expression is
exact:
PM = Q((1 − s)d)
18
C
h
e
r
n
o
ff
Minimum Prob. of Error For minimum prob.
of error test, γ = 0
sM =
γ
d2
+
1
2
=
1
2
Recall approximate expression for Pe from last
slide of last lecture
Pe ≈
1
2sm(1 − sm)
p
2πµ̈(sm)
exp[µ(sm)]
=
1
2sm(1 − sm)
√
2πd2
exp

sm(sm − 1)
2
d2

19
C
h
e
r
n
o
ff
Min. Prob. of Error Cont
Pe ≈
2
√
2πd2
exp

−
d2
8

Recall the exact expression is:
Pe = Q(d/2)
Van Trees’ rule of thumb: approximation is
very good for d  6
20
C
h
e
r
n
o
ff
The Bhattacharyya Distance If the criterion is
the minimum prob. of error and µ(s) is
symmetric about s = 1/2, then
µ(s) = ln
Z
Y
p
p(y|H1)
p
p(y|H0)dy
µ(s) is called the Battacharyya distance
21
C
h
e
r
n
o
ff
Gaussian, Equal Means
H1 ∼ N(0, σ2
1), H0 ∼ N(0, σ2
0),
µ(s) =
n
2
ln
(σ2
0)s
(σ2
1)1−s
sσ2
0 + (1 − s)σ2
1
A common special case:
σ2
1 = σ2
s + σ2
n, σ2
0 = σ2
n
µ(s) =
n
2

(1 − s) ln

1 +
σ2
s
σ2
n

− ln

1 + (1 − s)
σ2
s
σ2
n

Gaussian, Equal Means
µ̇(s) =
n
2

− ln

1 +
σ2
s
σ2
n

+
σ2
s /σ2
n
1 + (1 − s)σ2
s /σ2
n

µ̈(s) =
n
2

σ2
s /σ2
n
1 + (1 − s)σ2
s /σ2
n
2
22
U
M
P
Uniformly Most Powerful Tests
• Usual parametric data model p(y; θ)
• Consider a composite problem:
H0 : θ = θ0, H1 : θ ∈ S1
• A test φ∗
is uniformly most powerful of
level α = PF A if it has a better PD (or at
least as good as) than any other αlevel test
PD(φ∗
; θ) = Eθ[φ∗
] ≥ Eθ[φ] = PD(φ, θ)
for all θ ∈ S1, for all φ
α = PF A(φB) = PF A(φA) = PF A(φ∗
)
PD(φB,θ)
PD(φA,θ)
PD(φ∗
,θ)
Figure 1: UMP test.
1
U
M
P
• Find the most powerful α-level (recall
α = PF A) test for a fixed θ
• Just the Neyman-Pearson test
If the decision regions do not vary with θ,
then the test is UMP
Gaussian Mean Example
• Suppose we have n i.i.d samples
Yi ∼ N(µ, σ2
)
• Assume σ2
is known, but µ is not
• Consider three cases
H0 : µ = 0
Case I : H1 : µ  0
Case II : H1 : µ  0
Case III : H1 : µ 6= 0
Suffices to use
ȳ =
1
n
n
X
i=1
yi
2
U
M
P
Ȳ ∼ N(µ, σ2
/n)
Λ(y; µ) =
p(y; µ)
p(y; 0)
=
exp[−(ȳ − µ)2
/(2σ2
/n)]
exp[−ȳ2/(2σ2/n)]
=
exp[(−ȳ2
+ 2ȳµ − µ2
)/(2σ2
/n)]
exp[−ȳ2/(2σ2/n)]
= exp

nµ
σ2
ȳ −
nµ2
2σ2
 H1


H0
τ
√
nµ
σ
ȳ
H1


H0

ln τ +
nµ2
2σ2

σ
√
n
≡ γ
• Case I: µ  0
√
nµ
σ
ȳ
H1


H0
γ −→
√
n
σ
ȳ
H1


H0
γ
µ
= γ+
• Set the threshold to get the right “level”
α = PF A = Pr
 √
n
σ
Ȳ  γ+
H0

= Q(γ+
)
γ+
= Q−1
(α)
• Notice the test does not depend on µ;
3
U
M
P
hence, it is UMP
PD = Pr
 √
n
σ
Ȳ  γ+
H1

= Pr
 √
n
σ
(Ȳ − µ)  γ+
−
√
n
σ
µ

Más contenido relacionado

La actualidad más candente

Signal modelling
Signal modellingSignal modelling
Signal modelling
Debangi_G
 
Chapter 6m
Chapter 6mChapter 6m
Chapter 6m
wafaa_A7
 
Dft and its applications
Dft and its applicationsDft and its applications
Dft and its applications
Agam Goel
 
Isi and nyquist criterion
Isi and nyquist criterionIsi and nyquist criterion
Isi and nyquist criterion
srkrishna341
 

La actualidad más candente (20)

Eye diagram in Communication
Eye diagram in CommunicationEye diagram in Communication
Eye diagram in Communication
 
Digital Communication Unit 1
Digital Communication Unit 1Digital Communication Unit 1
Digital Communication Unit 1
 
Sampling Theorem
Sampling TheoremSampling Theorem
Sampling Theorem
 
Nyquist criterion for zero ISI
Nyquist criterion for zero ISINyquist criterion for zero ISI
Nyquist criterion for zero ISI
 
Wiener filters
Wiener filtersWiener filters
Wiener filters
 
DSP_2018_FOEHU - Lec 03 - Discrete-Time Signals and Systems
DSP_2018_FOEHU - Lec 03 - Discrete-Time Signals and SystemsDSP_2018_FOEHU - Lec 03 - Discrete-Time Signals and Systems
DSP_2018_FOEHU - Lec 03 - Discrete-Time Signals and Systems
 
Pulse modulation
Pulse modulationPulse modulation
Pulse modulation
 
Signal modelling
Signal modellingSignal modelling
Signal modelling
 
ADDITTIVE WHITE GAUSIAN NOIS ( AWGN)
ADDITTIVE WHITE GAUSIAN NOIS ( AWGN)ADDITTIVE WHITE GAUSIAN NOIS ( AWGN)
ADDITTIVE WHITE GAUSIAN NOIS ( AWGN)
 
Eye pattern
Eye patternEye pattern
Eye pattern
 
Lecture Notes on Adaptive Signal Processing-1.pdf
Lecture Notes on Adaptive Signal Processing-1.pdfLecture Notes on Adaptive Signal Processing-1.pdf
Lecture Notes on Adaptive Signal Processing-1.pdf
 
Sampling theorem
Sampling theoremSampling theorem
Sampling theorem
 
Chapter 6m
Chapter 6mChapter 6m
Chapter 6m
 
Dft and its applications
Dft and its applicationsDft and its applications
Dft and its applications
 
Information theory
Information theoryInformation theory
Information theory
 
FILTER DESIGN
FILTER DESIGNFILTER DESIGN
FILTER DESIGN
 
Pulse Modulation ppt
Pulse Modulation pptPulse Modulation ppt
Pulse Modulation ppt
 
Nyquist criterion for distortion less baseband binary channel
Nyquist criterion for distortion less baseband binary channelNyquist criterion for distortion less baseband binary channel
Nyquist criterion for distortion less baseband binary channel
 
Isi and nyquist criterion
Isi and nyquist criterionIsi and nyquist criterion
Isi and nyquist criterion
 
Digital System Processing
Digital System Processing Digital System Processing
Digital System Processing
 

Similar a Detection & Estimation Theory

Adjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORTAdjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORT
Subhajit Sahu
 

Similar a Detection & Estimation Theory (20)

New tools from the bandit literature to improve A/B Testing
New tools from the bandit literature to improve A/B TestingNew tools from the bandit literature to improve A/B Testing
New tools from the bandit literature to improve A/B Testing
 
Arima model (time series)
Arima model (time series)Arima model (time series)
Arima model (time series)
 
career point.pptx
career point.pptxcareer point.pptx
career point.pptx
 
Linear Regression
Linear RegressionLinear Regression
Linear Regression
 
1. linear model, inference, prediction
1. linear model, inference, prediction1. linear model, inference, prediction
1. linear model, inference, prediction
 
Time series Modelling Basics
Time series Modelling BasicsTime series Modelling Basics
Time series Modelling Basics
 
Top school in noida
Top school in noidaTop school in noida
Top school in noida
 
Bayesian Inference and Uncertainty Quantification for Inverse Problems
Bayesian Inference and Uncertainty Quantification for Inverse ProblemsBayesian Inference and Uncertainty Quantification for Inverse Problems
Bayesian Inference and Uncertainty Quantification for Inverse Problems
 
Enhance interval width of crime forecasting with ARIMA model-fuzzy alpha cut
Enhance interval width of crime forecasting with ARIMA model-fuzzy alpha cutEnhance interval width of crime forecasting with ARIMA model-fuzzy alpha cut
Enhance interval width of crime forecasting with ARIMA model-fuzzy alpha cut
 
Geometric Sequence and Geometric Mean
Geometric Sequence and Geometric MeanGeometric Sequence and Geometric Mean
Geometric Sequence and Geometric Mean
 
Quantile and Expectile Regression
Quantile and Expectile RegressionQuantile and Expectile Regression
Quantile and Expectile Regression
 
Slide2
Slide2Slide2
Slide2
 
Chap04
Chap04Chap04
Chap04
 
Mle
MleMle
Mle
 
Generic Reinforcement Schemes and Their Optimization
Generic Reinforcement Schemes and Their OptimizationGeneric Reinforcement Schemes and Their Optimization
Generic Reinforcement Schemes and Their Optimization
 
Ramsey number lower bounds
Ramsey number lower  boundsRamsey number lower  bounds
Ramsey number lower bounds
 
A Comparative Study of Acoustic Echo Cancellation Algorithms in Sparse Impuls...
A Comparative Study of Acoustic Echo Cancellation Algorithms in Sparse Impuls...A Comparative Study of Acoustic Echo Cancellation Algorithms in Sparse Impuls...
A Comparative Study of Acoustic Echo Cancellation Algorithms in Sparse Impuls...
 
Measures of Linkage Disequilibrium
Measures of Linkage DisequilibriumMeasures of Linkage Disequilibrium
Measures of Linkage Disequilibrium
 
Adjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORTAdjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORT
 
Cramer row inequality
Cramer row inequality Cramer row inequality
Cramer row inequality
 

Más de HAmindavarLectures

Más de HAmindavarLectures (13)

"Digital communications" undergarduate course lecture notes
"Digital communications" undergarduate course lecture notes"Digital communications" undergarduate course lecture notes
"Digital communications" undergarduate course lecture notes
 
Wavelet Signal Processing
Wavelet Signal ProcessingWavelet Signal Processing
Wavelet Signal Processing
 
Stochastic Processes - part 6
Stochastic Processes - part 6Stochastic Processes - part 6
Stochastic Processes - part 6
 
Stochastic Processes - part 5
Stochastic Processes - part 5Stochastic Processes - part 5
Stochastic Processes - part 5
 
Stochastic Processes - part 4
Stochastic Processes - part 4Stochastic Processes - part 4
Stochastic Processes - part 4
 
Stochastic Processes - part 3
Stochastic Processes - part 3Stochastic Processes - part 3
Stochastic Processes - part 3
 
Stochastic Processes - part 2
Stochastic Processes - part 2Stochastic Processes - part 2
Stochastic Processes - part 2
 
Stochastic Processes - part 1
Stochastic Processes - part 1Stochastic Processes - part 1
Stochastic Processes - part 1
 
Random Variables
Random VariablesRandom Variables
Random Variables
 
Cyclo-stationary processes
Cyclo-stationary processesCyclo-stationary processes
Cyclo-stationary processes
 
Multivariate Gaussin, Rayleigh & Rician distributions
Multivariate Gaussin, Rayleigh & Rician distributionsMultivariate Gaussin, Rayleigh & Rician distributions
Multivariate Gaussin, Rayleigh & Rician distributions
 
Introduction to communication systems
Introduction to communication systemsIntroduction to communication systems
Introduction to communication systems
 
Advanced Communications Theory
Advanced Communications Theory Advanced Communications Theory
Advanced Communications Theory
 

Último

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 

Último (20)

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 

Detection & Estimation Theory

  • 1. Introduction to Estimation Theory Bayesian (Random) Parameter Estimation Nonrandom Parameter Estimation AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.1/149
  • 2. In estimation problem we assign a cost to all pairs [a, â(r)] over the range of interest. In many cases of interest it is realistic to assume that the cost depends only on the error: AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.2/149
  • 3. aǫ(r) = (â(r) − a)2 aǫ(r) = |â(r) − a| aǫ(r) = 0, |aǫ| 6 ∆ 2 0, |aǫ| ∆ 2 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.3/149
  • 4. Our goal is to find an estimate that minimizes the expected value of the cost R = E {c [a, â(r)]} = Z ∞ −∞ Z ∞ −∞ c [a, â(r)] p(a, r) drda R is the risk involved in doing the estimation of a out of observation(s) r. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.4/149
  • 5. RRMS = Z ∞ −∞ Z ∞ −∞ (â(r) − a)2 p(a, r) drda = Z ∞ −∞ drp(r) Z ∞ −∞ (â(r) − a)2 p(a|r)da Because p(r) 0 we can minimize the inner integral: d dâ Z ∞ −∞ (â(r) − a)2 p(a|r) da = −2 Z ∞ −∞ ap(a|r) da +2â(r) Z ∞ −∞ p(a|r) da = 0 ⇒ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.5/149
  • 6. Then, the mean square estimate is represented as: âRMS(r) = Z ∞ −∞ a p(a|r)da We have seen this before as the conditional mean. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.6/149
  • 7. d dâ −2 Z ∞ −∞ ap(a|r)da + 2â(r) Z ∞ −∞ p(a|r) da = 2 0 Because the second derivative is positive âRMS is the mini- mum. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.7/149
  • 8. The Bayes estimate for the absolute value criterion: Rabs = Z ∞ −∞ Z ∞ −∞ |â(r) − a| p(a, r) drda = Z ∞ −∞ drp(r) Z ∞ −∞ |â(r) − a| p(a|r)da The inner integral: I(r) = Z â(r) −∞ (â(r) − a) p(a|r)da + Z ∞ â(r) (a − â(r)) p(a|r)da ⇒ d dâ(r) I(r) = 0 ⇒ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.8/149
  • 9. Z â(r) −∞ (â(r) − a) p(a|r)da = Z ∞ â(r) (a − â(r)) p(a|r)da ⇒ This is the definition for the median. The absolute error criterion leads to the determination of estimate of a out of the median of the observation(s) r. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.9/149
  • 10. Runif = Z ∞ −∞ drp(r) 1 − Z âunif(r)+ ∆ 2 âunif(r)− ∆ 2 p(a|r)da # Minimizing Runif amounts to maximizing Z âunif(r)+ ∆ 2 âunif(r)− ∆ 2 p(a|r)da ⇒ âunif(r) occurs where p(a|r)|â(r)=a is maximum. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.10/149
  • 11. This is MAP. A necessary, but not sufficient condition for a Max. is d da {ln p(a|r)}
  • 12.
  • 13.
  • 14.
  • 15. âMAP=a = 0 p(a|r) = p(r|a)p(a) p(r) ⇒ ln p(a|r) = ln p(r|a) + ln p(a) − ln p(r) | {z } not a function of a ⇒ max {ln p(a|r)} ≡ max {ln p(r|a) + ln p(a)} AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.11/149
  • 16. Example ri = a+ni, i = 1, 2, · · · , N, a ∼ N(0, σa), ni ∼ N(0, σn) ⇒ p(r|a) = ΠN i=1 1 σn √ 2π exp − (ri − a)2 2σ2 n p(a) = 1 σa √ 2π exp − a2 2σ2 a We need to compute R ∞ −∞ ap(a|r)da, one approach could be p(a|r) = p(a)p(r|a)/p(r) but this is tedious. However, one can observe that p(a|r) is PDF then: AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.12/149
  • 17. p(a|r) = 1 p(r) ( 1 (2π)N/2σN n 1 √ 2πσa exp − PN i=1(ri − a)2 2σ2 n # exp − a2 2σ2 a ) p(a|r) = k(r) exp    − 1 2σ2 p a − σ2 a σ2 a + σ2 n/N 1 N N X i=1 ri !#2    σ2 p = 1 σ2 a + N σ2 n −1 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.13/149
  • 18. We see that p(a|r) is Gaussian, then: âMS(r) is the conditional mean: âMS(r) = σ2 a σ2 a + σ2 n/N 1 N N X i=1 ri ! AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.14/149
  • 19. if σ2 a ≫ σ2 n N ⇒ a priori knowledge is much better than the observed data. if σ2 a ≪ σ2 n N ⇒ a priori knowledge is not enough and the estimate uses the received data. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.15/149
  • 20. For MAP: The location p(a|r) is maximum is the mean of Gaussian⇒ âMAP(r) = âMS(r) = σ2 a σ2 a + σ2 n/N 1 N N X i=1 ri ! Also because the median of Gaussian occurs at the mean then for this problem: âMAP(r) = âMS(r) = âabs(r) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.16/149
  • 21. This invariance to choice of cost function is obviously be- cause of the subjective judgements that are frequently in- volved in choosing the cost function. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.17/149
  • 22. An example of a nonlinear problem: rm = a3 + νm, m = 1, 2, · · · , M, νk ∼ N(0, σν), a ∼ N(0, σa) p(a|r) = k(r) exp ( − 1 2 PM m=1 (rm − a3 ) 2 σ2 n + a2 σ2 a #) âMAP(r) = (PM m=1 [rm − a3 ] (3a2 ) σ2 n + a σ2 a )
  • 23.
  • 24.
  • 25.
  • 26.
  • 28. Example: Pr(n even|a) = an n! e−a , n = 0, 1, 2, · · · , ∞ p(a) = λe−λa , a 0 ⇒ Pr(a|n) = Pr(n|a)p(a) Pr(n) = k(n) an n! e−a λe−λa because R ∞ 0 p(a|n) da = 1 ⇒ k(n) = (1 + λ)(n+1) λ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.19/149
  • 29. âMS(n) = Z ∞ 0 ap(a|n) da = n + 1 λ + 1 âMAP(n) = max {ln p(r|a) + ln p(a)} = n λ + 1 Z âabs 0 p(a|n) da = Z ∞ âabs p(a|n) da = polynomial solution, no closed form AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.20/149
  • 30. The first measure of quality is: E{â(r)} = Z ∞ −∞ â(r) p(r|a) dr 1. if E{â(r)} = a, unbiased estimate. 2. if E{â(r)} = a + b, biased, but known. 3. if E{â(r)} = a + b(a), biased, but unknown. Even an unbiased estimate could yield bad results. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.21/149
  • 31. Usually the PDF of the estimate is centered around a. Therefore, the second measure of quality is the variance of the estimate. var[â(r) − a] = E [â(r) − a]2 − B2 (a) General strategy: We shall try to find an unbiased estimate with small vari- ance. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.22/149
  • 32. Maximum Likelihood Estimation (MLE): r = a + n, p(r|a) = N(a, σn) We choose the value of a that most likely caused a given value of a. The likelihood function (LF) of the observation given the a is p(r|a), or the log-likelihood function (LLF) ln p(r|a) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.23/149
  • 33. We maximize LF or LLF with respect to the unknown parameter âML(r) is the value of a at which p(r|a) is maximum. If âML(r) is interior to a, and ln p(r|a) has a maximum then a = âML(r) is that value. The ML estimate is the limiting value of MAP as the a priori knowledge → 0: MAP:      ∂ ∂a ln p(r|a) + ∂ ∂a ln p(a) | {z } a priori knowledge     
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 41. If a(r) is any unbiased estimate of a ⇒ var[a(r) − a] E ( ∂ ∂a ln p(r|a) 2 )!−1 var[a(r) − a] −E ∂2 ∂a2 ln p(r|a) −1 These are called CRLB. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.25/149
  • 42. Any estimate that satisfies CRLB with equality is called an efficient estimate. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.26/149
  • 43. Because â(r) is unbiased: E[a(r) − a] = Z ∞ −∞ p(r|a)[a(r) − a] dr = 0 ∂ ∂a {E[a(r) − a]} = Z ∞ −∞ ∂p(r|a) ∂a [a(r) − a] dr − 1 = 0 ∂p(r|a) ∂a = ∂ ln p(r|a) ∂a p(r|a) Z ∞ −∞ ∂ ln p(r|a) ∂a p(r|a)[a(r) − a] dr = 1 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.27/149
  • 44. Z ∞ −∞ ∂ ln p(r|a) ∂a p p(r|a) p p(r|a)[a(r) − a] dr = 1 Using Schwartz inequality: Z ∞ −∞ ∂ ln p(r|a) ∂a 2 p(r|a) dr Z ∞ −∞ p(r|a)[a(r) − a]2 dr | {z } var[a(r)−a] 1 ⇒ var[a(r) − a] E n ∂ ∂a ln p(r|a) 2 o−1 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.28/149
  • 45. Equality holds iff: ∂ ln p(r|a) ∂a = k(a)[a(r) − a] AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.29/149
  • 46. For the 2nd representation: Z ∞ −∞ p(r|a) dr = 1 ⇒ Z ∞ −∞ ∂p(r|a) ∂a dr = Z ∞ −∞ ∂ ln p(r|a) ∂a p(r|a) dr = 0 differentiating again: Z ∞ −∞ ∂2 ln p(r|a) ∂a2 p(r|a) dr + Z ∞ −∞ ∂ ln p(r|a) ∂a 2 p(r|a) dr = 0 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.30/149
  • 47. E ∂2 ln p(r|a) ∂a2 = −E ( ∂ ln p(r|a) ∂a 2 ) Then, the 2nd representation results. 1. From CRLB, any unbiased estimate must have a variance greater than a certain limit. 2. if ∂ ln p(r|a)/∂a = k(a)[a(r) − a], âML(r) will satisfy the CRLB with equality ∂ ln p(r|a) ∂a
  • 48.
  • 49.
  • 50.
  • 51. a=âML(r) = 0 = k(a)[a(r) − a] ⇒ a(r) = âML(r) or k(âML(r)) = 0 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.31/149
  • 52. Example: ri = a + ni, i = 1, 2, · · · , N, ni ∼ N(0, σn) ∂ ln p(r|a) ∂a = N σ2 n 1 N N X i=1 ri − a # = 0 ⇒ âML(r) = 1 N N X i=1 ri E [âML(r)] = 1 N N X i=1 E(ri) = 1 N N X i=1 a = a ⇒ âML(r) is the unbiased estimator. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.32/149
  • 53. The variance of the estimator: ∂2 ln p(r|a) ∂a2 = − N σ2 n ⇒ var[âML(r) − a] = σ2 n N → 0 N → ∞ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.33/149
  • 54. Example: Pr(n event |a) = an n! e−a , n = 0, 1, 2, · · · , N ∂ ln p(n = N|a) ∂a = ∂ ∂a [N ln a − a − ln N!] = N a −1 = 1 a [N−a] ⇒ âML(N) = N ∂2 ln p(n = N|a) ∂a2 = − N a2 ⇒ var [âML(N) − a] = a AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.34/149
  • 55. Example: ri = s(a) + ni, i = 1, 2, · · · , N, ni ∼ N(0, σn) p(r|a) = 1 √ 2πσn N exp − PN i=1 (ri − s(a))2 2σ2 n # ∂ ln p(r|a) ∂a = 1 σ2 n N X i=1 (ri − s(a)) ∂s(a) ∂a In general cannot be written in the form required by: ∂ ln p(r|a) ∂a = k(a)[a(r) − a] AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.35/149
  • 56. Therefore, an unbiased efficient estimate does not exist. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.36/149
  • 58.
  • 59.
  • 60.
  • 61.
  • 62. a=âML(r) ⇒ âML(r) = s−1 1 N N X i=1 ri ! ∂2 ln p(r|a) ∂a2 = 1 σ2 n N X i=1 [ri − s(a)] ∂2 s(a) ∂a2 − N σ2 n ∂s(a) ∂a 2 ⇒ E ∂2 ln p(r|a) ∂a2 = − N σ2 n ∂s(a) ∂a 2 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.37/149
  • 63. Because E{ P ri − s(a)} = 0 ⇒ var[âML(r) − a] σ2 n N h ∂s(a) ∂a i2 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.38/149
  • 64. Example of Bayesian Estimation Suppose we collect n Poisson distributed data points with mean θ Yi ∼ Poiss(θ) p(yi|θ) = e−θ θyi yi! , yi ∈ 0, 1, 2... Likelihood is: p(y|θ) = n Y i=1 e−θ θyi yi! Suppose prior is exponentially distributed with mean 1/b p(θ) = b exp(−bθ)u(θ) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.39/149
  • 65. Posterior is: p(θ|y) = p(y|θ)p(θ) p(y) = n Q i=1 e−θ θyi yi! b exp(−bθ) ∞ R 0 n Q i=1 e−θ θyi yi! b exp(−bθ)dθ E[Θ|Y = y] = ∞ Z −∞ θp(θ|y)dθ = ∞ Z 0 θe−(n+b)θ θT ∞ R 0 e−(n+b)θ̃θ̃T dθ̃ dθ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.40/149
  • 66. ∞ Z 0 θe−(n+b)θ θT dθ = ∞ Z 0 e−(n+b)θ θT+1 dθ = Γ(T + 2) (n + b)T+2 , T ≡ n X i=1 yi E[Θ|Y = y] = Γ(T+2) (n+b)T +2 Γ(T+1) (n+b)T +1 = (T+1)! (n+b)T +2 T! (n+b)T +1 E[Θ|Y = y] = 1 + n P i=1 yi n + b AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.41/149
  • 67. MAP Estimate: p(y|θ)p(θ) = e−nθ θT n Q i=1 yi! b exp(−bθ) ln p(θ|y) = −nθ + T ln θ − bθ d dθ ln p(θ|y) = −n + T θ − b = 0 θ̂MAP (y) = T n + b AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.42/149
  • 68. Minimum Absolute Error(MAE) Must solve: θ̂MAE(y) Z −∞ p(θ|y)dθ = 1/2 θ̂MAE(y) Z 0 e−(n+b)θ θT (n + b)T+1 T! dθ = 1/2 θ̂MAE(y) Z −∞ e−(n+b)θ θT dθ | {z } incomplete Gamma function = T! 2(n + b)T+1 Because: AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.43/149
  • 69. ∞ Z 0 e−(n+b)θ θT dθ = Γ(T + 1) (n + b)T+1 = T! (n + b)T+1 The solution for MAE is based on expressing as an “incom- plete Gamma function” (gammainc in MATLAB); this will have an inverse, so you could solve for θ̂MAE(y). AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.44/149
  • 70. Suppose we instead want to estimate γ = √ θ Can we just say γ̂CME(y) = q θ̂CME(y), CME = Conditional Mean Square γ̂MAP (y) = q θ̂MAP (y) We’ll see the answer is NO. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.45/149
  • 71. Note that knowing γ is equivalent to knowing θ pY |Γ(y|γ) = pY |Θ(y|θ)
  • 72.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78. θ=γ2 = n Y i=1 e−γ2 (γ2 ) yi yi! transformations of random variables: γ = g(θ) = √ θ, θ = g−1 (γ) = γ2 pΘ(θ) = b exp(−bθ), pΓ(γ) =
  • 79.
  • 80.
  • 81.
  • 83.
  • 84.
  • 85.
  • 86. pΘ(g−1 (γ)) = 2γb exp(−bγ2 ) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.46/149
  • 87. Computing the New MAP Estimate pY |Γ(y|γ) = n Y i=1 e−γ2 (γ2 ) yi yi! , pΓ(γ) = 2γb exp(−bγ2 ) The new logposterior: H = ln pΓ|Y (γ|y) = −Nγ2 + (2 ln γ)T + ln γ − bγ2 dH dγ = −2Nγ + 2T γ + 1 γ − 2bγ = 0 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.47/149
  • 88. γ̂2 MAP (y) = 2T + 1 2N + 2b = T + 1/2 N + b 6= T N + b = θ̂MAP (y) γ̂MAP = r T + 0.5 N + b As an aside, recall: θ̂CME(y) = T + 1 N + b 6= γ̂CME(y) If we went through the same exercise for the MMSE esti- mate, would probably come to similar conclusions!? AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.48/149
  • 89. In general, for Bayesian estimates f(θ̂(y)) 6= f(θ(y)), whether MAP, MMSE, MAE, or whatever For the special case of affine transformations, γ = f(θ) = aθ + b γ̂ = aθ̂ + b γ = g(θ) = aθ + b, θ = g−1 (γ) = γ − b a AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.49/149
  • 90. pY |Γ(y|γ) = pY |Θ y
  • 91.
  • 92.
  • 93.
  • 94. γ − b a , pΓ(γ) = 1 a pΘ γ − b a MAP estimation: ln pΓ|Y (γ|y) = ln pY |Θ y
  • 95.
  • 96.
  • 97.
  • 98. γ − b a + ln 1 a + ln pΘ γ − b a Similar arguments work for MMSE, MAE, etc. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.50/149
  • 99. 1. For ML estimates as N → ∞ âML(r) → a in probability sense, this is called a consistent estimate. 2. ML estimate is asymptotically efficient. lim N→∞ var [âML(r) − a] −E n ∂2 ln p(r|a) ∂a2 o−1 = 1 3. ML estimate is asymptotically Gaussian, N(a, σaǫ ) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.51/149
  • 100. Method of Moments To find the method of moments estimator of θ1, · · · , θp we set up and solve the equations: µ1 (θ1, · · · , θp) = m1 µ2 (θ1, · · · , θp) = m2 µp (θ1, · · · , θp) = mp The kth sample moment is defined to be: mk = 1 n n X i=1 xk i AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.52/149
  • 101. Example Let x1, · · · , xn denote a sample from the uniform distribution from θ1 to θ2. f (x; θ1, θ2) = 1 θ2−θ1 θ1 ≤ x ≤ θ2 0 elsewhere AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.53/149
  • 102. The joint density of x1, · · · , xn is: L (θ1, θ2) = f (x1, . . . , xn; θ1, θ2) = n Y i=1 f (xi; θ1, θ2) = ( 1 (θ2−θ1)n θ1 ≤ x1 ≤ θ2, . . . , θ1 ≤ xn ≤ θ2 0 elsewhere = ( 1 (θ2−θ1)n θ1 ≤ min i (xi) , max i (xi) ≤ θ2 0 elsewhere AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.54/149
  • 103. To find the maximum likelihood estimates of θ1 and θ2 we determine; ∂L (θ1, θ2) ∂θ1 and ∂L (θ1, θ2) ∂θ2 ∂L ∂θ1 = ( n (θ2 − θ1)−n+1 θ1 ≤ min i (xi) , max i (xi) ≤ θ2 0 elsewhere ∂L ∂θ2 = ( −n (θ2 − θ1)−n+1 θ1 ≤ min i (xi) , max i (xi) ≤ θ2 0 elsewhere AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.55/149
  • 104. Note ∂L(θ1,θ2) ∂θ1 and ∂L(θ1,θ2) ∂θ2 are never equal to zero. ∂L (θ1, θ2) ∂θ1 is always positive and ∂L (θ1, θ2) ∂θ2 is always negative. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.56/149
  • 105. hence the maximum likelihood estimates of θ1 and θ2 are θ̂1 = min i (xi) , θ̂2 = max i (xi) This compares with the Method of moments estimators: θ̃1 = x̄ − v u u t3 n n X i=1 (xi − x̄)2 ! θ̃2 = x̄ + v u u t3 n n X i=1 (xi − x̄)2 ! AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.57/149
  • 106. The sampling distribution: θ̂1 = min i (xi) , θ̂2 = max i (xi) solution We use the distribution function method: θ̂1 = min i (xi) = m, θ̂2 = max i (xi) = M G1 (u) = P [m ≤ u] = P h min i (xi) ≤ u i = 1 − P h min i (xi) ≥ u i = 1 − P [x1 ≥ u, · · · , xn ≥ u] AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.58/149
  • 107. = 1 − P [x1 ≥ u] · · · P [xn ≥ u] = 1 − θ2 − u θ2 − θ1 n Thus the density of m = θ̂1 = min i (xi) is g1 (u) = G′ 1 (u) = −n θ2 − u θ2 − θ1 n−1 −1 θ2 − θ1 = n (θ2 − u)n−1 (θ2 − θ1)n AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.59/149
  • 108. Is m = θ̂1 = min i (xi) unbiased? E [m] = E h θ̂1 i = E h min i (xi) i = θ2 Z θ1 ug1 (u) du = θ2 Z θ1 u n (θ2 − u)n−1 (θ2 − θ1)n du Put v = θ2u then the above integral becomes 0 Z θ2−θ1 (θ2 − v) nvn−1 (θ2 − θ1)n − dv AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.60/149
  • 109. = n (θ2 − θ1)n  θ2 θ2−θ1 Z 0 vn−1 dv − θ2−θ1 Z 0 vn dv   E θ̂1 = n (θ2 − θ1)n  θ2 θ2−θ1 Z 0 vn−1 dv − θ2−θ1 Z 0 vn dv   = n (θ2 − θ1)n θ2 (θ2 − θ1)n n − (θ2 − θ1)n+1 n + 1 # AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.61/149
  • 110. = θ2 − n (θ2 − θ1) n + 1 = n n + 1 θ1 + 1 n + 1 θ2 = θ1 + 1 n + 1 (θ2 − θ1) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.62/149
  • 111. Is M = θ̂2 = max i (xi) unbiased? E [M] = E h θ̂2 i = E h max i (xi) i = θ2 Z θ1 vg2 (v) dv = θ2 Z θ1 v n (v − θ1)n−1 (θ2 − θ1)n dv Put w = v − θ1 then the above integral becomes AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.63/149
  • 112. θ2−θ1 Z 0 (w + θ1) nwn−1 (θ2 − θ1)n dw = n (θ2 − θ1)n   θ2−θ1 Z 0 wn dw + θ1 θ2−θ1 Z 0 wn−1 dw   E θ̂2 = n (θ2 − θ1)n   θ2−θ1 Z 0 wn dw + θ1 θ2−θ1 Z 0 wn−1 dw   = n (θ2 − θ1)n (θ2 − θ1)n+1 n + 1 + θ1 (θ2 − θ1)n n # AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.64/149
  • 113. = n (θ2 − θ1) n + 1 + θ1 = 1 n + 1 θ1 + n n + 1 θ2 = θ2 − 1 n + 1 (θ2 − θ1) E θ̂2 − θ̂1 = E θ̂2 − E θ̂1 = θ2 − 1 n+1 (θ2 − θ1) − θ1 + 1 n+1 (θ2 − θ1) = 1 − 2 n+1 (θ2 − θ1) = n−1 n+1 (θ2 − θ1) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.65/149
  • 114. Hence E n + 1 n − 1 h θ̂2 − θ̂1 i = θ2 − θ1 We can use this to get rid of the bias of θ̂1 and θ̂2 T1 = θ̂1 − 1 n + 1 n + 1 n − 1 h θ̂2 − θ̂1 i = θ̂1 − 1 n − 1 h θ̂2 − θ̂1 i = m − M − m n − 1 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.66/149
  • 115. and T2 = θ̂2 + 1 n + 1 n + 1 n − 1 h θ̂2 − θ̂1 i = θ̂2 + 1 n − 1 h θ̂2 − θ̂1 i = M + M − m n − 1 Then T1 and T2 are unbiased estimators of θ1 and θ2. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.67/149
  • 116. Uniformly Better Let x = (x1, x2, · · · , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) and T∗ (x) be estimators of the parameter φ(θ). Then T(x) is said to be uniformly better than T∗ (x) if: MSET(x) (θ) 6 MSET∗(x) (θ) , θ ∈ Ω AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.68/149
  • 117. Uniformly Minimum Variance Unbiased Estimator Let x = (x1, x2, · · · , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then T∗ (x) is said to be the UMVU (Uniformly minimum variance unbiased) estimator of φ(θ) if: E[T∗ (x)] = φ(θ), θ ∈ Ω Var[T∗ (x)] 6 Var[T(x)] where E[T(x)] = φ(θ). AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.69/149
  • 118. Multiple Parameter Estimation âǫ(r) =      a1(r) − a1 a2(r) − a2 . . . aK(r) − aK      = ~ aǫ(r) − ~ a Cost function for MSE criterion: C (âǫ(r)) = K X i=1 â2 ǫi (r) = âT ǫ (r)âǫ(r) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.70/149
  • 119. Risk: RMSE = Z ∞ −∞ Z ∞ −∞ C (âǫ(r)) p (~ r,~ a) d~ rd~ a = Z ∞ −∞ p(~ r) d~ r | {z } 0 Z ∞ −∞ K X i=1 (âi(r) − ai)2 # p(a|r) d~ a ⇒ âMSEi (r) = Z ∞ −∞ ai p(~ a|~ r) d~ a or ˆ ~ aMSEi (r) = Z ∞ −∞ ~ a p(~ a|~ r) d~ a AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.71/149
  • 120. The above estimates hold true over linear transformation. ~ b = DL×K ~ a, E bT ǫ (r)bǫ(r) = E L X i=1 b2 ǫi (r) # ~ bMSE(~ r) = D~ aMSE(~ r) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.72/149
  • 121. For MAP we find ~ a that max {p(~ a|~ r)}: ∂ ln p(~ a|~ r) ∂ai
  • 122.
  • 123.
  • 124.
  • 125. a=aMAP(r) = 0, i = 1, 2, · · · , K ∇a [ln p(~ a|~ r)]|a=aMAP(r) = 0, ∇a =    ∂ ∂a1 . . . ∂ ∂aK    For ML: ∇a [ln p(~ r|~ a)]|a=aML(r) = 0 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.73/149
  • 126. Bias: E {~ aǫ(~ r)} = E {~ a(~ r) − ~ a} = ~ B(~ r) If ~ B(~ r) equal to zero then ~ a(~ r) is an unbiased estimate. For vector variables the quantity analogous to the variance is the covariance matrix. E n (~ aǫ − E (~ aǫ))T (~ aǫ − E (~ aǫ)) o = Λǫ E (~ aǫ) = ~ B(~ a) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.74/149
  • 127. Let’s consider any unbiased estimator of ~ a σ2 ǫi = var [ai(~ r) − ai] Jii Jii is the ith element in the K × K square matrix J−1 Jij = E ∂ ln p(~ r|~ a) ∂ai · ∂ ln p(~ r|~ a) ∂aj This is called Fisher’s information matrix (FIM). AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.75/149
  • 128. For any estimator we are interested in: 1. Bias: E {~ a(r)} 2. Error cross-covariance: E n (~ aǫ − E (~ aǫ))T (~ aǫ − E (~ aǫ)) o AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.76/149
  • 129. Example: Consider the random variable Y with E[Y ] = g(U1, U2, · · · , Uk) = p X i=1 βiφi(U1, U2, · · · , Uk) and var(Y ) = σ2 where βi, i = 1, 2, · · · , p are unknown parameters. where φi, i = 1, 2, · · · , p are known functions of the nonrandom variables Ui, i = 1, 2, · · · , k assume further that Y is normally distributed. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.77/149
  • 130. Thus the PDF of Y is: f(Y |β1, · · · , βp, σ2 ) = f(Y |β, σ2 ) = 1 √ 2πσ2 exp − 1 2s2 [Y − g(U1, U2, · · · , Uk)]2 = 1 √ 2πσ2 exp    − 1 2σ2 Y − p X i=1 βiφi (U1, U2, · · · , Uk) #2    = 1 √ 2πσ2 exp − 1 2σ2 [Y − β1X1 − β2X2 − · · · − βpXp]2 whereXi = φi (U1, U2, · · · , Uk) , i = 1, 2, · · · , p. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.78/149
  • 131. Now suppose that n independent observations of Y (y1, y2, · · · , yn) corresponding to n sets of values of      (u11, · · · , u1k) (u21, · · · , u2k) . . . (un1, · · · , unk)      Let xij = φj(ui1, · · · , uik), j = 1, 2, · · · , p, i = 1, 2, · · · , n. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.79/149
  • 132. Then the joint density of y = (y1, y2, · · · , yn) is: f(y1, · · · , yn|β1, · · · , βp, σ2 ) = f(y|β, σ2 ) = 1 (2πσ2)n/2 exp ( − 1 2σ2 n X i=1 [yi − g(u1i, u2i, ..., uki)]2 ) = 1 (2πσ2)n/2 exp    − 1 2σ2 n X i=1 yi − p X j=1 βjφj(u1i, u2i, ..., uki) #2    AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.80/149
  • 133. = 1 (2πσ2)n/2 exp    − 1 2σ2 n X i=1 yi − p X j=1 βjxij #2    = 1 (2πσ2)n/2 exp − 1 2σ2 [y − Xβ]′ [y − Xβ] = 1 (2πσ2)n/2 exp − 1 2σ2 [y′ y − 2y′ Xβ + β′ X′ Xβ] = 1 (2πσ2)n/2 exp − 1 2σ2 [β′ X′ Xβ] exp − 1 2σ2 [y′ y − 2y′ Xβ] AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.81/149
  • 134. = h (y) g β, σ2 exp − 1 2σ2 [y′ y − 2y′ Xβ] Thus f(y|β, σ2 ) is a member of the exponential family of distributions and S = (y′ y, X′ y) is a Minimal Complete set of Sufficient Statistics. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.82/149
  • 135. The Maximum Likelihood estimates of β and σ2 are the values β̂ and σ̂2 that maximize Ly σ2 , β = 1 (2πσ2)n/2 exp − 1 2σ2 [y − Xβ]′ [y − Xβ] or equivalently ly σ2 , β = ln Ly σ2 , β AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.83/149
  • 136. = − n 2 ln (2π) − n 2 ln σ2 − 1 2σ2 [y − Xβ]′ [y − Xβ] = − n 2 ln (2π) − n 2 ln σ2 − 1 2σ2 [y′ y − 2y′ Xβ + β′ X′ Xβ] ∂ly (σ2 , β) ∂β = 0 yields the system of linear equations (The Normal Equations) X′ Xβ̂ = X′ y AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.84/149
  • 137. while ∂ly (σ2 , β) ∂σ2 = 0 yields the equation: σ̂2 = 1 n h y − Xβ̂ i′ h y − Xβ̂ i If [X′ X]−1 exists then the normal equations have solution: β̂ = (X′ X) −1 X′ y AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.85/149
  • 138. and σ̂2 = 1 n h y − Xβ̂ i′ h y − Xβ̂ i = 1 n h y − X (X′ X) −1 X′ y i′ h y − X (X′ X) −1 X′ y i = 1 n h y′ y − y′ X (X′ X) −1 X′ y i AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.86/149
  • 139. Almost all problems in statistics can be formulated as a problem of making a decision . That is given some data observed from some phenomena a decision will have to be made about the phenomena. Decisions are generally broken into two types : Estimation decisions Hypothesis Testing decisions. Probability Theory plays a very important role in these de- cisions. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.87/149
  • 140. Besides the Normal distribution the following distributions play an important role in estimation and hypothesis testing: Chi-squared distribution with ν degrees of freedom f(x) = 1 Γ(ν/2)2ν/2 x(ν−2)/2 e−x/2 , x 0 Comment: If z1, z2, · · · , zν are independent random variables each having a standard normal distribution then U = Pν k=1 z2 k has a chi-squared distribution with ν degrees of freedom. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.88/149
  • 141. F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator f(x) = Kx(ν1−2)/2 1 + ν1 ν2 x −(ν1+ν2)/2 , x 0, K = Γ(ν1+ν2 2 ) ν1 ν2 (ν1/2) Γ(ν1/2)Γ(ν2/2) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.89/149
  • 142. Comment: If U1 and U2 are independent random variables each having Chi-squared distribution with ν1 and ν2 degrees of freedom respectively then F = U1 U2 ν1 ν2 has a F distribution with ν1 degrees of freedom in the nu- merator and ν2 degrees of freedom in the denominator. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.90/149
  • 143. The t distribution with ν degrees of freedom f(x) = K 1 + x2 ν −(ν+1)/2 , K = Γ((ν + 1)/2) Γ(ν/2) √ πν Comment: If Z and U are independent random variables, and Z has a standard Normal distribution while U has a Chi-squared distribution with ν degrees of freedom then t = Z p U/ν has a t distribution with ν degrees of freedom. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.91/149
  • 144. Goal: Extract useful information out of messy data Strategy: Formulate probabilistic model of data y, which depends on underlying parameter(s) θ Terminology depends on parameter space: Detection (simple hypothesis testing): θ ∈ {0, 1}, 0 = target absent, 1 = target present Classification (multihypothesis testing): θ ∈ {0, 1, · · · , M}, , i.e.θ ∈ {DC-9, 747, F-15, MiG-31} AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.92/149
  • 145. Termonology Suppose θ = (θ1, θ2) If we are only interested in θ1, then θ2 are called nuisance parameters If θ1 = {0, 1}, and θ2 are nuisance parameters, we call it a composite hypothesis testing problem AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.93/149
  • 146. Ex: Positron Emission Tomography Simple, traditional linear DSP-based approach Filtered Back Projection (FBP) Advanced, estimation-theoretic approach Model Poisson “likelihood” of collected data Markov Random Field (MRF) “prior”on image Find estimate using expectation-maximization algorithm (or similar technique) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.94/149
  • 147. Tasks of Statistical Signal Processing: Estimation, Detection, . . . 1. Create statistical model for measured data 2. Find fundamental limitations on our ability to perform inference on the data (a) Cramér-Rao bounds, Chernov bounds, etc. 3. Develop an optimal (or suboptimal) estimator 4. Asymptotic analysis (i.e., assume we have lots and lots of data) of estimator performance to see if it approaches bounds derived in (2) 5. Do simulations and experiments comparing algorithm performance to lower bounds and competing algorithms AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.95/149
  • 148. A Bayesian analysis treats θ as a random variable with a “prior” density p(θ) Data generating machinery is specified by a conditional density p(y|θ) Gives the “likelihood” that the data y resulted from the parameters ? Inference usually revolves around the posterior density, derived from Bayes’ theorem: p(θ|y) = p(y|θ)p(θ) p(y) = p(y|θ)p(θ) R p(y|θ)p(θ)dθ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.96/149
  • 149. Classical detection problem: Design of optimum procedures for deciding between possible statistical situations given a random observation: H0 : Yk ∼ P ∈ P0, k = 1, · · · , n H1 : Yk ∼ P ∈ P1, k = 1, · · · , n The model has the following components: Parameter Space (for parametric detection problems) Probabilistic Mapping from Parameter Space to Observation Space Observation Space AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.97/149
  • 150. Parameter Space: Completely characterizes the output given the mapping. Each hypothesis corresponds to a point in the parameter space. This mapping is one-to-one. Probabilistic Mapping from Parameter Space to Observation Space: The probability law that governs the effect of a parameter on the observation. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.98/149
  • 151. Example: Yk =    Nk Nk Nk , p = 1/2, , p = 1/4, , p = 1/4, Nk ∼ N(0, σ2 ) Nk ∼ N(−1, σ2 ) Nk ∼ N(1, σ2 ) µ = −1 0 1 T | {z } parameter space p = 1/2, 1/4, 1/4 is the probabilistic mapping. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.99/149
  • 152. Observation Space: Finite dimensional, i.e. k = 1, 2, · · · , n where n is finite. Detection Rule Mapping of the observation space into its parameters in the parameter space is called a detection rule. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.100/149
  • 153. Classical estimation problem: Interested in not making a choice among several discrete situations, but rather making a choice among a continuum of possible states. Think of a family of distributions on the observation space, indexed by a set of parameters. Given the observation, determine as accurately as possible the actual value of the parameter. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.101/149
  • 154. Example: Yk = Nk, Nk(µ, σ2 ) In this example, given the observations, parameter µ is being estimated. Its value is not chosen among a set of discrete values, but rather is estimated as accurately as possible. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.102/149
  • 155. Estimation problem also has the same components as the detection problem. Parameter Space Probabilistic Mapping from Parameter Space to Observation Space Observation Space Estimation Rule Detection problem can be thought of as a special case of the estimation problem. There are a variety of estimation procedures differing basically in the amount of prior information about the parameter and in the performance criteria applied. Estimation theory is less structured than detection theory. Detection is science, estimation is art(I have seen it in a book “Array signal processing” by Johnson, Dudgeon). AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.103/149
  • 156. Based on the a priori information about the parameter, there are two basic approaches to parameter estimation: Bayesian Parameter Estimation Nonrandom Parameter Estimation Bayesian Parameter Estimation: Parameter is assumed to be a random quantity related statistically to the observation. Nonrandom Parameter Estimation: Parameter is a constant without any probabilistic structure. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.104/149
  • 157. Estimation theory relies on jargon to characterize the properties of estimators. The following definitions are used: The set of n observations are represented by the n-dimensional vector y ∈ Γ (observation space). The values of the parameters are denoted by the vector θ ∈ Λ (parameter space). The estimate of this parameter vector is denoted by : Γ → Λ. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.105/149
  • 158. Definitions (continued): The estimation error ε(y) (ε in short) is defined by the difference between the estimate and the actual parameter: ε(y) = θ̂(y) − θ The function C(a, θ) is the cost of estimating a true value of θ as a. Given such a cost function C, the Bayes risk (average risk) of the estimator is defined by the following: r(θ̂) = E n E n C[θ̂(Y), Θ]
  • 159.
  • 160.
  • 162. Example Suppose we would like to minimize the Bayes risk defined by r(θ̂) = E n E n C[θ̂(Y), Θ]
  • 163.
  • 164.
  • 165. y oo for a given cost function C. By inspection, one can see that the Bayes estimate of θ can be found (if it exists) by minimizing, for each y ∈ Γ, the posterior cost given Y = y: E n C[θ̂(Y), Θ]
  • 166.
  • 167.
  • 169. An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter E n θ̂|θ o = θ Otherwise the estimate is said to be biased. The bias b(θ) is usually considered to be additive, so that: b(θ) = E n θ̂|θ o − θ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.108/149
  • 170. An estimate is said to be asymptotically unbiased if the bias tends to zero as the number of observations tend to infinity. An estimate is said to be consistent if the mean-squared estimation error tends to zero as the number of observations becomes large. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.109/149
  • 171. An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramer-Rao bound. If an efficient estimate exists, it is optimum in the mean-squared sense: No other estimate has a smaller mean-squared error. Following shorthand notations will also be used for brevity: pθ(y) = py|θ(y|θ) = Probability density(y given θ) Eθ{y} = E{y|θ} AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.110/149
  • 172. Following definitions and theorems will be useful later in the presentation: Definition: Sufficiency Suppose that Λ is an arbitrary set. A function T : Γ → Λ is said to be a sufficient statistic for the parameter set θ ∈ Λ if the distribution of y conditioned on T(y) does not depend on θ for θ ∈ Λ. If knowing T(y) removes any further dependence on θ of the distribution of y, one can conclude that T(y) contains all the information in y that is useful for estimating θ. Hence, it is sufficient. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.111/149
  • 173. Definition: Minimal Sufficiency A function T on Γ is said to be minimal sufficient for the parameter set θ ∈ Λ if it is a function of every other sufficient statistic for θ. A minimal sufficient statistic represents the furthest reduction in the observation without destroying information about θ. Minimal sufficient statistic does not necessarily exist for every problem. Even if it exists, it is usually very difficult to identify it. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.112/149
  • 174. Let {x1, x2, · · · , xn} denote a set of observations with joint density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) then the set of statis- tics: S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn) called a set of sufficient statistics if the conditional distri- bution of {x1, x2, · · · , xn} given S1, S2, · · · , Sq is functionally independent of the parameters {θ1, θ2, · · · , θp}. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.113/149
  • 175. Example Suppose that we observe a success-failure experiment (Bernoulli trial) n = 3 times. Let π denote the probability of success. Let x1, x2, x3 denote the observations where xi = 1 if the ith trial is a success 0 if the ith trial is a failure Let S = x1 + x2 + x3 = the total number of successes. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.114/149
  • 176. Joint distribution of x1, x2, x3 ; sampling distribution of S, conditional distribution of x1, x2, x3 given S. x1, x2, x3 f(x1, x2, x3; π) S g(S; π) x1, x2, x3|S 0, 0, 0 (1 − π)3 0 (1 − π)3 1 1, 0, 0 π(1 − π)2 1/3 0, 1, 0 π(1 − π)2 1 3π(1 − π)2 1/3 0, 0, 1 π(1 − π)2 1/3 1, 1, 0 π2 (1 − π) 1/3 1, 0, 1 π2 (1 − π) 2 3π2 (1 − π) 1/3 0, 1, 1 π2 (1 − π) 1/3 1, 1, 1 π3 3 π3 1 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.115/149
  • 177. The data x1, x2, x3 can be thought to be generated in two ways 1. Generate the data x1, x2, x3 directly from the joint density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) or 2. Generate the sufficient statistics S1, S2, · · · , Sq from their joint sampling distribution then generate the observations x1, x2, x3 from the conditional distribution of x1, x2, x3 given S1, S2, · · · , Sq. Since the second step is independent of the parameters θ1, θ2, · · · , θp all of the information about the parameters will be determined by the results of the first step. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.116/149
  • 178. Principle of sufficiency Any decision about the parameters θ1, θ2, · · · , θp should be made using the values of the sufficient statistics S1, S2, · · · , Sq and not otherwise on the data x1, x2, x3. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.117/149
  • 179. The Likelihood Principle Any decision about the parameters θ1, θ2, · · · , θp should be made using the values of the Likelihood function L(θ1, θ2, · · · , θp) and not otherwise on the data x1, x2, x3. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.118/149
  • 180. x1, x2, x3 f(x1, x2, x3; π) S g(S; π) L(π) 0, 0, 0 (1 − π)3 0 (1 − π)3 (1 − π)3 1, 0, 0 π(1 − π)2 0, 1, 0 π(1 − π)2 1 3π(1 − π)2 π(1 − π)2 0, 0, 1 π(1 − π)2 1, 1, 0 π2 (1 − π) 1, 0, 1 π2 (1 − π) 2 3π2 (1 − π) π2 (1 − π) 0, 1, 1 π2 (1 − π) 1, 1, 1 π3 3 π3 π3 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.119/149
  • 182. Let x1, x2, x3 denote a set of observations with joint density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp). Then S1 = S1(x1, · · · , xn), . . . , Sq = Sq(x1, · · · , xn) are a set of sufficient statistics if the joint density satisfies: f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) = g(S1, S2, · · · , Sq; θ1, θ2, · · · , θp)h(x1, · · · , xn) i. e.dependence on the parameters factors out with the Sufficient statistics. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.121/149
  • 183. Example Let x1, x2, · · · , xn denote a sample from the normal distribution with mean µ and variance σ2 . The density of xi is: f (xi) = 1 √ 2πσ e− 1 2σ2 (xi−µ)2 And the joint density of (x1, x2, · · · , xn) is: f x1, . . . , xn; µ, σ2 = n Y i=1 1 √ 2πσ e− 1 2σ2 (xi−µ)2 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.122/149
  • 184. f x1, . . . , xn; µ, σ2 = 1 (2πσ2) n/ 2 e − 1 2σ2 n P i=1 (xi−µ)2 = 1 (2πσ2) n/ 2 e − 1 2σ2 n P i=1 x2 i −2µ n P i=1 xi+nµ2 n X i=1 x2 i = n X i=1 (xi − x̄)2 + nx̄2 = (n − 1) s2 + nx̄2 n X i=1 xi = nx̄ AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.123/149
  • 185. f x1, . . . , xn; µ, σ2 = 1 (2πσ2) n/ 2 e− 1 2σ2 ((n−1)s2+nx̄2−2nµx̄+nµ2 ) = h (x1, . . . , xn) g x̄, s; µ, σ2 where g x̄, s; µ, σ2 = 1 (2πσ2) n/ 2 e− 1 2σ2 ((n−1)s2+nx̄2−2nµx̄+nµ2 ) h (x1, . . . , xn) = 1 Thus, x̄ and s are sufficient statistics. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.124/149
  • 186. The Factorization Theorem: Suppose that the parameter set θ ∈ Λ has a corresponding families of densities pθ. A statistic T is sufficient for θ iff there are functions gθ and h such that pθ = gθ[T(y)]h(y) ∀y ∈ Γ and θ ∈ Λ. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.125/149
  • 187. Example Consider the hypothesis-testing problem Λ = {0, 1} with densities p0 and p1. Noting that pθ(y) = ( p0(y) ifθ = 0 p1(y) p0(y) p0(y) ifθ = 1, the factorization pθ = gθ[T(y)]h(y) is possible with h(y) = p0(y) T(y) = p1(y)/p0(y) ≡ L(y) gθ(y) = 1 ifθ = 0 t ifθ = 1. Thus the likelihood ratio L is a sufficient statistic for the bi- nary hypothesis-testing problem. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.126/149
  • 188. Rao-Blackwell Theorem Suppose that g(y) is an unbiased estimate of g(θ) and that T is sufficient for θ. Define g̃[T(y)] = Eθ{ĝ(Y)|T(Y) = T(y)} Then g̃[T(y)] is also an unbiased estimate of g(θ). Furthermore Varθ(g̃[T(Y)]) ≤ Varθ(ĝ(Y)), with equality iff Pθ(ĝ(Y) = g̃[T(Y)]) = 1. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.127/149
  • 189. Let (x1, x2, · · · , xn) denote a set of observations with joint density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp). Let S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn) denote a set of sufficient statistics. Let t(x1, x2, · · · , xn) be any unbiased estimator of the parameter φ = g(θ1, θ2, · · · , θp) then there exists an unbiased estimator, T(S1, · · · , Sq) of φ such that Var(T) 6 Var(t) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.128/149
  • 190. Proof Let T (S1, . . . , Sk) = E (t (x1, . . . , xn) |S1, . . . , Sk ) is the Conditional Expectation of tgivenS1, · · · , Sk T (S1, . . . , Sk) = Z . . . Z t (x1, . . . , xn)g (x1, . . . , xn |S1, . . . , Sk ) dx1 . . . dxn Now t is an unbiased estimator of φ = g(θ1, · · · , θp) Hence E [t] = φ Also AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.129/149
  • 191. E [t] = ES1,...,Sk [E [t |S1, . . . , Sk ]] = ES1,...,Sk [T (S1, . . . , Sk)] = φ Thus T is also an unbiased estimator of φ = g(θ1, θ2, · · · , θp) Finally V ar [t] = V arS1,...,Sk [E [t |S1, . . . , Sk ]] +ES1,...,Sk [V ar [t |S1, . . . , Sk ]] ≥ V arS1,...,Sk [T (S1, . . . , Sk)] Since ES1,...,Sk [V ar [t |S1, . . . , Sk ]] ≥ 0 QED. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.130/149
  • 192. The Rao-Blackwell theorem states that if you have any unbi- ased estimator t of a parameter (that depends arbitrarily on the observations) then you can find a better unbiased es- timator (smaller variance) that is a function of the solely of sufficient statistics. Thus the best unbiased estimator (min- imum variance) has to be a function of the sufficient statis- tics AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.131/149
  • 193. Thus the search for the UMVU estimator (uniformly min- imum variance unbiased estimator) is amongst functions that depend solely on the sufficient statistics. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.132/149
  • 194. Example Suppose that Γ = {0, 1, · · · , n}, Λ = {0, 1}, and pθ(y) = n! y!(n−y)! θy (1 − θ)n−y , y = 0, . . . , n, 0 θ 1 For any function f on Γ, we have Eθ{f(Y )} = n P y=0 n! y!(n−y)! f(y)θy (1 − θ)n−y = (1 − θ)n n P y=0 ayxy The condition Eθf(Y ) = 0 ∀θ ∈ Γ implies that n P y=0 ayxy = 0, ∀x 0. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.133/149
  • 195. However, an nth order polynomial has at most n zeros un- less all of its coefficients are zero. Hence, θ ∈ Γ is com- plete. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.134/149
  • 196. Let (x1, x2, · · · , xn) denote a set of observations with joint density f(x1, x2, · · · , xn; θ1, θ2, · · · , θp) then the set of statistics: Let S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn) denote a set of sufficient statistics Then S1, · · · , Sq are called a set of complete sufficient statistics if whenever E(h(S1, · · · , Sq)) = 0 ⇒ h(S1, · · · , Sq) = 0 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.135/149
  • 197. If S1 = S1(x1, x2, · · · , xn), . . . , Sq = Sq(x1, x2, · · · , xn) denote a set of sufficient statistics The S1, · · · , Sq are called a set of complete sufficient statistics if whenever E(h(S1, · · · , Sq)) = 0 ⇒ h(S1, · · · , Sq) = 0 i.e., Z · · · Z h (S1, . . . , Sk)g (S1, . . . , Sk |θ1, . . . , θp ) dS1 . . . dSk = 0 implies h (S1, . . . , Sk) = 0 Completeness is sometimes difficult to prove. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.136/149
  • 198. Example Suppose that we observe a success-failure experiment (Bernoulli trial) n = 3 times. Let π denote the probability of success. Let x1, x2, x3 denote the observations where xi = 1 if the ith trial is a success 0 if the ith trial is a failure Let S = x1 + x2 + x3 = the total number of successes. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.137/149
  • 199. Joint distribution of x1, x2, x3 ; sampling distribution of S, conditional distribution of x1, x2, x3 given S. x1, x2, x3 f(x1, x2, x3; π) S g(S; π) x1, x2, x3|S 0, 0, 0 (1 − π)3 0 (1 − π)3 1 1, 0, 0 π(1 − π)2 1/3 0, 1, 0 π(1 − π)2 1 3π(1 − π)2 1/3 0, 0, 1 π(1 − π)2 1/3 1, 1, 0 π2 (1 − π) 1/3 1, 0, 1 π2 (1 − π) 2 3π2 (1 − π) 1/3 0, 1, 1 π2 (1 − π) 1/3 1, 1, 1 π3 3 π3 1 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.138/149
  • 200. S is a sufficient statistic. Is it a complete sufficient statistic? sampling distribution of S S g(S; π) 0 (1 − π)3 1 3π(1 − π)2 2 3π2 (1 − π) 3 π3 E [h (S)] = h (0) (1 − π)3 + h (1) 3π (1 − π)2 + h (2) 3π2 (1 − π) + h (3) π3 = = h (0) + 3 [h (1) − h (0)] π + 3 [h (0) − 2h (1) + h (2)] π2 + [h (3) − 3h (2) + 3h (1) + h (0)] π3 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.139/149
  • 201. E [h (S)] = 0, for all values of π i.e., p (π) = h (0) + 3 [h (1) − h (0)] π + 3 [h (0) − 2h (1) + h (2)] π2 + [h (3) − 3h (2) + 3h (1) + h (0)] π3 = 0, ⇒ h (0) = 0, 3 [h (1) − h (0)] = 0, h (0) − 2h (1) + h (2) = 0, h (3) − 3h (2) + 3h (1) + h (0) = 0 Thus , h (0) = h (1) = h (2) = h (3) = 0 AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.140/149
  • 202. S is a complete sufficient statistic. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.141/149
  • 203. If S1, · · · , Sq are called a set of complete sufficient statistics and T1 = h1(S1, · · · , Sq) and T2 = h2(S1, · · · , Sq) Are unbiased estimators of φ. Then T1 = T2 E(T1) = E(T2) = φ hence E(T1 − T2) = 0. and h2(S1, · · · , Sq) − h1(S1, · · · , Sq) and T2 = T1 Thus there is only one unbiased estimator of φ that is a function of complete sufficient statistics AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.142/149
  • 204. Lehmann-Scheffe Theorem Let x1, x2, · · · , xn denote a set of observations with joint density f(x1, · · · , xn; θ1, · · · , θq). Let S1 = S1(x1, · · · , xn),. . . , Sq = Sq(x1, · · · , xn) denote a set of complete sufficient statistics. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.143/149
  • 205. Let T(S1, · · · , Sq) be an unbiased estimator of the parameter φ = g(θ1, · · · , θq) then T is the uniform minimum variance unbiased (UMVU) estimator of φ. That is if t(x1, x2, · · · , xn) is an unbiased estimator of φ then Var(T) 6 Var(t) AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.144/149
  • 206. Example We observe a success-failure experiment (Bernoulli trial) n = 3 times. the probability of success is π. Let x1, x2, x3 denote the observations where xi = 1 if the ith trial is a success 0 if the ith trial is a failure Let S = x1 + x2 + x3 = the total number of successes. E 1 3 S = E 1 3 (x1 + x2 + x3) = 1 3 (E [x1] + E [x2] + E [x3]) = 1 3 (π + π + π) = π AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.145/149
  • 207. S 3 is an unbiased estimator of π. S 3 is the uniform minimum variance unbiased (UMVU) estimator of π. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.146/149
  • 208. The strategy to find the UMVU estimator 1. Find a set of complete sufficient statistics S1, S2, · · · , Sk. 2. To find an unbiased estimator that depends only on the set of complete sufficient statistics T(S1, S2, · · · , Sk). 3. Lehman Scheffe theorem. Maximum Likelihood estimators are functions of a set of complete sufficient statistics S1, S2, · · · , Sk. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.147/149
  • 209. Factorization criterion L(θ1, · · · , θp) = f(x1, · · · , xn; θ1, · · · , θp) = g(S1, · · · , Sk; θ1, · · · , θp)h(x1, · · · , xn) θ1, · · · , θp will maximize L(θ1, · · · , θp) if θ1, · · · , θp will maximize g(S1, · · · , Sk; θ1, · · · , θp) These will depend on S1, · · · , Sk. AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.148/149
  • 210. 1. Finding Maximum Likelihood estimators. 2. Checking if there is a set of complete sufficient statistics -S1, · · · , Sk. 3. Checking if the maximum Likelihood estimators they are unbiased. 4. Making adjustments to these estimators if they are not unbiased. This is the standard way of finding UMVU estimators AKU-EE/Detection-Estimation/HA/ 1st Semester, 86 – p.149/149
  • 211. C a r e f u l Being Careful • Suppose Y has the density p(y; µ) = n Y i=1 1 aσc exp − yi − µ aσ 4 # where a ≈ 1.4464 and c is a constant which makes the density normalize to 1. l(µ) = − n X i=1 yi − µ aσ 4 • Try usual “set derivative equal to zero” dl(µ) dµ = 4 n X i=1 yi − µ aσ 3 = 0 • This will be a 3rd order polynomial in µ, which will in general have 3 solutions • Would have to just compute the loglikelihood for each solution and see which one gives the biggest result • Suppose Y = Y1, · · · , Yn, is i.i.d. with a Cauchy-like density (but not really 1
  • 212. C a r e f u l Cauchy!) (Maximum Penalized Likelihood Estimation) p(y; c, γ) = n Y i=1 γ 2 (γ + |yi − c|) 2 • Loglikelihood: l(c, γ) = n ln γ − 2 n X i=1 ln (γ + |yi − c|) l(y; c, γ) = n ln γ − 2 X i∈{yi≥c} ln[γ + (yi − c)] −2 X i∈{yic} ln[γ + (c − yi)] • Suppose γ is given and we want to estimate c taking derivative: dl(y; c) dc = −2 X i∈{yi≥c} −1 γ + yi − c −2 X i∈{yi≥c} 1 γ + c − yi = 2 n X i=1 sign(yi − c) γ + |yi − c| • We would be tempted to say that the ML 2
  • 213. C a r e f u l estimator of c is just the solution of dl(y; c) dc = 2 n X i=1 sign(yi − c) γ + |yi − c| = 0 • Well. . . theres actually more than one solution to this, so how about picking the solution which gives the greatest likelihood? • but that is really a trap!!! • Let’s check out the second derivative: d2 l(y; c) dc2 = 2 n X i=1 [sign(yi − c)]2 (γ + |yi − c|) 2 0 • Those critical points were really local minima!!! • Where is the real maxima? • Notice that |x| is not differentiable at x = 0, hence p(y; c, γ) = n ln γ − 2 n X i=1 ln (γ + |yi − c|) 3
  • 214. C a r e f u l is not differentiable at c=any of the data points! • To get the real ML estimate, try each yi for c, and see which one gives the biggest likelihood • “It turns out” the ML estimate is one of the median points 4
  • 215. E M Expectation-Maximization Algorithm • The EM procedure is a way of making iterative algorithms for maximizing loglikelihoods or Bayesian logposteriors when no closed-form solution is available • There’s a more powerful and more general EM formulation by Csiszar based on Information Theory EM Algorithm • Incomplete data Y, that we actually measure • Goal: maximize the incomplete data loglikelihood(function of specific collected data) lid(θ) = log pY (y; θ) • Complete data Z, a hypothetical data set • Tool: complete data loglikelihood (function of complete data as a random variable) lcd(θ) = log pZ(z; θ)|z=Z = log pZ(Z; θ) • Complete data space must be “larger” and 1
  • 216. E M determine the incomplete data, i.e. there must be a many-to-one mapping y = h(z) The EM Recipe • Step 1: Decide on a complete data space • Step 2: The expectation step Q(θ; θ̂old ) = E[lcd(θ)|Y = y; θ̂old ] • Step 3: The maximization step θ̂new = arg max θ Q(θ; θ̂old ) • Start with a feasible initial guess θ̂old then iterate steps 2 and 3 (which can usually be combined) What is that Expectation? E[lcd|Y = y; θ̂old ] = Z pZ|Y (z|y; θ̂old ) log pZ(z; θ)dz pZ|Y (z|y; θ̂old ) =      pZ (z;θ̂old ) R Z(y) pZ (z̃;θ̂old)dz̃ z ∈ Z(y) 0 z / ∈ Z(y) 2
  • 217. E M Z Z(y) pZ(z; θ̂old )dz = pY (y; θ̂old ) Aspects of EM Algorithms • Incomplete data loglikelihood is guaranteed to increase with each EM iteration • Must be careful; might converge to a local maxima which depends on the starting point • Often, the estimates naturally stay in the feasible space (i.e., nonnegativity constraints • In many problems, a candidate complete data space naturally suggests itself Ex: Poisson Signal in Additive Poisson Noise Y = S + N S ∼ Poisson(θ), N ∼ Poisson(λN ), • Incomplete-data loglikelihood is lid(θ) = −(θ + λN ) + y ln(θ + λN ), 3
  • 218. E M • ML estimator can be found in closed form θ̂(y) = max(0, y − λN ) Choose the Complete Data • Can often choose the complete data in several different ways; try to choose to make remaining steps easy • Different choices lead to different algorithms; some will converge “faster” than others. • Here, take complete data to be Z = (S, N); suppose we could magically measure the signal and noise counts separately! • Complete data loglikelihood is: lcd(θ) = [−θ + S ln(θ)] + [−λN + N ln(λN )] The E-Step Q(θ; θ̂old ) = E[lcd(θ)|Y = y; θ̂old ] = E[−(θ + λN ) + S ln(θ) + N ln(λN )|y; θ̂old ] = −(θ + λN ) + E[S|y; θ̂old ] ln(θ) 4
  • 219. E M +E[N|y; θ̂old ] ln(λN ) • Often convenient to leave explicit computation of conditional expectation until the last minute • As with loglikelihoods, we sometimes drop terms which are constants w.r.t. θ The M-Step θ̂new = arg max θ≥0 Q(θ; θ̂old ) • Take derivative as usual d dθ Q(θ; θ̂old ) = −1 + E[S|y; θ̂old ] θ • Setting equal to zero yields θ̂new = E[S|y; θ̂old ] • Now we just have to compute that expectation. (That’s usually the hardest part.) That Conditional Expectation E[S|y; θ̂old ] = Z spS(s|y; θ̂old )ds 5
  • 220. E M • Let’s look at the conditional density: pS(s|y; θ̂old ) = pY |S(y|s; θ̂old )pS(s; θ̂old ) pY (y; θ̂old) = exp[−λN ]λy−s N (y−s)! I(y ≥ s)exp[−θ̂old ](θ̂old )s s! exp[−(θ̂old + λN )](θ̂old + λN )y/y! = y! s!(y − s)! λy−s N (θ̂old + λN )y−s (θ̂old )s (θ̂old + λN )s I(s ≤ y) • We observe that the conditional density is just binomial. For 0 6 s 6 y, pS (s|y; θ̂ old ) =   y s     θ̂old θ̂old + λN   s λN θ̂old + λN !y−s E[S|y; θ̂old ] = y θ̂old θ̂old + λN • So this particular EM algorithm is: θ̂new = E[S|y; θ̂old ] = y θ̂old θ̂old + λN • Let’s see if our analytic formula for the maximizer, θ̂ = max(0, y − λN ), is a fixed point for the EM iteration 6
  • 221. E M • For y λN , θ̂new = y θ̂old θ̂old + λN y − λN = y y − λN y − λN + λN y − λN = y − λN • For y λN , immediately get 0=0 • So everything is good Back in Bayesian land • EM also good for MAP estimation; just add the logprior to the Q-function θ̂new = arg max θ≥0 QP (θ; θ̂old ) QP (θ; θ̂old ) = E[lcd|Y = y; θ̂old ] + log p(θ) • Consider previous example, with an exponential prior with mean 1/a QP (θ; θ̂old ) = −θ + E[S|y; θ̂old ] ln(θ) − aθ QP (θ; θ̂old ) dθ = −1 + E[S|y; θ̂old ] θ − a 7
  • 222. E M θ̂old = E[S|y; θ̂old ] 1 + a = θ̂old θ̂old + λN ! y 1 − a Expectation-Maximization Algorithm (Theory) Convergence of the EM Algorithm • We’d like to prove that the likelihood goes up with each iteration: Lid(θ̂new ) ≥ Lid(θ̂old ) • Recall from last pages: pZ|Y (z|y; θ) = pZ(z; θ) pY (y; θ) z ∈ Z(y) = {z : h(z) = y} ln pY (y; θ) = ln pZ(z; θ) − ln pZ|Y (z|y; θ) • Multiply both sides by the same thing and integrate with respect to z : R pZ|Y (z|y; θ̂old ) ln pY (y; θ)dz = R pZ|Y (z|y; θ̂old ) ln pZ(z; θ)dz − R pZ|Y (z|y; θ̂old ) ln pZ|Y (z|y; θ)dz 8
  • 223. E M • simplifies to: Lid(θ) = Q(θ; θ̂old )− Z pZ|Y (z|y; θ̂old ) ln pZ|Y (z|y; θ)dz z • Evaluate z at θ = θ̂new and θ = θ̂old Lid(θ̂new ) = Q(θ̂new ; θ̂old )− Z pZ|Y (z|y; θ̂old ) ln pZ|Y (z|y; θ̂new )dz, ♠ Lid(θ̂old ) = Q(θ̂old ; θ̂old )− Z pZ|Y (z|y; θ̂old ) ln pZ|Y (z|y; θ̂old )dz, ♣ • Subtract ♣ from ♠ Lid(θ̂new ) − Lid(θ̂old ) = Q(θ̂new ; θ̂old ) − Q(θ̂old ; θ̂old ) − R pZ|Y (z|y; θ̂old ) ln pZ|Y (z|y;θ̂new ) pZ|Y (z|y;θ̂old) dz Lid(θ̂new ) − Lid(θ̂old ) = Q(θ̂new ; θ̂old ) − Q(θ̂old ; θ̂old ) − R pZ|Y (z|y; θ̂old ) ln pZ|Y (z|y;θ̂new ) pZ|Y (z|y;θ̂old) dz • A really helpful inequality: ln x 6 x − 1 Lid(θ̂new ) − Lid(θ̂old ) ≥ Q(θ̂new ; θ̂old ) − Q(θ̂old ; θ̂old ) − Z pZ|Y (z|y; θ̂old ) pZ|Y (z|y; θ̂new ) pZ|Y (z|y; θ̂old ) − 1 # dz | {z } focus on this term 9
  • 224. E M Z pZ|Y (z|y; θ̂old ) pZ|Y (z|y; θ̂new ) pZ|Y (z|y; θ̂old) − 1 # dz = Z pZ|Y (z|y; θ̂new ) − pZ|Y (z|y; θ̂old )dz = Z pZ|Y (z|y; θ̂new )dz− Z pZ|Y (z|y; θ̂old )dz = 0 • Now we have: Lid(θ̂new )−Lid(θ̂old ) ≥ Q(θ̂new ; θ̂old )−Q(θ̂old ; θ̂old ) • Recall the definition of the M-step: θ̂new = arg max θ Q(θ; θ̂old ) • So, by definition Q(θ̂new ; θ̂old ) ≥ Q(θ̂old ; θ̂old ) ⇒ Lid(θ̂new ) ≥ Lid(θ̂old ) • Notice we showed that the likelihood was nondecreasing; that doesn’t automatically 10
  • 225. E M imply that the parameter estimates converge • Parameter estimate could slide along a contour of constant loglikelihood • Can prove some things about parameter convergence in special cases Ex: EM Algorithm for Imaging from Poisson Data (i.e. Emission Tomography) Generalized EM Algorithms • Recall this line: Lid(θ̂new )−Lid(θ̂old ) ≥ Q(θ̂new ; θ̂old )−Q(θ̂old ; θ̂old ) • What if the M-step is too hard? Try a “generalized” EM algorithm: θ̂new = some easy to compute θ such thatQ(θ; θ̂old ) ≥ Q(θ̂old ; θ̂old ) • Problem: EM algorithms tend to be slow • Observation: “Bigger” complete data spaces result in slower algorithms than “smaller” complete data spaces 11
  • 226. E M • SAGE (Space-Alternating Generalized Expectation-Maximization) – Split big complete data space into several smaller “hidden” data spaces – Designed to yield faster convergence • Generalization of “ordered subsets” EM algorithm 12
  • 227. W i e n e r Wiener Filtering • Context: Bayesian linear MMSE estimation for random sequences • Parameter sequence {Θk, k ∈ Z} • Data sequence {Yk, k ∈ I ⊂ Z} • Goal: Estimate {θk} as a linear function of the observations: θ̂k(y) = X j∈I h(k, j)yj • Find h to minimize mean square error • By the orthogonality principle E[(θ̂k(Y ) − Θk)Y ∗ i ] = 0fori ∈ I E[( X j∈I h(k, j)Yj − Θk)Y ∗ i ] = 0 X j∈I h(k, j)E[YjY ∗ i ] = E[ΘkY ∗ i ] X j∈I h(k, j)rY (j, i) = rΘY (k, i) | {z } This is Wiener-Hopf equation 1
  • 228. W i e n e r • If processes are stationary, we can write X j∈I h(k, j)rY (j − i) = rΘY (k − i) • If I = Z it turns out the filter is LTI: X j∈I h(k − j)rY (j − i) = rΘY (k − i) • consider i = 0 X j∈I h(k − j)rY (j) = rΘY (k) • Can solve W-H in the Z-transform domain: H(z)SY (z) = SΘY (z) H(z) = SΘY (z) SY (z) • MSE: MSE = E[|(Θk − θ̂(Y ))|2 ] = E[(Θk − θ̂(Y ))(Θ∗ k − θ̂∗ (Y ))] = E[(Θk − θ̂(Y ))Θ∗ k] +E[(Θk − θ̂(Y ))θ̂∗ (Y )] = E[ΘkΘ∗ k] − E[θ̂k(Y )Θ∗ k]] 2
  • 229. W i e n e r MSE = E[ΘkΘ∗ k] − E[θ̂k(Y )Θ∗ k]] = E[ΘkΘ∗ k] − E[ X j∈I h(k − j)YjΘ∗ k]] = E[ΘkΘ∗ k] − X j∈I h(k − j)E[YjΘ∗ k] = rΘ(0) − X j∈I h(k − j)rY Θ(j − k) • Since everything is stationary, can just take k = 0 MSE = rΘ(0) − (h ∗ rY Θ)(0) MSE = rΘ(0) − (h ∗ rY Θ)(0) = π Z −π SΘ(ω) − H(ω)SY Θ(ω)dω = π Z −π SΘ(ω) − SΘY (ω) SY (ω) SY Θ(ω)dω = π Z −π SΘ(ω) − |SΘY (ω)|2 SY (ω) dω 3
  • 230. W i e n e r Deblurring • Suppose object is observed through a blurring point spread function f and additive noise W Yk = (f ∗ Θ)k + Wk • Suppose Θ and W are uncorrelated zero-mean SY = |F| 2 Sθ + SW , and SΘY = F∗ SY • So the Wiener filter is H(z) = SΘY (z) SY (z) = F∗ (z)SΘ(z) |F(z)| 2 Sθ(z) + SW (z) Interpretation of the Deblurring Filter If noise is negligible, i.e. SW (ω) ≈ 0 H(ω) = F∗ Sθ |F| 2 Sθ + SW ≈ F∗ Sθ FF∗Sθ = 1 F • Even if there is no noise, in implementation, straight division by F(ω) is often ill-posed and not a good idea 4
  • 231. W i e n e r (round off errors, etc.) Deblurring Error MSE = π Z −π SΘ(ω) − |SΘY (ω)|2 SY (ω) dω = π Z −π SΘ(ω) − |F(ω)| 2 |SΘ(ω)|2 |F(ω)| 2 SΘ(ω) + SW (ω) dω = π Z −π SΘ[|F| 2 SΘ + SW ] − |F| 2 |SΘ|2 |F| 2 SΘ + SW dω = π Z −π SΘSW |F| 2 SΘ + SW dω Competing Approaches • Competing approaches include iterative methods such as the “Richardson-Lucy” algorithm (an EM-style procedure) – Computationally intensive – Can naturally incorporate nonnegativity – Sometimes better match to real 5
  • 232. W i e n e r statistics Discussion • Advantage of Wiener approach: – LTI filtering implementation • Disadvantages of Wiener approach: – No natural way to incorporate nonnegativity constraints (in image processing, for instance) – Only truly optimal for Gaussian statistics Real-Time Wiener Filtering • What if we don’t have “future” measurements? • Must restrict h to be causal • Solution: H(z) = 1 S− Y (z) SΘY (z) S+ Y (z) + where the meaning of the plus and minus superscripts and subscripts will be defined 6
  • 233. W i e n e r on later slides Spectral Factorization • If Y has a spectrum satisfying the Paley-Wiener criterion: π Z −π log SY (ω)dω −∞ then the spectrum can be factored as SY (ω) = S+ Y (ω)S− Y (ω) where F−1 {S+ Y } is causal and F−1 {S− Y } is anticausal Factoring Rational Spectra • If the spectrum is a ratio of polynomials, we can factor as SY (z) = S+ Y (z) | {z } Poles and zeros inside unit circle S− Y (z) | {z } Poles and zeros outside unit circle = S+ Y (z)S+ Y (z−1 ) 7
  • 234. W i e n e r • Aside: spectral factorization into causal and anticausal factors is analogous to Cholesky decomposition of a covariance matrix into lower and upper triangular factors Causal Part Extraction • We can split f into its causal and anticausal parts: f(k) = {f(k)}+ | {z } causal + {f(k)}− | {z } anticausal f(k)+ = f(k)u(k), {f(k)}− = f(k)u(−k−1) • Use similar notation for Z-transform domain F(z) = {F(z)}+ + {F(z)}− {F}+ = Z{Z−1 {F}u(k)} {F}− = Z{Z−1 {F}u(−k − 1)} How to Extract Causal Parts • If F is a ratio of polynomials can usually 8
  • 235. W i e n e r do a partial fraction expansion: F(z) = {F(z)}+ | {z } ? Poles and zeros inside unit circle + {F(z)}− | {z } ? Poles and zeros outside unit circle • Can also do polynomial long division 9
  • 236. C h e r n o ff Chernoff Bounds (Theory) • General purpose likelihood ratio test p(y; H1) p(y; H0) or p(y|H1) p(y|H0) H1 H0 λ • Consider the loglikelihood ratio test L ≡ ln Λ = ln p(y|H1) p(y|H0) H1 H0 ln λ ≡ γ • Conditional error probabilities: PD = ∞ Z γ pL|H1 (ℓ|H1)dℓ, PF A = ∞ Z γ pL|H0 (ℓ|H0)dℓ • it is often difficult, if not impossible, to find simple formulas for pL|H1 (ℓ|H1), pL|H0 (ℓ|H0) • Makes computing probabilities of detection and false alarm difficult – We could use Monte Carlo simulations, but those are cumbersome 1
  • 237. C h e r n o ff – Alternative: find easy to compute, analytic bounds on the error probabilities • Discussion based on Van Trees A Moment Generating Function ΦL|H0 (s) = E[esL |H0] = Z ∞ −∞ esℓ pL(ℓ|H0)dℓ = Z Y exp[sL(y)]pY (y|H0)dy = Z Y exp s ln pY (y|H1) pY (y|H0) pY (y|H0)dy = Z Y pY (y|H1) pY (y|H0) s pY (y|H0)dy = Z Y [pY (y|H1)] s [pY (y|H0)] 1−s dy • Define a new random variable Xs (for various values of s) with density pXs (x) ≡ esx pL|H0 (x|H0) ∞ R −∞ esℓpL|H0 (ℓ|H0)dℓ 2
  • 238. C h e r n o ff µ(s) ≡ ln ΦL|H0 (s) = ln ∞ Z −∞ esL p(ℓ|H0)dℓ µ̇(s) = R ∞ −∞ ℓesℓ p(ℓ|H0)dℓ R ∞ −∞ esℓp(ℓ|H0)dℓ = E[Xs] µ̈(s) = var[Xs] µ̇(0) = R ∞ −∞ ℓe0ℓ p(ℓ|H0)dℓ R ∞ −∞ e0ℓp(ℓ|H0)dℓ = E[L|H0] µ̇(1) = R ∞ −∞ ℓp(ℓ|H1) p(ℓ|H0) p(ℓ|H0)dℓ R ∞ −∞ p(ℓ|H1) p(ℓ|H0) p(ℓ|H0)dℓ = E[L|H1] • with µ(s) = ln ΦL|H0 (s) = ln ∞ Z −∞ esL p(l|H0)dl • Then ∞ Z γ exp[µ(s) − sx]pXs (x)dx 3
  • 239. C h e r n o ff = ∞ Z γ exp[µ(s)]e−sx esx pL|H0 (x|H0) ∞ R −∞ esℓpL|H0 (ℓ|H0)dℓ dx = ∞ Z γ pL|H0 (x|H0)dx = PF A PF A = ∞ Z γ exp[µ(s) − sx]pXs (x)dx = eµ(s) ∞ Z γ e−sx pXs (x)dx ≤ eµ(s) ∞ Z γ e−sγ pXs (x)dx = exp[µ(s) − sγ] ∞ Z γ pXs (x)dx ≤ exp[µ(s) − sγ] PF A ≤ exp[µ(s) − sγ] • We want the s ≥ 0 which makes the RHS as small as possible d ds [µ(s) − sγ] = µ̇(s) − γ, 4
  • 240. C h e r n o ff µ̇(s) = γ • Assuming everything worked (things exist, equation for maximizing s solvable, etc.): PF A ≤ exp[µ(s) − sµ̇(s)] PM ≤ exp[µ(s) + (1 − s)γ] • We want the s ≤ 1 which makes the RHS as small as possible d ds [µ(s) + (1 − s)γ] = µ̇(s) − γ µ̇(s) = γ • Assuming everything worked (things exist, equation for maximizing s solvable, etc.): PM ≤ exp[µ(s) + (1 − s)µ̇(s)] PF A ≤ exp[µ(s) − sµ̇(s)], 0 6 s 6 1 PM ≤ exp[µ(s)+(1−s)µ̇(s)], where γ = µ̇(s) µ̇(0) ≤ γ ≤ µ̇(1) E[L|H0] ≤ γ ≤ E[L|H1] Why is this useful? L can often be easily 5
  • 241. C h e r n o ff described by its moment generating function. • Let s = sm satisfy µ̇(sm) = γ = 0 Pe = 1 2 PF A + 1 2 PM ≤ 1 2 exp[µ(s)] Z ∞ 0 pXs (x)dx+ 1 2 exp[µ(s)] Z 0 −∞ pXs (x)dx Pe ≤ 1 2 exp[µ(sm)] PF A = eµ(s) ∞ Z µ̇(s) e−sx pXs (x)dx = exp[µ(s)−sµ̇(s)] ∞ Z µ̇(s) exp[+s(µ̇(s) − x)]pXs (x)dx = exp[µ(s)−sµ̇(s)] ∞ Z 0 exp[−s p µ̈(s)z]pZ(z)dz where Z = Xs − E[Xs] p var[Xs] = Xs − µ̇(s) p µ̈(s) 6
  • 242. C h e r n o ff exp[µ(s) − sµ̇(s)] ∞ Z 0 exp[−s p µ̈(s)z]pZ(z)dz | {z } Original Chernoff inequality was formed by replacing this with 1. We can get a tighter constant in some asymptotic cases. Asymptotic Gaussian Approximation • In some cases, Z approaches a Gaussian random variable as the number of samples n grows large (ex: data points iid with finite means and variances) ∞ Z 0 exp[−s p µ̈(s)z] 1 √ 2π exp z2 2 dz = exp s2 µ̈(s) 2 Q(s p µ̈(s)) 7
  • 243. C h e r n o ff PF A = exp[µ(s)−sµ̇(s)] ∞ Z 0 exp[−s p µ̈(s)z]pZ(z)dz ≈ exp[µ(s)−sµ̇(s)] exp s2 µ̈(s) 2 Q(s p µ̈(s)) • If s p µ̈(s) 3 we can approximate Q(·) using an upper bound Q(a) ≤ 1 a √ 2π exp − a2 2 PF A ≈ 1 p 2πs2µ̈(s) exp[µ(s) − sµ̇(s)] Similar Analysis Works for PM PM ≈ eµ(s)+(1−s)µ̇(s) exp (s − 1)2 µ̈(s) 2 Q((1−s) p µ̈(s)) • If (1 − s) p µ̈(s) 3 we can approximate Q(·) using the upper bound PM ≈ 1 p 2π(1 − s)2µ̈(s) exp[µ(s) + (1 − s)µ̇(s)] Asymptotic Analysis for Pe 8
  • 244. C h e r n o ff • For the case of equal priors and equal costs, if the conditions for the approximation for Q(·) to be valid on the previous to slides holds, we have Pe ≈ 1 2sm(1 − sm) p 2πµ̈(sm) exp[µ(sm)] 9
  • 245. C h e r n o ff Chernoff Bounds (Gaussian Examples) Consider the loglikelihood ratio test L ≡ ln Λ = ln p(y|H1) p(y|H0) H1 H0 ln λ ≡ γ Main object of interest: µ(s) ≡ ln ΦL|H0 (s) ΦL|H0 (s) = E[esL |H0] = Z ∞ −∞ esl pL(l|H0)dl = Z Y [p(y|H1)] s [p(y|H0)] 1−s dy Both representations will be useful 10
  • 246. C h e r n o ff Gaussian, Equal Variances H1 ∼ N(m, σ2 ), H0 ∼ N(0, σ2 ) µ(s) = ln Z Y [p(y|H1)] s [p(y|H0)] 1−s dy = ln ∞ Z −∞ · · · ∞ Z −∞ ( n Y i=1 1 √ 2πσ2 exp − (yi − m)2 2σ2 )s × ( n Y i=1 1 √ 2πσ2 exp − y2 i 2σ2 )1−s dy1 · · · dyn = n ln ∞ Z −∞ 1 √ 2πσ2 exp − (y − m)2 s + y2 (1 − s) 2σ2 dy 11
  • 247. C h e r n o ff Completing the Square ∞ Z −∞ 1 √ 2πσ2 exp − (y − m)2 s + y2 (1 − s) 2σ2 dy = ∞ Z −∞ 1 √ 2πσ2 exp − (y2 − 2my + m2 )s + y2 (1 − s) 2σ2 dy = ∞ Z −∞ 1 √ 2πσ2 exp − y2 − 2msy + m2 s 2σ2 dy = ∞ Z −∞ 1 √ 2πσ2 exp − (y2 − 2msy + m2 s2 ) − m2 s2 + m2 s 2σ2 12
  • 248. C h e r n o ff Finish Computing the µ = ∞ Z −∞ 1 √ 2πσ2 exp − (y2 − 2msy + m2 s2 ) − m2 s2 + m2 s 2σ2 = ∞ Z −∞ 1 √ 2πσ2 exp − (y − ms)2 2σ2 exp − m2 s(1 − s) 2σ2 dy = exp m2 s(s − 1) 2σ2 = tmp µ(s) = n ln {tmp} = s(s − 1) 2 nm2 σ2 ≡ s(s − 1) 2 d2 13
  • 249. C h e r n o ff Basic Bound on PFA µ(s) = s(s − 1) 2 d2 , µ̇(s) = 2s − 1 2 d2 PF A ≤ exp[µ(s) − sµ̇(s)], for 0 ≤ s ≤ 1 = exp s(s − 1) 2 d2 − s (2s − 1) 2 d2 = exp − s2 2 d2 where γ = µ̇(s), γ = 2s − 1 2 d2 s = γ d2 + 1 2 14
  • 250. C h e r n o ff Basic Bound on PM Pm ≤ exp[µ(s) + (1 − s)µ̇(s)] = exp s(s − 1) 2 d2 + (1 − s) (2s − 1) 2 d2 = exp s2 − s 2 d2 + 2s − 1 − 2s2 + s 2 d2 = exp 2s − 1 − s2 2 d2 = exp − (1 − s)2 2 d2 15
  • 251. C h e r n o ff Where are the Bounds Meaningful? Recall we need E[L|H0] ≤ γ ≤ E[L|H1] µ̇(0) ≤ γ ≤ µ̇(1) 2 · 0 − 1 2 d2 ≤ γ ≤ 2 · 1 − 1 2 d2 − d2 2 ≤ γ ≤ d2 2 16
  • 252. C h e r n o ff The Refined Bound for PFA Recall the refined asymptotic bound: PF A ≈ exp[µ(s) − sµ̇(s)] exp   s2µ̈(s) 2   Q(s q µ̈(s)) µ̇(s) = 2s − 1 2 d2 , µ̈(s) = d2 In this case, since L is a sum of Gaussian random variables, the expression is exact: PF A = exp − s2 2 d2 exp s2 d2 2 Q(sd) = Q(sd) 17
  • 253. C h e r n o ff The Refined Bound for PM PM ≈ eµ(s)+(1−s)µ̇(s) exp (s − 1)2 µ̈(s) 2 Q((1−s) p µ̈(s)) = exp − (1 − s)2 2 d2 exp (s − 1)2 d2 2 Q((1−s)d) Again, since L is Gaussian, the expression is exact: PM = Q((1 − s)d) 18
  • 254. C h e r n o ff Minimum Prob. of Error For minimum prob. of error test, γ = 0 sM = γ d2 + 1 2 = 1 2 Recall approximate expression for Pe from last slide of last lecture Pe ≈ 1 2sm(1 − sm) p 2πµ̈(sm) exp[µ(sm)] = 1 2sm(1 − sm) √ 2πd2 exp sm(sm − 1) 2 d2 19
  • 255. C h e r n o ff Min. Prob. of Error Cont Pe ≈ 2 √ 2πd2 exp − d2 8 Recall the exact expression is: Pe = Q(d/2) Van Trees’ rule of thumb: approximation is very good for d 6 20
  • 256. C h e r n o ff The Bhattacharyya Distance If the criterion is the minimum prob. of error and µ(s) is symmetric about s = 1/2, then µ(s) = ln Z Y p p(y|H1) p p(y|H0)dy µ(s) is called the Battacharyya distance 21
  • 257. C h e r n o ff Gaussian, Equal Means H1 ∼ N(0, σ2 1), H0 ∼ N(0, σ2 0), µ(s) = n 2 ln (σ2 0)s (σ2 1)1−s sσ2 0 + (1 − s)σ2 1 A common special case: σ2 1 = σ2 s + σ2 n, σ2 0 = σ2 n µ(s) = n 2 (1 − s) ln 1 + σ2 s σ2 n − ln 1 + (1 − s) σ2 s σ2 n Gaussian, Equal Means µ̇(s) = n 2 − ln 1 + σ2 s σ2 n + σ2 s /σ2 n 1 + (1 − s)σ2 s /σ2 n µ̈(s) = n 2 σ2 s /σ2 n 1 + (1 − s)σ2 s /σ2 n 2 22
  • 258. U M P Uniformly Most Powerful Tests • Usual parametric data model p(y; θ) • Consider a composite problem: H0 : θ = θ0, H1 : θ ∈ S1 • A test φ∗ is uniformly most powerful of level α = PF A if it has a better PD (or at least as good as) than any other αlevel test PD(φ∗ ; θ) = Eθ[φ∗ ] ≥ Eθ[φ] = PD(φ, θ) for all θ ∈ S1, for all φ α = PF A(φB) = PF A(φA) = PF A(φ∗ ) PD(φB,θ) PD(φA,θ) PD(φ∗ ,θ) Figure 1: UMP test. 1
  • 259. U M P • Find the most powerful α-level (recall α = PF A) test for a fixed θ • Just the Neyman-Pearson test If the decision regions do not vary with θ, then the test is UMP Gaussian Mean Example • Suppose we have n i.i.d samples Yi ∼ N(µ, σ2 ) • Assume σ2 is known, but µ is not • Consider three cases H0 : µ = 0 Case I : H1 : µ 0 Case II : H1 : µ 0 Case III : H1 : µ 6= 0 Suffices to use ȳ = 1 n n X i=1 yi 2
  • 260. U M P Ȳ ∼ N(µ, σ2 /n) Λ(y; µ) = p(y; µ) p(y; 0) = exp[−(ȳ − µ)2 /(2σ2 /n)] exp[−ȳ2/(2σ2/n)] = exp[(−ȳ2 + 2ȳµ − µ2 )/(2σ2 /n)] exp[−ȳ2/(2σ2/n)] = exp nµ σ2 ȳ − nµ2 2σ2 H1 H0 τ √ nµ σ ȳ H1 H0 ln τ + nµ2 2σ2 σ √ n ≡ γ • Case I: µ 0 √ nµ σ ȳ H1 H0 γ −→ √ n σ ȳ H1 H0 γ µ = γ+ • Set the threshold to get the right “level” α = PF A = Pr √ n σ Ȳ γ+
  • 261.
  • 262.
  • 263.
  • 264. H0 = Q(γ+ ) γ+ = Q−1 (α) • Notice the test does not depend on µ; 3
  • 265. U M P hence, it is UMP PD = Pr √ n σ Ȳ γ+
  • 266.
  • 267.
  • 268.
  • 269. H1 = Pr √ n σ (Ȳ − µ) γ+ − √ n σ µ
  • 270.
  • 271.
  • 272.
  • 273. H1 = Q γ+ − √ n σ µ = Q Q−1 (α) − √ n σ µ ≡ Q(Q−1 (α) − d) • Case II: µ 0 √ nµ σ ȳ H1 H0 γ −→ √ n σ ȳ H0 H1 γ µ = γ+ α = PF A = Pr √ n σ Ȳ γ−
  • 274.
  • 275.
  • 276.
  • 277. H0 = 1−Q(γ− ) γ− = Q−1 (1 − α) Case II is UMP also • Power of the Single-Sided Test (Case II) PD = Pr √ n σ Ȳ γ−
  • 278.
  • 279.
  • 280.
  • 281. H1 4
  • 282. U M P = Pr √ n σ (Ȳ − µ) γ− − √ n σ µ
  • 283.
  • 284.
  • 285.
  • 286. H1 = 1−Q γ− − √ n σ µ = 1−Q Q−1 (1 − α) − √ n σ µ = 1 − Q(Q−1 (1 − α) − d) • Case III: µ 6= 0 √ nµ σ ȳ H1 H0 γ • we cant just absorb µ into the threshold anymore without effecting the inequalities! • Decision region varies with sign of µ • No UMP test exists!!! • Cauchy Median Example • Suppose we have a single sample from the density p(y; θ) = 1 π 1 1 + (y − θ)2 and we want to decide H0 : θ = 0, H1 : θ 0 5
  • 287. U M P • Likelihood ratio is p(y; θ) p(y; 0) = 1 + y2 1 + (y − θ)2 H1 H0 τ • Decision region depends on θ • so no UMP exists! • The Monotone Likelihood Ratio(MLR) Condition H0 : θ = θ0, H1 : θ ∈ S1 • Suppose we have a Fisher Factorization p(y; θ) = g(T(y), θ)h(y) • A UMP test of any level α exists if the likelihood ratio is either monotone increasing or decreasing in T for all Λ(y; θ) = p(y; θ) p(y; θ0) = g(T, θ) g(T, θ0) ≡ Λ(T; θ) • Densities Satisfying MLR Condition • Suppose we have a one-sided test: H0 : θ = θ0, H1 : θ θ0 6
  • 288. U M P or H0 : θ = θ0, H1 : θ θ0 • The following satisfy the MLR condition: – i.i.d. samples from 1-D exponential family (Gaussian, Bernoulli, Exponential, Poisson, Gamma, Beta) – i.i.d. samples from uniform density U(0, θ) – i.i.d. samples from shifted Laplace Also works for H0 : θ θ0, H1 : θ θ0 • Densities Not Satisfying MLR Condition • Gaussian with single-sided H1 on mean but unknown variance • Cauchy density with single-sided H1 on centrality parameter • Exponential family with double-sided H1 7
  • 289. L M P T Locally Most Powerful Tests • Usual parametric data model p(y; θ) • Consider a single-sided problem: H0 : θ = θ0, H1 : θ θ0 • What to do if there is no UMP test? • The locally most powerful test of level α has a power curve that maximizes the slope of PD(θ) at θ = θ0 φLMP = arg max φ∈{α−level} d dθ PD(φ; θ) = arg max φ∈{α−level} d dθ Eθ[φ] 1
  • 290. L M P T α =PFA(φB)=PFA(φA)=PFA(φLMP ) θ0 θ PD(φLMP ,θ) PD(φB,θ) PD(φA,θ) Figure 1: Graphical Interpretation of LMP Test. • Solution to LMP Problem • Using a proof similar to that for the Neyman-Pearson lemma, one can show the LMP test is d dθ p(y; θ)
  • 291.
  • 292. θ=θ0 p(y; θ0) H1 H0 λ where we pick λ to achieve Eθ0 [φ] 6 α • As before, we may need a randomized test if there is a nonzero prob. of landing exactly on the threshold (but we wont worry about that) 2
  • 293. L M P T • Another Interpretation of LMP Test d dθ p(y; θ)
  • 294.
  • 296.
  • 297.
  • 298.
  • 299. θ=θ0 H1 H0 λ Suppose λ = 0 and likelihood is unimodal θ0 θ̂ML(y) θ Figure 2: Decide H1 if the slope at θ0 is positive, i.e. ML estimate gives evidence that θ θ0. • One-Sided Gaussian Mean Example yi ∈ N(θ, σ2 ), H0 : θ = 0, H1 : θ 0 d dθ ln p(y; θ) = d dθ c − n X i=1 (y − θ)2 2σ2 #
  • 300.
  • 301.
  • 302.
  • 303.
  • 306.
  • 307.
  • 308.
  • 309.
  • 310. θ=0 H1 H0 λ √ n σ ȳ H1 H0 λ σ √ n | {z } Just the UMP test we discussed before = γ • Cauchy Median Example • Suppose we have n Cauchy samples p(y; θ) = n Y i=1 1 π 1 1 + (yi − θ)2 and we want to decide H0 : θ = 0, H1 : θ 0 d dθ ln p(y; θ) = d dθ {c − ln[1 + (yi − θ)2 ]} 2 n X i=1 (yi − θ) 1 + (yi − θ)2
  • 311.
  • 312.
  • 313.
  • 314.
  • 315. θ=0 = 2 n X i=1 yi 1 + y2 i H1 H0 λ • Step 1: Pass each data point through a memoryless nonlinearity g(y) = y 1 + y2 4
  • 316. L M P T −5 −4 −3 −2 −1 0 1 2 3 4 5 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5 y g(y) 1 Figure 3: Memoryless nonlinearity. • Step 2: Sum all the nonlinearity outputs • Step 3: Compare to a threshold • Two-Sided LMP Tests: • Now consider a double-sided problem: H0 : θ = θ0, H1 : θ 6= θ0 • The locally most powerful “unbiased” test of level α has a power curve which 5
  • 317. L M P T maximizes the curvature of PD(θ) at θ = θ0 φLMP = arg max φ d2 dθ2 PD(φ; θ)
  • 318.
  • 319.
  • 320.
  • 321. θ=θ0 subject to PF A(φ) = Eθ0 [φ] ≤ α and d dθ PD(φ, θ)
  • 322.
  • 323.
  • 324.
  • 326.
  • 327.
  • 328.
  • 329. θ=θ0 = 0 α = PF A(φA) = PF A(φLMP U ) PD(φLMPU ,θ) PD(φA,θ) α θ0 θ Figure 4: Graphical Interpretation of LMPU Test. • Say a test is unbiased if PD(θ) PF A∀θ 6
  • 330. L M P T • Solution to LMPU Problem: • Using Lagrange multiplers, one can show the test is d2 dθ2 p(y; θ)
  • 331.
  • 332.
  • 333.
  • 334. θ=θ0 H1 H0 λ p(y; θ0) + ρ d dθ p(y; θ)
  • 335.
  • 336.
  • 337.
  • 338. θ=θ0 ! where λ and ρ are picked to satisfy constraints • If we’re really lucky, it sometimes turns out that rho = 0 d2 dθ2 p(y; θ)
  • 339.
  • 340.
  • 341. θ=θ0 p(y; θ0) H1 H0 λ • Two-Sided Gaussian Mean Example: yi ∈ N(θ, σ2 ), H0 : θ = 0, H1 : θ 6= 0 p(y; θ) = 1 p 2πσ2/n exp − (ȳ − θ)2 2σ2/n d dθ p(y; θ) = ȳ − θ σ2/n p(y; θ) d2 dθ2 p(y; θ) = ȳ − θ σ2/n d dθ p(y; θ) − 1 σ2/n p(y; θ) 7
  • 343.
  • 344.
  • 345.
  • 346. θ=θ0 H1 H0 λ p(y; θ) + ρ d dθ p(y; θ)
  • 347.
  • 348.
  • 349.
  • 350. θ=θ0 ! ȳ σ2/n 2 − 1 σ2/n p(y; 0) p(y; 0) + ρ ȳ σ2/n p(y; 0) H1 H0 λ ȳ2 − σ2 /n (σ2/n + ρȳ) H1 H0 λ(σ2 /n) Usual first attempt: Wishful thinking! Try ρ = 0 and hope it works! Assuming this works for the moment, the test is |ȳ| H1 H0 p λ(σ2/n)2 + σ2/n ≡ γ α = PF A = Pr
  • 351.
  • 352.
  • 353.
  • 354. γ|H0 = 1−Pr −γ Ȳ γ|H0 = 1 − Pr    −γ √ n σ Ȳ √ n σ | {z } N (0,1) γ √ n σ
  • 355.
  • 356.
  • 357.
  • 358.
  • 359.
  • 360.
  • 361.
  • 363. L M P T = 2Q(γ √ n/σ) γ = σ √ n Q−1 (α/2) • Computing the Power Curve PD(θ) = Pr
  • 364.
  • 365.
  • 366.
  • 367. γ|H1 = Pr Ȳ γ|H1 + Pr Ȳ −γ|H1 = Pr    (Ȳ − θ) √ n σ | {z } N (0,1) (γ − θ) √ n σ
  • 368.
  • 369.
  • 370.
  • 371. H1     + Pr    (Ȳ − θ) √ n σ | {z } N (0,1) −(γ + θ) √ n σ
  • 372.
  • 373.
  • 374.
  • 375. H1     = Pr N (γ − θ) √ n σ
  • 376.
  • 377.
  • 378.
  • 379. H1 +Pr N − (γ + θ) √ n σ
  • 380.
  • 381.
  • 382.
  • 383. H1 = Pr N (γ − θ) √ n σ
  • 384.
  • 385.
  • 386.
  • 387. H1 +Pr −N (γ + θ) √ n σ
  • 388.
  • 389.
  • 390.
  • 391. H1 = Q (γ − θ) √ n/σ + Q (γ + θ) √ n/σ = Q Q−1 (α/2) − d + Q (Q−1 (α/2) + d) 9
  • 392. L M P T where d = θ √ n/σ 10
  • 393. G L R T Generalized Likelihood Ratio Tests and Model Order Selection Criteria • Usual parametric data model p(y; θ) • In previous lectures on LMP tests, we assumed specials structures like: H0 : θ = θ0, H1 : θ θ0 or H0 : θ = θ0, H1 : θ 6= θ0 • What should we do if we have a more general structure like: H0 : θ ∈ S0, H1 : θ ∈ S1 • Often, we do something a bit ad-hoc. 1
  • 394. G L R T The GLRT • Find parameter estimates θ̂0 and θ̂1 under H0 and H1 • Substituting estimates into likelihood ratio yields a generalized likelihood ratio test • Substituting estimates into likelihood ratio yields a generalized likelihood ratio test ΛGLR(y) = p(y; θ̂1) p(y; θ̂0) H1 H0 λ • If convenient, use ML estimates: max θ∈S1 p(y; θ) max θ∈S0 p(y; θ) H1 H0 λ 2
  • 395. G L R T Two Sided Gaussian Mean Example (1) Yi ∼ N(θ, σ2 ), H0 : θ = 0, H1 : θ 6= 0 ln p(y; θ̂) p(y; 0) = − n X i=1 yi − 1 n n P j=1 yj !2 2σ2 + n X i=1 y2 i 2σ2 = n P i=1 2yi 1 n n P j=1 yj ! − n 1 n n P i=1 yi 2 2σ2 3
  • 396. G L R T Two-Sided Gaussian Mean Example (2) 2n 1 n n P i=1 yi 2 − n 1 n n P i=1 yi 2 2σ2 = n 2σ2 ȳ2 H1 H0 λ |ȳ| H1 H0 γ 4
  • 397. G L R T Some Gaussian Examples • Single population: • Statistic has a Student-T distribution • Asymptotically Gaussian – Tests on mean, with unknown variance yield “T-tests” – A T-test is any statistical hypothesis test in which the test statistic has a Student’s t distribution if the null hypothesis is true. It is applied when sample sizes are small enough that using an assumption of normality and the associated z-test leads to incorrect inference. Suppose X1, · · · , Xn are independent random variables that are normally distributed with expected value µ and variance σ2 . Let Xn = (X1 + · · · + Xn)/n 5
  • 398. G L R T be the sample mean, and Sn 2 = 1 n − 1 n X i=1 Xi − Xn 2 be the sample variance. It is readily shown that the quantity Z = Xn − µ σ/ √ n is normally distributed with mean 0 and variance 1, since the sample mean Xn is normally distributed with mean µ and standard deviation σ/ √ n. Gosset studied a related quantity, T = Xn − µ Sn/ √ n , which differs from Z in that the exact standard deviation σ is replaced by the random variable Sn. Technically, (n−1)S2 n/σ2 has a χ2 n−1 distribution by Cochran’s theorem. Gosset’s work showed that T has the probability 6
  • 399. G L R T density function f(t) = Γ(ν+1 2 ) √ νπ Γ(ν 2 ) 1 + t2 ν −( ν+1 2 ) , ν = n−1 Confidence intervals derived from Student’s t-distribution Suppose the number A is so chosen that Pr(−A T A) = 0.9, when T has a t-distribution with n − 1 degrees of freedom. This is the same as Pr(T A) = 0.95, so A is the “95th percentile” of this probability distribution, or A = t(0.05,n−1). Then Pr −A Xn − µ Sn/ √ n A = 0.9, Pr Xn − A Sn √ n µ Xn + A Sn √ n = 0.9 Therefore the interval whose endpoints 7
  • 400. G L R T are Xn ± A Sn √ n is a 90-percent confidence interval for µ. • Two populations: – Tests on equality of variances, with unknown means yields a “Fisher F-test” – Statistic has a Fisher-F distribution An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true A random variate of the F-distribution arises as the ratio of two chi-squared variates: U1/d1 U2/d2 U1 and U2 have chi-square distributions with d1 and d2 degrees of freedom respectively The probability density function of an F(d1, d2) distributed random variable is 8
  • 401. G L R T given by g(x) = 1 β(d1/2, d2/2) d1 x d1 x + d2 d1/2 1 − d1 x d1 x + d2 d2/2 x−1 for real x 0, where d1 and d2 are positive integers, and β is the beta function, β(x, y) = R 1 0 tx−1 (1 − t)y−1 dt. – Asymptotically Chi-Square • Suppose n → ∞. Since the ML estimates are asymptotically consistent, the GLRT is asymptotically UMP • If the GLRT is hard to analyze directly, sometimes asymptotic results can help • Assume a partition θ = (ϕ1, · · · , ϕp, ξ1, · · · , ξq | {z } (nuisance parameters) ) • Consider GLRT for a two-sided problem H0 : φ = φ0, H1 : φ 6= φ1 where ξ is unknown, but we don’t care what it is 9
  • 402. G L R T • When the density p(y; θ) is smooth under H0, it can be shown that for large n 2 ln ΛGLR(Y ) = 2 ln p(Y ; θ̂) p(Y ; θ0) ∼ χp χp = Chi-square with p degrees of freedom E[χp] = p, var(χp) =2p Link to Bayesian Land • Remember if we had a prior p(θ), we could handle composite hypothesis tests by integrating and reducing things to a simple hypothesis test p(y) = Z Rp p(y|θ)p(θ)dθ • If p(θ) varies slowly compared to p(y|θ) around the MAP estimate, we can approx. p(y) ≈ p(θ) Z Rp exp[L(θ)]dθ • Suppose MAP and ML estimates are approximately equal Laplace’s Approximation 10
  • 403. G L R T • Do a Taylor series expansion Z Rp exp[L(θ̂ML) − (θ − θ̂ML)T F(y; θ)(θ − θ̂ML)T 2 ]dθ = eL(θ̂ML) Z Rp exp − (θ − θ̂ML)T F(y; θ̂ML)(θ − θ̂ML)T 2 # dθ where F(y; θ̂ML) = − d2 L(θ) dθrdθc
  • 404.
  • 405.
  • 406.
  • 407. θ=θ̂ML #    Empirical Fisher Info • Recognize quadratic form of the Gaussian: Z Rp exp − (θ − θ̂ML)T F(y; θ̂ML)(θ − θ̂ML)T 2 # dθ = (2π)p/2 q det F(y; θ̂ML) so p(y) = p(θ̂ML)p(y|θ̂ML) (2π)p/2 q det F(y; θ̂ML) Large Sample Sizes 11
  • 408. G L R T • Consider the logdensity: ln p(y) ≈ ln p(θ̂ML) + ln p(y|θ̂ML) + p 2 ln 2π − 1 2 ln det F(y; θ̂ML) • Suppose we have n iid samples. By the law of large numbers: ln det F(y|θ̂ML) ≈ ln det F(θ̂ML) = ln det[nI] + ln det F1(θ̂ML) = ln np +ln det F1(θ̂ML) = p ln n+ln det F1(θ̂ML) Schwarz’s Result • As n → ∞ ln p(y) ≈ ln p(θ̂ML) + L(θ̂ML) + p 2 ln 2π − 1 2 p ln n − 1 2 ln det F1(θ̂ML) ≈ L(θ̂ML) − p 2 ln n • Called Bayesian Information Criterion (BIC) or Schwarz Information Criterion (SIC) • Often used in model selection; second term 12
  • 409. G L R T is a penalty on model complexity Minimum Description Length • BIC is related to Rissanens Minimum Description Length criterion; (p/2) ln(n) is viewed as the optimum number of “nats” (like bits, but different base) used to encode the ML parameter estimate with limited precision • Data is encoded with a string of length description length = −L(θ̂ML) + p 2 ln n • Choose model which describes the data using the smallest number of bits (or nats) 13
  • 410. G L R T General Multivariate Gaussian Detection Problems • We have a data vector y = [y1, · · · , yn]T distributed according to N(µ, R) • Two hypotheses: H0 : µ = µ0, R = R0, H1 : µ = µ1, R = R1 • Likelihood ratio: Λ(y) = √ det R0 exp[−1 2 (y − µ1)T R−1 1 (y − µ1)] √ det R1 exp[−1 2 (y − µ0)T R−1 0 (y − µ0)] • Test looks like T(y) H1 H0 log τ + 1 2 log det R1 det R0 ≡ γ where T(y) ≡ (y − µ0)T R−1 0 (y − µ0) 2 − (y − µ1)T R−1 1 (y − µ1) 2 Mahalanobis Distance Interpretation • Define a norm on ℜn : kzkR = √ zT R−1z 1
  • 411. G L R T • Emphasizes components of z which are collinear to eigenvectors of R associated with small eigenvalues • we can rewrite test statistic as T(y) = ky − µ0k 2 R0 2 − ky − µ1k 2 R1 2 Quadratic Form Interpretation • Alternatively, express using a new statistic T′ (y) ≡ yT (R−1 0 − R−1 1 )y 2 +(µT 1 R−1 1 −µT 0 R−1 0 )y and a new threshold γ′ = ln τ+ 1 2 ln det R1 det R0 +µT 1 R−1 1 µ1−µT 0 R−1 0 µ0 Four Kinds of Decision Regions 1. If the covariances are equal, i.e. R = R0 = R1, the test reduces to (µT 1 −µT 0 )R−1 y = aT y, a ≡ R−1 (µ1 −µ0) and the decision region is a hyperplane 2
  • 412. G L R T 2. If R0 R1 i.e., R0 − R1 is positive definite, the H1 decision region is the interior of an ellipsoid 3. If R0 R1 i.e., R0 − R1 is negative definite, the H1 decision region is the exterior of an ellipsoid 4. If none of the above apply, i.e. R0 − R1 is neither singular, positive definite, or negative definite, then the decision region has hyperbolic boundaries Known Signal in White Noise • Consider the familiar special case: µ0 = 0, µ1 = s, R = σ2 I • Test can be expressed as T(y) ≡ sT y H1 H0 log τ − ksk 2 σ2 ≡ γ • Analysis from univariate case now applies “Deflection ratio” or “detectability index” is d2 = ksk 2 /σ2 ≡ SNR 3
  • 413. G L R T Known Signal in Colored Noise • Now let R be general • Neyman-Pearson test has the form sT R−1 y √ sT R−1s H1 H0 Q−1 (α) • Could transform to white noise case by linearly preprocessing the data ỹ = Hy to give an equivalent test in terms of s̃ = Hs s̃T ỹ ks̃k H1 H0 Q−1 (α) The Prewhitening Transformation • Use MATLAB (or whatever) to compute the eigendecomposition R = UDUT • We can define R1/2 = UD1/2 • Our prewhitening operation is H = R−1/2 = D−1/2 U−1 = D−1/2 UT 4
  • 414. G L R T Performance for Colored Noise • Test statistic for the colored noise problem is still Gaussian, so we can again use formulas from the univariate case with d2 = ks̃k 2 = sT R−1 s • In white noise case, performance depended only on the total power of s, not its shape • Here, in the colored noise case, performance depends on shape as well! Signal Design for Colored Noise • Problem: maximize d2 = sT R−1 s subject to the constraint ||s||2 = 1 • Rayleigh quotient theorem says: sT R−1 s sT s ≤ 1 mini λi furthermore, we have equality if s is a minimizing eigenvector of R • So, to make d2 big, pick s to be that minimizing eigenvector 5
  • 415. G L R T • For equal means, test statistic purely quadratic (y − µ)T (R−1 0 − R−1 1 )(y − µ) • Analysis is simplified by prefiltering to diagonalize R−1 0 − R−1 1 • Still a total pain; test statistic is a mixture of chi-square random variables • For unequal means, it’s even more of a pain; test statistic is mixture of noncentral chi-square random variables A Simple Zero-Mean Case H0 : Yk = Wk, H1 : Yk = Sk + Wk Wk ∼ N(0, σ2 W ), Sk ∼ N(0, σ2 SK ) note signal power is time-varying • In our generic notation, we have R0 = diag(σ2 W ), R1 = diag(σ2 W + σ2 Sk ) 6
  • 416. G L R T R−1 0 − R−1 1 = diag 1 σ2 W − 1 σ2 W + σ2 Sk ! R−1 0 − R−1 1 = diag 1 σ2 W σ2 Sk [σ2 W + σ2 Sk ] ! κk = σ2 Sk [σ2 W + σ2 Sk ] • Test statistic is yT (R−1 0 − R−1 1 )y = 1 σ2 W n X k=1 κky2 k • Several different interpretations Filter-Squarer Interpretation n X k=1 κky2 k = n X k=1 ( √ κkyk)2 7
  • 417. G L R T yk × κk (·)2 P γ H1 H0 Estimator-Correlator Interpretation n X k=1 κky2 k = n X k=1 yk(κkyk) = n X k=1 yk σ2 Sk σ2 W + σ2 Sk yk ! Ŝk = E[Sk|Yk = yk] yk × Ŝk σ2 Sk σ2 W +σ2 Sk P γ H1 H0 8
  • 418. G L R T Wide-Sense-Stationary Sequences • Suppose Sk and Wk are wide-sense stationary Gaussian time series with power spectral densities SS(ejω ), SW (ejω ), ω ∈ [0, 2π) • Estimator-Correlator structure generalizes yk × Ŝk SS (ω) SS (ω)+SW (ω) P γ H1 H0 9
  • 419. D T S D I A N Discrete-Time Signal Detection in Additive Noise Signal w/Known Amplitude in IID Noise • We have data samples y1, · · · , yn drawn from H0 : Yk = Nk or H1 : Yk = sk + Nk • The loglikelihood ratio test test compares a threshold to: n X k=1 ln Lk(yk) = n X k=1 ln pN (yk − sk) pN (yk) • General detector for known amplitude: yk log Lk a function of sk P γ H1 H0 Gaussian Case (Known Amplitude) 1
  • 420. D T S D I A N • If noise is N(0, σ2 ), n X k=1 ln Lk(yk) = n X k=1 sk(yk − sk/2) = n X k=1 skyk − s2 k/2 • In a Neyman-Pearson or MinMax setting, we could absorb the s2 k/2 term into the threshold, and the test statistic is just a correlation n X k=1 skyk Laplacian Case (Known Amplitude) • Suppose noise has a Laplacian density pN (x) = α 2 exp(−α |x|) • Optimum detector looks like yk + −sk/2 − − p p |sk|/2 |sk|/2 −|sk|/2 −|sk|/2 × sgn(sk) P γ H1 H0 2