Robustness under Independent Contamination Model

Robustness under Independent Contamination

Mike Danilov

November 21, 2009

Traditional robustness
  Definition of contamination
  Simple examples
  Weighted representation

Independent Contamination
  The Idea
  Why traditional robust estimates don’t work
  Naive approaches
  Cell-weighting approach

The Problem (aka Disclaimer) and Terminology

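Estimation of mean vector $\mu$ and covariance matrix $\Sigma$ of
supposedly i.i.d. multivariate sample: $x_1, \dots, x_n \in \mathbb{R}^p$.
Data matrix

\[
X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
  = \begin{pmatrix}
      x_{11} & x_{12} & \cdots & x_{1p} \\
      x_{21} & x_{22} & \cdots & x_{2p} \\
      \vdots & \vdots & \ddots & \vdots \\
      x_{n1} & x_{n2} & \cdots & x_{np}
    \end{pmatrix}
\]

Vectors $x_i \in \mathbb{R}^p$ – data cases
Values $x_{ij} \in \mathbb{R}$ – data values or cells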

Types of error in Statistics
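1. Usual statistical error.
Every observation is moderately affected:

\[ X_{\text{obs}} = X_{\text{mean}} + e, \quad \text{with } e \sim N(0, \sigma^2), \]

where variance of $e$ defines the quality of the data.

2. Contamination.
Some observations are ruined:

\[
X_{\text{obs}} =
\begin{cases}
  X_{\text{good}}, & \text{usually} \\
  X_{\text{horrible}}, & \text{sometimes.}
\end{cases}
\]

Typically comes on top of the usual error:

\[ X_{\text{good}} = X_{\text{mean}} + e. \]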

Mixture contamination model
Observed data come from the mixture distribution
\[ F = (1 - \varepsilon) F_0(\theta) + \varepsilon H \]
$F_0(\theta)$ is the distribution of interest
$H$ is an arbitrary unknown nuisance distribution.
Equivalently
\[ X = (1 - B) X_{\text{good}} + B X_{\text{horrible}}, \]
where $B$ is a Bernoulli($\varepsilon$) indicator.
Estimate $T(F)$: feed data from $F$, obtain estimates for $\theta$.
Breakdown point

\[ \varepsilon_{BP}(T) = \sup \Big\{ \varepsilon : \sup_H \| T(F(\theta, \varepsilon, H)) \| < \infty \Big\}, \]

that is, the maximum $\varepsilon$ such that $T$ can still isolate $F_0$ from $H$.
Maximum achievable (and desirable)
\[ \varepsilon_{BP}(T) \le 0.5. \]
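
The mixture model above can be simulated directly. The following is a minimal sketch, not part of the original slides: the choices $F_0 = N(0, 1)$, $H$ = point mass at 1000, and the helper name `sample_mixture` are assumptions made only for illustration. It compares the sample mean (breakdown point 0) with the median (breakdown point 0.5) at $\varepsilon = 0.1$.

```python
import random
import statistics


def sample_mixture(n, eps, good, horrible, rng):
    """Draw n values of X = (1 - B) * X_good + B * X_horrible, B ~ Bernoulli(eps).

    `good` and `horrible` are callables returning one draw from F0 and H.
    """
    out = []
    for _ in range(n):
        b = 1 if rng.random() < eps else 0
        out.append((1 - b) * good() + b * horrible())
    return out


rng = random.Random(0)
x = sample_mixture(
    n=10_000,
    eps=0.10,
    good=lambda: rng.gauss(0.0, 1.0),  # F0 = N(0, 1): assumed for illustration
    horrible=lambda: 1000.0,           # H = point mass at 1000: assumed
    rng=rng,
)
mean_x = statistics.fmean(x)     # dragged far away from 0 by the contamination
median_x = statistics.median(x)  # stays close to the centre of F0
```

Moving the point mass in `horrible` further out moves `mean_x` without bound, while `median_x` stays within a bounded range as long as `eps` is below 0.5.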

Examples: simple robust estimates

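Location
Median: $x_{(n/2)}$
Trimmed mean: $\dfrac{1}{n(1-\delta)} \sum_{i = n\delta/2}^{n(1-\delta/2)} x_{(i)}$, with $\delta \in (0, 1)$.

Scale
MAD: $\operatorname{Median}_i \, \big| x_i - \operatorname{Median}_j \, x_j \big|$
IQR: $x_{(3n/4)} - x_{(n/4)}$
Regression
LMS: $\arg\min_\beta \operatorname{Median}_i \, (y_i - \beta^\top x_i)^2$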
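
The location and scale estimates above are short to write down in code. This is a minimal sketch using only the Python standard library; the function names `trimmed_mean`, `mad`, and `iqr` are hypothetical, and the quartile interpolation used by `statistics.quantiles` is one of several common conventions, so it only approximates the order-statistic formula on the slide.

```python
import statistics


def trimmed_mean(x, delta):
    """Mean after dropping a fraction delta/2 of the values from each tail."""
    xs = sorted(x)
    k = int(len(xs) * delta / 2)  # number of values trimmed per tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    m = statistics.median(x)
    return statistics.median([abs(v - m) for v in x])


def iqr(x):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8, 1000]  # one gross outlier
location = (statistics.mean(data), statistics.median(data), trimmed_mean(data, 0.25))
scale = (statistics.stdev(data), mad(data), iqr(data))
```

On this toy sample the mean and standard deviation are dominated by the single value 1000, whereas the median, trimmed mean, MAD, and IQR are essentially unchanged from what the first eight values alone would give.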

Examples: multivariate robust estimates
Minimum Covariance Determinant (MCD) by Rousseeuw (1985):
minimize determinant of sample covariance of 50% of data points:
[Figure: plot with legend entries "Sample Covariance", "MCD", and "Clean"; axis range −6 to 6.]
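
The MCD definition can be illustrated by brute force on a tiny data set. This sketch is an assumption-laden illustration, not the algorithm used in practice: it enumerates every subset of size `h`, which is only feasible for very small `n`, and the function name `mcd_exhaustive` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np


def mcd_exhaustive(X, h):
    """Return (indices, mean, cov) of the h-subset with the smallest
    determinant of its sample covariance. Exhaustive search, tiny n only."""
    best_det, best_idx = np.inf, None
    for idx in combinations(range(len(X)), h):
        det = np.linalg.det(np.cov(X[list(idx)], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    sub = X[list(best_idx)]
    return best_idx, sub.mean(axis=0), np.cov(sub, rowvar=False)


# Six clean points in the unit square plus two distant outliers (rows 6 and 7).
X = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
    [0.3, 0.6], [0.7, 0.2],
    [10.0, 11.0], [-10.0, -9.0],
])
idx, mu_mcd, Sigma_mcd = mcd_exhaustive(X, h=4)  # h = 50% of the data points
```

For this data the selected subset contains none of the outlier rows, so the resulting mean and covariance describe only the clean cluster.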

Weighted representation
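Many robust estimates can be represented as weighted versions of
familiar estimates

\[
\hat\mu = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i},
\qquad
\hat\Sigma = \frac{\sum_{i=1}^n w_i (x_i - \hat\mu)(x_i - \hat\mu)^\top}{\sum_{i=1}^n w_i},
\]

with weights depending on the estimates themselves

\[ w_i = w(\mathrm{MD}(x_i; \hat\mu, \hat\Sigma)), \]

where Mahalanobis Distances are given by

\[ \mathrm{MD}(x_i; \hat\mu, \hat\Sigma) = (x_i - \hat\mu)^\top \hat\Sigma^{-1} (x_i - \hat\mu). \]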
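
Because the weights depend on the estimates and the estimates on the weights, one natural way to compute such an estimate is a fixed-point iteration. The sketch below assumes a hard-rejection weight function $w(d) = 1$ if $d \le$ `cutoff`, else 0, and a coordinatewise median/MAD starting point; both are illustrative choices, not something the slides prescribe, and the function names are hypothetical.

```python
import numpy as np


def mahalanobis(X, mu, Sigma):
    """MD(x_i; mu, Sigma) = (x_i - mu)^T Sigma^{-1} (x_i - mu), one value per row."""
    D = X - mu
    return np.einsum("ij,ij->i", D @ np.linalg.inv(Sigma), D)


def reweighted_estimate(X, cutoff, n_iter=50):
    """Iterate the weighted mean/covariance equations with 0/1 weights."""
    # Robust start (assumption): coordinatewise median and MAD-based scales.
    mu = np.median(X, axis=0)
    scale = 1.4826 * np.median(np.abs(X - mu), axis=0)
    Sigma = np.diag(scale ** 2)
    for _ in range(n_iter):
        w = (mahalanobis(X, mu, Sigma) <= cutoff).astype(float)
        mu = w @ X / w.sum()
        D = X - mu
        Sigma = (w[:, None] * D).T @ D / w.sum()
    return mu, Sigma, w


rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2))
outliers = np.full((20, 2), 20.0)  # 20 contaminated cases
X = np.vstack([clean, outliers])
# 9.21 is roughly the 99% quantile of a chi-square with 2 degrees of freedom.
mu_hat, Sigma_hat, w = reweighted_estimate(X, cutoff=9.21)
```

Here the contaminated cases end up with weight zero, so `mu_hat` and `Sigma_hat` are computed from the clean cases only. A smooth weight function would replace the 0/1 rule without changing the structure of the loop.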

Contaminated cells not cases
[Figure: two panels, "Traditional Contamination" and "Independent Contamination", with $\varepsilon = 10\%$.]

Generalized Contamination

Data entry errors, hardware malfunction, etc.
Can express as

\[ X_j = (1 - B_j)(X_{\text{good}})_j + B_j (X_{\text{horrible}})_j, \quad \text{for } j = 1, \dots, p, \]

or, in matrix form, as

\[ X = (I - B) X_{\text{good}} + B X_{\text{horrible}}, \]

where $B = \operatorname{diag}(B_1, \dots, B_p)$ holds the Bernoulli r.v.’s $B_j$
$B$’s dependence structure is important
Will assume Independent Contamination: all $B_j$ are
independent and independent of the $X$’s.
Also: $P[B_j = 1] = \varepsilon$ for simplicity.
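
Under these assumptions, contaminated data are easy to generate: draw one independent Bernoulli($\varepsilon$) indicator per cell and swap in the "horrible" value wherever it equals 1. This is a minimal sketch; the function name `contaminate_cells`, the normal clean data, and the constant 100 used for horrible values are assumptions for illustration.

```python
import numpy as np


def contaminate_cells(X_good, X_horrible, eps, rng):
    """Apply X_j = (1 - B_j) * (X_good)_j + B_j * (X_horrible)_j cell by cell,
    with all B_j independent Bernoulli(eps) and independent of the data."""
    B = rng.random(X_good.shape) < eps
    X = np.where(B, X_horrible, X_good)
    return X, B


rng = np.random.default_rng(0)
n, p, eps = 2000, 20, 0.05
X_good = rng.normal(size=(n, p))
X_horrible = np.full((n, p), 100.0)
X, B = contaminate_cells(X_good, X_horrible, eps, rng)
frac_cells = B.mean()  # share of contaminated cells, close to eps
```

The indicator matrix `B` is returned alongside the data so that a simulation study can compare detected cells with the truly contaminated ones.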

Number of clean cases

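each case with a contaminated cell will appear as an outlier if diagnosed with MD’s
$P[\text{case is clean}] = (1 - \varepsilon)^p$
e.g. with $\varepsilon = 0.05$ and $p = 20$, only about 36% of cases are clean
waste of data
the share of contaminated cases (about 64%) exceeds the breakdown point of traditional robust estimates.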
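
The formula for the share of clean cases is simple to tabulate. The sketch below uses hypothetical helper names and also computes the largest dimension $p$ for which at least half of the cases are still expected to be clean, i.e. the point beyond which the case-wise contamination passes a breakdown point of 0.5.

```python
import math


def prob_clean_case(eps, p):
    """P[case is clean] = (1 - eps)^p under independent contamination."""
    return (1.0 - eps) ** p


def max_dim_majority_clean(eps):
    """Largest p with (1 - eps)^p >= 0.5."""
    return math.floor(math.log(0.5) / math.log(1.0 - eps))


for p in (2, 5, 10, 20, 50):
    print(p, round(prob_clean_case(0.05, p), 3))

p_max = max_dim_majority_clean(0.05)
```

With a cell contamination rate of 5%, the majority of cases is expected to be contaminated once the dimension exceeds `p_max`.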

Affine-equivariance

Definition: if data set $Y = A + XB$ (each case shifted by $A$ and transformed by a nonsingular matrix $B$), then

\[
\hat\mu(Y) = A + B^\top \hat\mu(X),
\qquad
\hat\Sigma(Y) = B^\top \hat\Sigma(X) B.
\]

Desirable: easy to study etc.
Most “respectable” robust estimates are A-E
Alqallaf et al (2009) have a proof that reasonable A-E
estimates cannot be robust against IC
If we know how an estimate behaves on $X$, then we know how it behaves on $Y$, and vice versa
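
Affine equivariance can be checked numerically for a given estimate. In this sketch the shift vector `a` plays the role of $A$ and is added to every row; the random data and matrix are assumptions for illustration. The sample mean and covariance satisfy the definition, while the coordinatewise median (a simple robust location estimate) does not.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
a = np.array([1.0, -2.0, 0.5])  # shift applied to every case
B = rng.normal(size=(3, 3))     # random matrix, nonsingular with probability one
Y = a + X @ B

# Sample mean and covariance: equivariant.
mean_ok = np.allclose(Y.mean(axis=0), a + B.T @ X.mean(axis=0))
cov_ok = np.allclose(np.cov(Y, rowvar=False), B.T @ np.cov(X, rowvar=False) @ B)

# Coordinatewise median: the same identity generally fails.
median_ok = np.allclose(np.median(Y, axis=0), a + B.T @ np.median(X, axis=0))
```

Estimates that work coordinate by coordinate give up this property, which is the trade-off the result of Alqallaf et al (2009) points to.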

Affine Transformation of Contaminated Data
[Figure: panels "Original", "Contaminated", and "Transformed", illustrating $X \to Y = XB$.]

Pairwise approach

$P[\text{pair of variables is clean}] = (1 - \varepsilon)^2 \gg (1 - \varepsilon)^p$
Estimate all elements $\hat\Sigma_{ab}$, for $a, b = 1, \dots, p$, separately
Problem: multivariate structure is damaged/destroyed
Particular problem: the assembled $\hat\Sigma$ may not be positive-definite.
May or may not be a problem. Usually is.
Studied to some extent by Alqallaf (2003, PhD thesis)
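
The loss of positive-definiteness is easy to reproduce. The sketch below assumes that cells flagged as contaminated have been replaced by NaN and that each $\hat\Sigma_{ab}$ is the ordinary sample covariance over the rows where both cells are present; the helper name `pairwise_cov` and the toy data are hypothetical, and a robust pairwise estimate could be substituted for `np.cov`.

```python
import numpy as np


def pairwise_cov(X):
    """Estimate each Sigma_ab separately from the rows where both a and b are present."""
    p = X.shape[1]
    S = np.empty((p, p))
    for a in range(p):
        for b in range(a, p):
            ok = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            S[a, b] = S[b, a] = np.cov(X[ok, a], X[ok, b])[0, 1]
    return S


nan = np.nan
X = np.array([
    [1, 1, nan], [2, 2, nan], [3, 3, nan],  # variables 1 and 2 move together
    [1, nan, 1], [2, nan, 2], [3, nan, 3],  # variables 1 and 3 move together
    [nan, 1, 3], [nan, 2, 2], [nan, 3, 1],  # variables 2 and 3 move oppositely
])
S = pairwise_cov(X)
min_eig = np.linalg.eigvalsh(S).min()  # negative: S is not positive-definite
```

Each pairwise entry is a perfectly valid covariance on its own, but the three pairwise relationships contradict each other, so the assembled matrix has a negative eigenvalue.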

Detecting cells

Some are obvious: univariate outliers
Some only show up with respect to other cells: structural outliers
Van Aelst et al (2009) use Stahel-Donoho projections
Little and Smith (1987) used partial Mahalanobis distances:

if $\mathrm{MD}(x; \hat\mu, \hat\Sigma)$ is large,
consider $\mathrm{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ for all $j = 1, \dots, p$.

Mike explores MD-approach and iterative estimation of
covariances in his thesis.
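
The partial-distance idea can be sketched as follows, using the same quadratic-form definition of MD as earlier in the talk. The function names and the example values are assumptions for illustration, and the estimates $\hat\mu$, $\hat\Sigma$ are taken as given rather than estimated.

```python
import numpy as np


def md(x, mu, Sigma):
    """MD(x; mu, Sigma) = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))


def partial_mds(x, mu, Sigma):
    """MD(x_{-j}; mu, Sigma) for every j: drop cell j and the matching
    entries of mu and rows/columns of Sigma."""
    p = len(x)
    out = np.empty(p)
    for j in range(p):
        keep = np.arange(p) != j
        out[j] = md(x[keep], mu[keep], Sigma[np.ix_(keep, keep)])
    return out


mu = np.zeros(3)
Sigma = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
x = np.array([2.0, -2.0, 0.0])  # each cell looks fine alone; cells 0 and 1 clash
full = md(x, mu, Sigma)
partial = partial_mds(x, mu, Sigma)
```

In this example no cell is a univariate outlier, yet the full distance is large. Dropping either of the first two cells brings the distance down sharply, while dropping the third changes nothing, which flags the first two cells as a structural outlier without saying which of the two is at fault.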

Weighted estimate with cell weights

Van Aelst et al (2009) proposed a weighted estimate, but it is
pairwise and not SPD (symmetric positive-definite)
Mike knows how to deal with zero weights: remove the values
and treat them as MCAR (missing completely at random). Then do MLE via EM, for example.
A proper cell-weighted estimate is still to be developed.
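
For the zero-weight case, the "remove and treat as missing" step can be sketched with the textbook EM algorithm for a multivariate normal with missing cells. This is a generic illustration under a normality assumption, not the method from the thesis; the function name `em_mvn_missing` is hypothetical, removed cells are marked with NaN, and a fixed number of iterations stands in for a convergence check.

```python
import numpy as np


def em_mvn_missing(X, n_iter=100):
    """ML estimates (mu, Sigma) for normal data whose missing cells are NaN."""
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = np.where(miss, 0.0, X)
        C_sum = np.zeros((p, p))  # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if o.any():
                # E-step: regress the missing cells on the observed ones.
                coef = np.linalg.solve(Sigma[np.ix_(o, o)], Sigma[np.ix_(o, m)])
                X_hat[i, m] = mu[m] + (X[i, o] - mu[o]) @ coef
                C_sum[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, o)] @ coef
            else:
                # Nothing observed in this case: fall back to the marginal.
                X_hat[i, m] = mu[m]
                C_sum[np.ix_(m, m)] += Sigma[np.ix_(m, m)]
        # M-step: moments of the completed data plus the conditional covariances.
        mu = X_hat.mean(axis=0)
        D = X_hat - mu
        Sigma = (D.T @ D + C_sum) / n
    return mu, Sigma
```

A possible use is to set every cell flagged as contaminated to NaN and call `em_mvn_missing` on the result. Unlike the pairwise approach, the covariance estimate is a sum of positive semi-definite terms, so it stays positive semi-definite by construction.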
