Robustness under Independent Contamination

                Mike Danilov


              November 21, 2009




                                             1 / 17
Traditional robustness
   Definition of contamination
   Simple examples
   Weighted representation


Independent Contamination
   The Idea
   Why traditional robust estimates don’t work
   Naive approaches
   Cell-weighting approach




                                                 2 / 17
The Problem (aka Disclaimer) and Terminology


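      Estimation of mean vector $\mu$ and covariance matrix $\Sigma$ of
      supposedly i.i.d. multivariate sample: $x_1, \dots, x_n \in \mathbb{R}^p$.
      Data matrix
      \[
      X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
        = \begin{pmatrix}
            x_{11} & x_{12} & \cdots & x_{1p} \\
            x_{21} & x_{22} & \cdots & x_{2p} \\
            \vdots & \vdots & \ddots & \vdots \\
            x_{n1} & x_{n2} & \cdots & x_{np}
          \end{pmatrix}
      \]
      Vectors $x_i \in \mathbb{R}^p$: data cases
      Values $x_{ij} \in \mathbb{R}$: data values or cells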




                                                                      3 / 17
Types of error in Statistics
     1. Usual statistical error.
        Every observation is moderately affected

        \[ X_{\mathrm{obs}} = X_{\mathrm{mean}} + e, \quad \text{with } e \sim N(0, \sigma^2), \]
       where variance of e defines the quality of the data.



     2. Contamination.
        Some observations are ruined:

        \[ X_{\mathrm{obs}} = \begin{cases} X_{\mathrm{good}}, & \text{usually} \\ X_{\mathrm{horrible}}, & \text{sometimes.} \end{cases} \]

       Typically comes on top of the usual error:

        \[ X_{\mathrm{good}} = X_{\mathrm{mean}} + e. \]
                                                             4 / 17
Mixture contamination model
      Observed data come from the mixture distribution
      \[ F = (1 - \varepsilon) F_0(\theta) + \varepsilon H \]
          $F_0(\theta)$ is the distribution of interest
          $H$ is an arbitrary unknown nuisance distribution.
      Equivalently
      \[ X = (1 - B) X_{\mathrm{good}} + B X_{\mathrm{horrible}}, \]
      where $B$ is a Bernoulli($\varepsilon$) indicator.
      Estimate $T(F)$: feed data from $F$, obtain estimates for $\theta$.
          Breakdown point
          \[ \varepsilon_{BP}(T) = \sup \{ \varepsilon : \sup_H \| T(F(\theta, \varepsilon, H)) \| < \infty \}, \]
          that is, the maximum $\varepsilon$ such that $T$ can still isolate $F_0$ from $H$.
          Maximum achievable (and desirable)
          \[ \varepsilon_{BP}(T) \le 0.5. \]
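
A minimal simulation sketch of the mixture model above, assuming normal choices for both $F_0$ and $H$ (the slides leave $H$ arbitrary). The function name `sample_mixture` and all parameter values are illustrative assumptions. It contrasts an estimate with breakdown point 0 (the sample mean) with one that tolerates 10% contamination (the median).

```python
import random
import statistics

def sample_mixture(n, eps, rng, good=(0.0, 1.0), horrible=(100.0, 1.0)):
    """Draw n values from (1 - eps) * F0 + eps * H; F0 and H are normal here (an assumption)."""
    out = []
    for _ in range(n):
        b = 1 if rng.random() < eps else 0        # B ~ Bernoulli(eps)
        x_good = rng.gauss(*good)                 # draw from F0
        x_horrible = rng.gauss(*horrible)         # draw from H
        out.append((1 - b) * x_good + b * x_horrible)
    return out

rng = random.Random(0)
x = sample_mixture(2000, 0.10, rng)
sample_mean = statistics.fmean(x)      # dragged towards H
sample_median = statistics.median(x)   # stays close to the centre of F0
print(round(sample_mean, 2), round(sample_median, 2))
```

Moving the centre of `horrible` further away moves the sample mean without bound, while the median barely changes as long as `eps` stays below 0.5.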
                                                                           5 / 17
Examples: simple robust estimates


      Location
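          Median: $x_{(n/2)}$
          Trimmed mean: $\frac{1}{n(1-\delta)} \sum_{i=n\delta/2}^{n(1-\delta/2)} x_{(i)}$, with $\delta \in (0, 1)$.

      Scale
          MAD: $\operatorname{Median}_i |x_i - \operatorname{Median}_j x_j|$
          IQR: $x_{(3n/4)} - x_{(n/4)}$
      Regression
          LMS: $\arg\min_\beta \operatorname{Median}_i (y_i - \beta^\top x_i)^2$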
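
The location and scale estimates listed above can be sketched in a few lines of standard-library Python. The helper names `trimmed_mean`, `mad` and `iqr` and the toy data are assumptions made for illustration; the quartile convention is the default of `statistics.quantiles`, which is only one of several in use.

```python
import statistics

def trimmed_mean(x, delta):
    """Mean after dropping a fraction delta/2 of the points from each tail."""
    xs = sorted(x)
    k = int(len(xs) * delta / 2)          # number of points trimmed per tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    m = statistics.median(x)
    return statistics.median(abs(v - m) for v in x)

def iqr(x):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return q3 - q1

# Eight ordinary readings and one ruined value
data = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 1000.0]
print(statistics.fmean(data), statistics.median(data), trimmed_mean(data, 0.25))
print(statistics.stdev(data), mad(data), iqr(data))
```

The single ruined value dominates the mean and the standard deviation, while the median, trimmed mean, MAD and IQR stay on the scale of the ordinary readings.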




                                                                           6 / 17
Examples: multivariate robust estimates
   Minimum Covariance Determinant (MCD) by Rousseeuw (1985):
   minimize determinant of sample covariance of 50% of data points:
   [Figure: plot with elements labelled "Sample Covariance", "MCD" and "Clean"; axis ticks from −6 to 6]
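
To make the MCD idea concrete, here is a brute-force sketch for a tiny two-dimensional data set. It tries every subset of size h, which is only feasible for toy sizes and is not how MCD is computed in practice. The function names, the toy points and h = 7 out of 12 are all assumptions for illustration.

```python
from itertools import combinations

def mean_cov(P):
    """Sample mean and 2x2 sample covariance of a collection of 2-D points."""
    n = len(P)
    mx = sum(p[0] for p in P) / n
    my = sum(p[1] for p in P) / n
    sxx = sum((p[0] - mx) ** 2 for p in P) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in P) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in P) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def det2(S):
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

def mcd_exhaustive(P, h):
    """Keep the h-subset whose sample covariance has the smallest determinant."""
    best = min(combinations(P, h), key=lambda sub: det2(mean_cov(sub)[1]))
    return mean_cov(best)

clean = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.3), (0.3, 0.2),
         (-0.2, -0.2), (0.1, 0.4), (-0.3, 0.1), (0.4, -0.3)]
outliers = [(8.0, 9.0), (9.0, 8.5), (8.5, 9.5), (9.5, 9.0)]
points = clean + outliers

naive_mu, naive_S = mean_cov(points)          # classical estimates, pulled by outliers
mcd_mu, mcd_S = mcd_exhaustive(points, h=7)   # slightly more than half of the data
print(naive_mu, mcd_mu)
```

The classical mean lands between the two groups, while the subset selected by the determinant criterion consists of clean points only, so its mean stays inside the clean cluster.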




                                                                      7 / 17
Weighted representation
   Many robust estimates can be represented as weighted versions of
   familiar estimates
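   \[ \hat\mu = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}, \qquad
      \hat\Sigma = \frac{\sum_{i=1}^n w_i (x_i - \hat\mu)(x_i - \hat\mu)^\top}{\sum_{i=1}^n w_i}, \]

   with weights depending on the estimates themselves

   \[ w_i = w(\mathrm{MD}(x_i; \hat\mu, \hat\Sigma)), \]

   where Mahalanobis Distances are given by

   \[ \mathrm{MD}(x_i; \hat\mu, \hat\Sigma) = (x_i - \hat\mu)^\top \hat\Sigma^{-1} (x_i - \hat\mu). \]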
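
Because the weights depend on the estimates and the estimates on the weights, one natural way to compute them is by iteration. The sketch below does this in two dimensions under several assumptions that are not in the slides: a hard-rejection weight function (weight 1 below a cutoff of 5.99, else 0), a start from coordinatewise medians and scaled MADs, and a fixed number of iterations.

```python
import random
import statistics

def md2(x, mu, S):
    """Quadratic form (x - mu)' S^{-1} (x - mu) for a 2-vector and a 2x2 matrix."""
    (a, b), (c, d) = S
    det = a * d - b * c
    u, v = x[0] - mu[0], x[1] - mu[1]
    return (d * u * u - (b + c) * u * v + a * v * v) / det

def weighted_estimates(X, w):
    """Weighted mean and weighted covariance as in the formulas above."""
    sw = sum(w)
    mu = [sum(wi * x[j] for wi, x in zip(w, X)) / sw for j in range(2)]
    S = [[sum(wi * (x[r] - mu[r]) * (x[c] - mu[c]) for wi, x in zip(w, X)) / sw
          for c in range(2)] for r in range(2)]
    return mu, S

def reweighted(X, cutoff=5.99, iters=20):
    # Starting values (an assumption): coordinatewise medians, diagonal scatter from MADs
    mu = [statistics.median(x[j] for x in X) for j in range(2)]
    s = [1.4826 * statistics.median(abs(x[j] - mu[j]) for x in X) for j in range(2)]
    S = [[s[0] ** 2, 0.0], [0.0, s[1] ** 2]]
    for _ in range(iters):
        w = [1.0 if md2(x, mu, S) <= cutoff else 0.0 for x in X]   # w_i = w(MD_i)
        mu, S = weighted_estimates(X, w)
    return mu, S, w

rng = random.Random(2)
clean = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
outliers = [[10 + rng.gauss(0, 0.5), 10 + rng.gauss(0, 0.5)] for _ in range(20)]
X = clean + outliers

naive_mu, _ = weighted_estimates(X, [1.0] * len(X))   # all weights equal: sample mean
mu, S, w = reweighted(X)
print(naive_mu, mu, sum(w))
```

Hard rejection also removes some clean points in the tails, so the resulting covariance is somewhat too small unless a consistency correction is applied; that correction is left out of this sketch.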

                                                                      8 / 17
Contaminated cells not cases
  [Figure: two panels, "Traditional Contamination" and "Independent Contamination", each with ε = 10%]




                                                                9 / 17
Generalized Contamination

      Data entry errors, hardware malfunction, etc.
      Can express as

      \[ X_j = (1 - B_j)(X_{\mathrm{Good}})_j + B_j (X_{\mathrm{Horrible}})_j, \quad \text{for } j = 1, \dots, p, \]

      or, in matrix form, as

      \[ X = (1 - B) X_{\mathrm{Good}} + B X_{\mathrm{Horrible}}, \]

      where $B = (B_1, \dots, B_p)$ is a vector of Bernoulli r.v.'s and the products are taken componentwise
      B's dependence structure is important
      Will assume Independent Contamination: all $B_j$ are
      independent and independent of X's.
      Also: $P[B_j = 1] = \varepsilon$ for simplicity.
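
A short simulation sketch of the independent contamination model: every cell gets its own Bernoulli(ε) indicator, drawn independently of the data. The function name `contaminate_cells`, the normal clean data and the choice of "horrible" values around 50 are assumptions for illustration.

```python
import random

def contaminate_cells(X_good, eps, rng, horrible=lambda rng: rng.gauss(50.0, 1.0)):
    """Replace each cell independently with probability eps; returns data and indicators."""
    X, B = [], []
    for row in X_good:
        b_row = [1 if rng.random() < eps else 0 for _ in row]   # B_j ~ Bernoulli(eps)
        X.append([(1 - b) * g + b * horrible(rng) for b, g in zip(b_row, row)])
        B.append(b_row)
    return X, B

rng = random.Random(1)
n, p, eps = 1000, 20, 0.05
X_good = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
X, B = contaminate_cells(X_good, eps, rng)

cell_rate = sum(map(sum, B)) / (n * p)   # fraction of contaminated cells
print(cell_rate)
```

The fraction of contaminated cells is close to ε, and cells with indicator 0 are passed through unchanged.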


                                                                            10 / 17
Number of clean cases




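      each case with a contaminated cell will appear as an outlier if diagnosed with MD's
      $P[\text{case is clean}] = (1 - \varepsilon)^p$
      e.g. with $\varepsilon = 0.05$ and $p = 20$, only about 36% of cases are clean
      waste of data
      the contaminated fraction (about 64%) exceeds the breakdown point of traditional robust estimates.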
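
The probability formula above is easy to tabulate. The sketch below evaluates $(1 - \varepsilon)^p$ for a few dimensions and finds the largest $p$ for which at least half of the cases are still expected to be clean; the function names are illustrative assumptions. For $\varepsilon = 0.05$ and $p = 20$ the formula gives about 0.36.

```python
import math

def clean_case_probability(eps, p):
    """P[a case has no contaminated cell] under independent cell contamination."""
    return (1.0 - eps) ** p

def max_dimension_for_majority(eps):
    """Largest p with (1 - eps)^p >= 0.5, i.e. clean cases still in the majority."""
    return math.floor(math.log(0.5) / math.log(1.0 - eps))

for p in (2, 5, 10, 20, 50):
    print(p, round(clean_case_probability(0.05, p), 3))
print(max_dimension_for_majority(0.05))
```

Beyond that dimension the expected share of cases with at least one bad cell passes 50%, the maximum breakdown point quoted earlier.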




                                                                 11 / 17
Affine-equivariance


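      Definition: if data set $Y = A + XB$ (every row of $A$ equal to the same shift $a$, $B$ nonsingular), then

      \[ \hat\mu(Y) = a + B^\top \hat\mu(X), \qquad \hat\Sigma(Y) = B^\top \hat\Sigma(X) B. \]

      Desirable: easy to study etc.
      Most “respectable” robust estimates are A-E
      Alqallaf et al (2009) have a proof that reasonable A-E
      estimates cannot be robust against Independent Contamination (IC)
      If we know how an A-E estimate behaves on X, then we know how it behaves on Y, and vice versa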
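
The definition can be checked numerically. In this sketch the sample mean satisfies the location part of affine-equivariance up to rounding, while the coordinatewise median (robust, but not A-E) does not when skewed data are rotated. The lognormal data, the 45 degree rotation and the helper names are assumptions for illustration.

```python
import math
import random
import statistics

def transform(X, a, Bm):
    """Row-wise affine map y = a + B' x, i.e. Y = A + X B with every row of A equal to a."""
    return [[a[j] + sum(x[i] * Bm[i][j] for i in range(2)) for j in range(2)] for x in X]

def mean_vec(X):
    return [statistics.fmean([x[j] for x in X]) for j in range(2)]

def coord_median(X):
    return [statistics.median([x[j] for x in X]) for j in range(2)]

rng = random.Random(3)
# Skewed (lognormal) coordinates, so that medians react to a rotation
X = [[math.exp(rng.gauss(0, 1)), math.exp(rng.gauss(0, 1))] for _ in range(2001)]
c = s = math.sqrt(0.5)                        # 45 degree rotation
a, Bm = [1.0, -2.0], [[c, -s], [s, c]]
Y = transform(X, a, Bm)

# Compare "estimate of transformed data" with "transformed estimate"
mean_gap = max(abs(u - v) for u, v in
               zip(mean_vec(Y), transform([mean_vec(X)], a, Bm)[0]))
median_gap = max(abs(u - v) for u, v in
                 zip(coord_median(Y), transform([coord_median(X)], a, Bm)[0]))
print(mean_gap, median_gap)
```

The gap for the mean is at rounding level; the gap for the coordinatewise median is not, which is one way to see that coordinatewise procedures sit outside the A-E class.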




                                                                      12 / 17
Affine Transformation of Contaminated Data
   [Figure: two panels, "Original Contaminated" and "Transformed", linked by the map X → Y = XB]




                                                          13 / 17
Pairwise approach




      $P[\text{pair of variables is clean}] = (1 - \varepsilon)^2 \gg (1 - \varepsilon)^p$
      Estimate all elements $\hat\Sigma_{ab}$, for $a, b = 1, \dots, p$, separately
      Problem: multivariate structure is damaged/destroyed
      Particular problem: may not be positive-definite.
      May or may not be a problem. Usually is.
      Studied to some extent by Alqallaf (2003, PhD thesis)
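
The positive-definiteness problem can be seen on a deliberately artificial example. Below, each pairwise correlation is computed only from the cases where both cells were kept, with discarded cells marked as `None`. Because different pairs use different cases, the assembled matrix need not be a valid correlation matrix. The data and helper names are assumptions chosen to make the effect extreme.

```python
import math

def pairwise_corr(X, a, b):
    """Correlation of columns a and b over cases where both cells are kept (not None)."""
    pairs = [(row[a], row[b]) for row in X if row[a] is not None and row[b] is not None]
    ma = sum(u for u, _ in pairs) / len(pairs)
    mb = sum(v for _, v in pairs) / len(pairs)
    sab = sum((u - ma) * (v - mb) for u, v in pairs)
    saa = sum((u - ma) ** 2 for u, _ in pairs)
    sbb = sum((v - mb) ** 2 for _, v in pairs)
    return sab / math.sqrt(saa * sbb)

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

# None marks a cell that was flagged and removed
X = [
    [1.0, 1.0, None], [2.0, 2.0, None], [3.0, 3.0, None],   # inform the pair (0, 1)
    [1.0, None, 1.0], [2.0, None, 2.0], [3.0, None, 3.0],   # inform the pair (0, 2)
    [None, 1.0, 3.0], [None, 2.0, 2.0], [None, 3.0, 1.0],   # inform the pair (1, 2)
]
R = [[1.0 if a == b else pairwise_corr(X, a, b) for b in range(3)] for a in range(3)]
print(R, det3(R))
```

Each entry of `R` is a legitimate correlation on its own, yet the determinant is negative, so `R` is not positive semi-definite and cannot be inverted into meaningful Mahalanobis distances.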




                                                                       14 / 17
Detecting cells


       Some are obvious: univariate outliers
       Some only show up with respect to other cells: structural
       outliers
       Van Aelst et al (2009) use Stahel-Donoho projections
       Little and Smith (1987) used partial Mahalanobis distances:

                 if $\mathrm{MD}(x; \hat\mu, \hat\Sigma)$ is large,
                 consider $\mathrm{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ for all $j = 1, \dots, p$.

       Mike explores MD-approach and iterative estimation of
       covariances in his thesis.
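
A sketch of the partial Mahalanobis distance idea for one case, assuming $\hat\mu$ and $\hat\Sigma$ are already available (here they are simply fixed by hand). The cell whose removal reduces the distance the most is the natural suspect. The helper names, the equicorrelated $\hat\Sigma$ and the example case are assumptions; this illustrates the idea and is not Little and Smith's full procedure.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        z[k] = (M[k][n] - sum(M[k][c] * z[c] for c in range(k + 1, n))) / M[k][k]
    return z

def md2(x, mu, S):
    """Quadratic form (x - mu)' S^{-1} (x - mu)."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(di * zi for di, zi in zip(d, solve(S, d)))

def partial_md2(x, mu, S, j):
    """Same quadratic form with coordinate j removed from x, mu and S."""
    keep = [i for i in range(len(x)) if i != j]
    return md2([x[i] for i in keep], [mu[i] for i in keep],
               [[S[r][c] for c in keep] for r in keep])

mu = [0.0, 0.0, 0.0]
S = [[1.0, 0.8, 0.8], [0.8, 1.0, 0.8], [0.8, 0.8, 1.0]]
x = [1.0, 1.0, -1.0]     # no univariate outlier, but the third cell breaks the pattern

full = md2(x, mu, S)
partial = [partial_md2(x, mu, S, j) for j in range(3)]
suspect = min(range(3), key=lambda j: partial[j])
print(full, partial, suspect)
```

Every coordinate of `x` is within one standard deviation of its mean, so this is a structural outlier: the full distance is large, and it collapses only when the third cell is left out.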




                                                                     15 / 17
Weighted estimate with cell weights




      Van Aelst et al (2009) proposed a weighted estimate, but it is
      pairwise and not symmetric positive-definite (SPD)
      Mike knows how to deal with zero weights: remove the values
      and treat them as missing completely at random (MCAR). Then do MLE via EM, for example.
      Proper cell-weighted estimate is still to be developed.




                                                                       16 / 17
The End


          17 / 17
