
Outlier detection


Data analysis and mining is an important step toward two research objectives. First, it helps to identify whether a research topic is significant or has the potential for exemplary research. Second, it helps to find and explain the research outcome for a specific research problem. Among the methods of data analysis, outlier detection identifies abnormal or extreme values within a data set and removes them so that a regular trend can be detected more easily from the collected data. The regular trend helps to map the goal of the research to its outcome.


  1. Outlier Detection
     Hydro-informatics Engg. / Optimization Technique
     A method for detecting abnormal, extreme, or uncommon data points in a set of data
     By Dr. Mrinmoy Majumder
  2. Overview
     • Significance of Outlier Detection
     • Importance of Size of Sample Data
     • Importance of Distribution of Data Set
     • Chauvenet Method
     • Dixon Thompson Method
  3. Significance of Outlier Detection
     • Various methods are available
     • Selection of method based on sample size
     • Selection of method based on underlying distribution
     • Detection of extreme data that are:
       • not from the sample population
       • either above or below the mean value
  4. Importance of Size of Sample Data

     Name of Method     Size of Sample Data
     Rosner Method      > 25
     Dixon Thompson     < 25
     Chauvenet Method   Any
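
The sample-size guidance in slide 4 can be sketched as a small lookup. This is only an illustration: the function name `candidate_methods` is hypothetical, and because the slide does not say what to do at exactly 25 data points, the sketch assumes that only the Chauvenet method (listed for any size) applies there.

```python
def candidate_methods(sample_size: int) -> list[str]:
    """Return the outlier tests that slide 4 lists for a given sample size."""
    if sample_size < 1:
        raise ValueError("sample_size must be a positive integer")

    methods = ["Chauvenet"]  # listed as usable for any sample size
    if sample_size > 25:
        methods.append("Rosner")
    elif sample_size < 25:
        methods.append("Dixon-Thompson")
    # Assumption: a sample size of exactly 25 is not covered by the slide,
    # so only the Chauvenet method is returned in that case.
    return methods


print(candidate_methods(10))
print(candidate_methods(40))
```

The sample size alone does not settle the choice; the underlying distribution of the data also has to match the test.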
  5. Importance of Distribution of Sample Data
     • The sample data and the critical deviate must come from the same distribution.
     • The underlying distribution determines the most appropriate method or test.
     • An error in method selection may cause some outliers to be missed.
  6. Chauvenet’s Method
     Z = (X – Mean of X) / (Standard Deviation of X)
     • The critical value is selected from the table of the distribution of the sample data.
     • To find the critical value, a probability based on the size of the data is first calculated as 1 / (2n).
     • The critical value is then derived from that probability.
     • If the absolute value of Z exceeds the critical value, the data point is an outlier.
     • The weakness of this method is that it can detect only one outlier.
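
A minimal Python sketch of the Chauvenet check described in slide 6, assuming normally distributed data. Several choices here are assumptions rather than part of the slides: the function name `chauvenet_outlier`, the use of the sample standard deviation, and the two-sided reading of the probability, in which 1/(2n) is split evenly between both tails so that the critical value is the normal quantile at 1 − 1/(4n). In line with the slide, only the single most extreme point is tested.

```python
from statistics import NormalDist, mean, stdev


def chauvenet_outlier(data):
    """Return the most extreme value if it fails Chauvenet's check, else None."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three data points")

    centre = mean(data)
    spread = stdev(data)  # assumption: sample standard deviation
    if spread == 0:
        return None  # all values identical, nothing can be an outlier

    # Assumption: 1/(2n) is the total two-tailed probability, so each tail
    # holds 1/(4n) and the critical value is the quantile at 1 - 1/(4n).
    z_critical = NormalDist().inv_cdf(1 - 1 / (4 * n))

    # |Z| for every point; only the largest one is compared with the critical value.
    z_scores = [abs(x - centre) / spread for x in data]
    worst = max(range(n), key=z_scores.__getitem__)
    return data[worst] if z_scores[worst] > z_critical else None


readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 15.0]
print(chauvenet_outlier(readings))
```

To look for further outliers, one option is to remove the flagged point and run the check again on the remaining data, although repeated application changes the statistical meaning of the test.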
  7. Normal Distribution Table for Finding the Critical Value
     After 1 / (2n) is computed, the probability is located in the body of the normal distribution table, and the z-values of its row and column are added to give the critical value.
     Example: suppose the probability to look up is 0.8508 (a value chosen only to illustrate reading the table, since 1 / (2n) itself cannot exceed 0.5). The matching row is z = 1.0 and the matching column is z = 0.04, so the critical value of Z is 1.0 + 0.04 = 1.04.
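
The table lookup in slide 7 can be reproduced without a printed table. The sketch below assumes the table is a cumulative standard normal table and uses `statistics.NormalDist` from the Python standard library; the helper name `z_from_cumulative` is hypothetical.

```python
from statistics import NormalDist


def z_from_cumulative(probability):
    """Return the z-value whose cumulative standard normal probability is given."""
    return NormalDist().inv_cdf(probability)


# Same lookup as the slide: a table entry of 0.8508 corresponds to
# row 1.0 plus column 0.04.
z = round(z_from_cumulative(0.8508), 2)
print(z)  # 1.04
```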
  8. Dixon Thompson Method
     1. First, all the data are sorted in ascending order.
     2. The smallest value is ranked first and the largest is ranked last.
     3. Then the test statistic and the critical value of R are determined based on the sample size.
     • This method is used when the direction of the test is predetermined.
     • If the task is to find an outlier at the lowest value, this test is preferred to Chauvenet’s Method.
     • It can detect only one outlier.
     • It is used for one- or two-tailed tests.
  9. Test Statistic and Critical Value
     m is the sample size.
     After R is computed, it is compared with the critical value found in the table for the chosen level of significance. For example, at a 5% level of significance and a sample size of 10 data points, the critical value is 0.472.
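
The Dixon Thompson steps can be sketched in Python for the 10-point case mentioned in slide 9. The formula for R did not survive in this transcript, so the ratio used below is an assumption: it is the commonly tabulated Dixon ratio for a suspected high outlier with 8 to 10 data points, R = (x_n − x_(n−1)) / (x_n − x_2) on the sorted data. The function name `dixon_ratio_high` and the sample values are hypothetical; the critical value 0.472 is the one quoted in the slide.

```python
def dixon_ratio_high(data):
    """Dixon-type ratio R for testing the largest value (8 to 10 points only)."""
    x = sorted(data)  # steps 1 and 2: ascending order, smallest ranked first
    n = len(x)
    if not 8 <= n <= 10:
        raise ValueError("this sketch only covers sample sizes 8 to 10")

    # Assumed ratio: gap between the two largest values, divided by the
    # range from the second-smallest value to the largest value.
    denominator = x[-1] - x[1]
    if denominator == 0:
        raise ValueError("ratio undefined: largest and second-smallest values are equal")
    return (x[-1] - x[-2]) / denominator


R_CRITICAL = 0.472  # from the slide: 5% significance, 10 data points

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]  # made-up illustration data
r = dixon_ratio_high(sample)
is_outlier = r > R_CRITICAL
print(round(r, 3), is_outlier)
```

Other sample sizes use different ratios and critical values, so both should be taken from a published table for the sample size at hand.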
  10. Many other outlier methods are available:
     • Z-Score or Extreme Value Analysis (parametric)
     • Probabilistic and Statistical Modeling (parametric)
     • Linear Regression Models (PCA, LMS)
     • Proximity Based Models (non-parametric)
     • Information Theory Models
     View the link for more details: techniques-1e0b2c19e561
     Thank You