# Data cleaning and screening

Cairo University, Faculty of Commerce, Business Administration Department, pre-master class, methodological studies.

### Data cleaning and screening

1. Data screening and cleaning. Mohamed, Hassan Mohamed Hussein. Business Administration Department, Faculty of Commerce, Cairo University, Egypt, 2016.
2. Agenda
    - Importance
    - Data screening steps
    - Data cleaning
    - Missing data
    - Normality
    - Linearity
    - Outliers
    - Multicollinearity
    - Homoscedasticity

   Hassan Mohamed, Cairo University - Statistical Package, 2016
3. Importance: where should you clean your data in the research process?
    - Data cleaning and screening is the step that directly follows data entry; do not start your analysis before doing it.
    - Why data screening matters:
        - It is very easy to make mistakes when entering data.
        - Some errors can mess up your analysis.
        - So it is better to spend time checking for mistakes at the start than to try to repair the damage later. Ask another person to check your data as well.
4. Data screening steps
    1. Check the frequencies table for abnormal (out-of-range) values.
    2. Go back to the original questionnaire and correct them.
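The two screening steps can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure the slides refer to: the `responses` data, the respondent IDs, and the assumed valid range of 1 to 5 are all hypothetical.

```python
from collections import Counter

# Hypothetical answers to one 5-point item, keyed by respondent ID.
responses = {1: 4, 2: 5, 3: 55, 4: 2, 5: 0, 6: 3, 7: 4}
VALID_CODES = range(1, 6)  # assumption: valid answers are 1..5

def frequency_table(values):
    """Return (value, count) pairs sorted by value, like a frequencies table."""
    return sorted(Counter(values).items())

def out_of_range_ids(data, valid_codes):
    """Return the IDs of cases whose recorded value is not a valid code."""
    return [case_id for case_id, value in data.items() if value not in valid_codes]

# Step 1: inspect the frequencies for values that should not exist.
for value, count in frequency_table(responses.values()):
    print(value, count)

# Step 2: list the cases to look up in the original questionnaires.
suspect_ids = out_of_range_ids(responses, VALID_CODES)
print("Check questionnaires for IDs:", suspect_ids)
```

The IDs returned are the questionnaires to pull and re-check; the code does not correct anything by itself.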
5. Data cleaning
    - Data cleaning includes:
        - Missing data
        - Normality
        - Linearity
        - Outliers
        - Multicollinearity
        - Homoscedasticity
6. Missing data
    - If the missing data come from data entry:
        - Detect them from the variable's frequencies table (the number of missing values).
        - Sort the data in ascending or descending order.
        - This gives you the IDs of the cases with missing values.
        - Go back to the source and try to fill them in.
        - Run the descriptive analysis again.
7. Missing data (cont.)
    - If the missing data come from respondent errors:
        - The respondent was ambiguous.
        - The respondent forgot to answer the question.
    - If the missing values are more than 10% of the total values of the affected variable, do not treat the missing data.
8. Missing data (cont.)
    - If the missing values are less than 10%, you can deal with them:
        1. Substitute the neutral value (Malhotra, 2010).
        2. Substitute an imputed value (Hair et al., 2010):
            - Imputation using only valid data:
                - Complete data, that is, exclude cases listwise (least preferable when less than 10% of the data are missing).
                - All available data.
9. Missing data (cont.)
    - Imputation using known replacement values:
        - Case substitution.
        - Hot and cold deck imputation (most similar case, or best known value).
    - Imputation by calculating replacement values ("Replace with…"):
        - Mean substitution.
        - Regression imputation (prediction equation of the valid data).
        - This option should never be used, as it can severely distort the results of your analysis.
10. Missing data (cont.)
    - Or exclude cases pairwise (recommended): a case is excluded only if it is missing the data required for the specific analysis, and it is still included in any other analysis (Pallant, 2011).
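As a rough illustration of the 10% check and of the difference between listwise and pairwise exclusion, here is a minimal Python sketch. The `cases` data set and its variable names are hypothetical, and `None` stands in for a missing answer; a statistical package applies the same logic through its own options.

```python
# Hypothetical data set: None marks a missing answer.
cases = [
    {"id": 1, "age": 25, "income": 3000, "satisfaction": 4},
    {"id": 2, "age": None, "income": 4200, "satisfaction": 5},
    {"id": 3, "age": 31, "income": None, "satisfaction": 3},
    {"id": 4, "age": 40, "income": 5100, "satisfaction": None},
    {"id": 5, "age": 28, "income": 3900, "satisfaction": 4},
]
VARIABLES = ["age", "income", "satisfaction"]

def missing_share(rows, variable):
    """Fraction of cases that have no value for `variable`."""
    return sum(row[variable] is None for row in rows) / len(rows)

def listwise(rows, variables):
    """Keep only the cases that are complete on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise(rows, needed):
    """Keep the cases that are complete on the variables one analysis needs."""
    return [r for r in rows if all(r[v] is not None for v in needed)]

for variable in VARIABLES:
    share = missing_share(cases, variable)
    note = "more than 10% missing" if share > 0.10 else "10% or less missing"
    print(f"{variable}: {share:.0%} ({note})")

complete = listwise(cases, VARIABLES)
age_income = pairwise(cases, ["age", "income"])
print(len(complete), "cases survive listwise exclusion")
print(len(age_income), "cases are usable for an age/income analysis")
```

Pairwise exclusion keeps case 4 for the age/income analysis even though its satisfaction score is missing, which is why it loses fewer cases than listwise exclusion.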
11. Normality
    - The shape of the data distribution for an individual metric variable.
    - Describes a symmetrical, bell-shaped curve with the greatest frequency of scores in the middle and smaller frequencies towards the extremes.
    - It is a requirement for any parametric analysis.
    - Departures from the normal distribution can be neglected if the sample size is more than 50 respondents.
12. Normality (cont.)
    - Normality measures:
        - Kurtosis:
            - Peakedness (leptokurtic) or flatness (platykurtic) of the distribution compared with the normal distribution.
            - In a normal distribution the (excess) kurtosis value is zero (values within ±10 are allowed).
        - Skewness:
            - The balance of the distribution.
            - Positive skew (right-skewed, with the longer tail to the right) or negative skew (left-skewed, with the longer tail to the left).
            - In a normal distribution the skewness value is zero (values within ±3 are allowed).
13. Normality (cont.)
    - Compare the 5% trimmed mean with the mean.
    - Kolmogorov-Smirnov and Shapiro-Wilk significance values above 0.05 indicate normality, but these tests are very sensitive when the sample size is more than 200.
    - The histogram should form a bell shape.
    - Transformation can fix a non-normal distribution.
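To make the skewness and kurtosis measures concrete, the following Python sketch computes simple moment-based estimates and checks them against the ±3 and ±10 limits quoted in the slides. The two score lists are hypothetical, and statistical packages typically apply small-sample corrections, so their reported values can differ slightly from these.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])

def skewness(values):
    """Moment-based skewness; zero for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def within_limits(values, skew_limit=3.0, kurt_limit=10.0):
    """Apply the limits from the slides: skewness ±3, kurtosis ±10."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)

# Hypothetical samples: one symmetric, one with a long tail to the right.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_tailed = [1, 1, 2, 2, 3, 3, 4, 20]

for name, scores in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    print(name,
          "skewness:", round(skewness(scores), 2),
          "kurtosis:", round(excess_kurtosis(scores), 2),
          "within limits:", within_limits(scores))
```

The sample with the long right tail yields a positive skewness, and mirroring it (negating every score) flips the sign.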
14. Linearity
    - An assumption of multivariate techniques based on correlational measures of association, including multiple regression (Hair et al., 2010).
    - The relationship between the two variables should be linear: a scatterplot of the scores should show roughly a straight line, not a curve (curvilinear) (Pallant, 2011).
    - Transformation can overcome a curvilinear relationship (Hair et al., 2010).
15. Linearity (cont.)
    - So you should not transform your data to correct a non-normal distribution if your sample is larger than 50.
    - But you should transform the data to correct curvilinearity.
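The slides recommend judging linearity from a scatterplot. As a numeric complement, the sketch below compares the Pearson correlation before and after a log transformation. The data are hypothetical and deliberately exponential, so the log transform is the right choice here only by construction; real data may call for a different transformation.

```python
import math
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation, a measure of linear association between xs and ys."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical curvilinear relationship: y grows exponentially with x.
x = list(range(1, 11))
y = [math.exp(0.5 * value) for value in x]

r_raw = pearson_r(x, y)                          # curve: weaker linear association
r_log = pearson_r(x, [math.log(v) for v in y])   # after log transform: a straight line

print("r before transformation:", round(r_raw, 3))
print("r after log transformation:", round(r_log, 3))
```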
16. Outliers
    - Cases with extreme scores, which therefore have a much greater impact on the outcome of any statistical analysis.
    - An outlier is not necessarily an error in your data, but it can make the data unrepresentative of the population (for example, income).
    - Can be detected using box plots.
    - Outliers come from (Hair et al., 2010; Tabachnick & Fidell, 1996):
        - A mistake in data entry (a 6 entered as 66, etc.).
        - A missing-values code that was not specified, so missing values are read as case entries (99 in SPSS).
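A box plot flags points that lie far outside the middle half of the data. The sketch below applies the common 1.5 × IQR rule to hypothetical scores that include the "6 entered as 66" mistake. The 1.5 multiplier and the quartile method of Python's `statistics.quantiles` are assumptions; a statistical package may compute its box plot hinges slightly differently.

```python
from statistics import quantiles

def boxplot_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the box plot fences."""
    lower, upper = boxplot_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical scores; the last one is a 6 that was typed as 66.
scores = [2, 3, 3, 4, 4, 5, 5, 6, 66]

print("fences:", boxplot_fences(scores))
print("outliers:", find_outliers(scores))
```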
17. Outliers (cont.)
    - Outliers come from (cont.) (Hair et al., 2010; Tabachnick & Fidell, 1996):
        - The outlier is not part of the population from which you intended to sample:
            - Extraordinary event: remove it.
            - Extraordinary observation: decide based on your valid cases (leaning towards elimination).
            - Neutral value for all variables (leaning towards retention).
18. Outliers (cont.)
    - The outlier is part of the population you wanted, but in the distribution it appears as an extreme case.
    - In this case you have three choices:
        1. Delete the extreme cases.
        2. Change the outliers' scores so that they are still extreme but fit within a normal distribution (for example, make each one unit larger or smaller than the last case that fits in the distribution).
        3. If the outliers seem to be part of an overall non-normal distribution, then a transformation can be done, but first check for normality.
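The second choice, changing an outlier's score so that it stays extreme but sits next to the rest of the distribution, could look like the following sketch. The function name `pull_in_outliers`, the data, and the lower and upper cut-offs (taken here as given, for example read off a box plot) are all hypothetical.

```python
def pull_in_outliers(values, lower, upper, step=1):
    """Replace each outlier with a score one `step` beyond the most extreme
    case that still lies within [lower, upper]."""
    inside = [v for v in values if lower <= v <= upper]
    low_edge, high_edge = min(inside), max(inside)
    adjusted = []
    for v in values:
        if v > upper:
            adjusted.append(high_edge + step)   # still the highest score
        elif v < lower:
            adjusted.append(low_edge - step)    # still the lowest score
        else:
            adjusted.append(v)
    return adjusted

# Hypothetical scores with one extreme case, and assumed cut-offs.
scores = [2, 3, 3, 4, 4, 5, 5, 6, 66]
adjusted_scores = pull_in_outliers(scores, lower=-0.75, upper=9.25)
print(adjusted_scores)
```

The adjusted case remains the largest score in the sample, so its rank is preserved while its pull on the mean is reduced.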
19. Outliers (cont.)
    - Outliers should be retained to ensure generalizability to the population, unless they are not representative of the population.
    - So, again, you should not transform your data to correct a non-normal distribution if your sample is larger than 50.
    - But you should transform the data to deal with outliers.
20. Thank you