SlideShare una empresa de Scribd logo
1 de 36
Descargar para leer sin conexión
Data Analysis Course
Preparing data for analysis
Venkat Reddy
Contents
• Need of data exploration
• Data Exploration
• Data Validation
• Data Sanitization
• Missing Value Treatment
• Outlier Treatment Identification & Treatment
DataAnalysisCourse
VenkatReddy
2
Main steps in statistical data analysis
DataAnalysisCourse
VenkatReddy
3
Remember…
Data in the real world is dirty
Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data. e.g., occupation=“ ”
• Incomplete data may come from
• “Not applicable” data value when collected.
• Different considerations between the time when the data was collected and when it
is analyzed.
• Human/hardware/software problems
Noisy: Containing errors or outliers. e.g., Salary=“-10”
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”, Was rating “1,2,3”, now rating “A, B,
C”,e.g., discrepancy between duplicate records
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
DataAnalysisCourse
VenkatReddy
4
Missing values and Outliers
• How to identify the missing values?
• What are outliers?
• Sometimes outlier finding it self is the aim of the analysis
DataAnalysisCourse
VenkatReddy
5
Data & Objective
• Loans data: Historical data are provided on 250,000 borrowers.
• The objective is to build a model that borrowers can use to help make the best
financial decisions.
DataAnalysisCourse
VenkatReddy
6
Variable Name Description Type
SeriousDlqin2yrs Person experienced 90 days past due delinquency or worse Y/N
RevolvingUtilizationOfUnsecure
dLines
Total balance on credit cards and personal lines of credit except real
estate and no installment debt like car loans divided by the sum of
credit limits
percentage
age Age of borrower in years integer
NumberOfTime30-
59DaysPastDueNotWorse
Number of times borrower has been 30-59 days past due but no
worse in the last 2 years.
integer
DebtRatio
Monthly debt payments, alimony,living costs divided by monthy
gross income
percentage
MonthlyIncome Monthly income real
NumberOfOpenCreditLinesAndL
oans
Number of Open loans (installment like car loan or mortgage) and
Lines of credit (e.g. credit cards)
integer
NumberOfTimes90DaysLate Number of times borrower has been 90 days or more past due. integer
NumberRealEstateLoansOrLines
Number of mortgage and real estate loans including home equity
lines of credit
integer
NumberOfTime60-
89DaysPastDueNotWorse
Number of times borrower has been 60-89 days past due but no
worse in the last 2 years.
integer
NumberOfDependents
Number of dependents in family excluding themselves (spouse,
children etc.)
integer
Basic contents of the data
• What are total number of observations
• What are total number of fields
• Each field name, Field type, Length of field
• Format of field, Label
Basic Contents –Check points
• Are all variables as expected (variables names)
• Are there some variables which are unexpected say q9 r10?
• Are the data types and length across variables correct
• For known variables is the data type as expected (For example if age
is in date format something is suspicious)
• Have labels been provided and are sensible
If anything suspicious we can further investigate it and correct accordingly
DataAnalysisCourse
VenkatReddy
7
Lab: Basic contents of the
data
• Import Data_explore.csv into SAS
• What are basic contents of the data
• Verify the check list
• Any suspicious variables?
• What is var1?
• Are all the variable names correct?
DataAnalysisCourse
VenkatReddy
8
Proc Contents-SAS
• SAS code : proc contents data=<<data name>>; run;
• Useful options :
• Short – Outputs the list of variables in a row by row format.
Code : proc contents data=test short; run;
• Out=filename - Creates a data set wherein each observation is a
variable
DataAnalysisCourse
VenkatReddy
9
Snapshot of the data
Data Snapshot, if possible
• Printing the first few observations all fields in the data set .It helps in
better understanding of the variable by looking at it’s assigned values.
Checkpoints for data snapshot output:
1. Do we have any unique identifier? Is the unique identifier getting
repeated in different records?
2. Do the text variables have meaningful data?(If text variables have
absurd data as ‘&^%*HF’ then either the variable is meaningless or the variable has
become corrupt or wasn’t properly created.)
3. Are there some coded values in the data?(if for a known variable say
State we have category codes like 1-52 then we need definition of how they are
coded.)
4. Do all the variables appear to have data? (In case variables are not
populated with non missing meaningful value it would show in print. We can further
investigates using means statistics.)
DataAnalysisCourse
VenkatReddy
10
Proc print in SAS
SAS code : proc print data=<<data set>>; run;
• Useful options :
proc print data=<<data set>> label noobs heading=vertical;
var <<variable-list>>; by var1; run;
• Label:The label option uses variable labels as column headings rather than
variable names (the default).
• Obs : Restricts the number of observations in the output
• Nobs: It omits the OBS column of output.
• Heading=vertical: It prints the column headings vertically. This is useful when the
names are long but the values of the variable are short.
• Var: Specifies the variables to be listed and the order in which they will appear.
• By: By statement produces output grouped by values of the mentioned variables
DataAnalysisCourse
VenkatReddy
11
Lab: Data exploration & validation
• Print the first 10 observations
• Do we have any unique identifier?
• Do the text variables have meaningful data?
• Are there some coded values in the data?
• Do all the variables appear to have data
DataAnalysisCourse
VenkatReddy
12
Categorical field frequencies
• Calculate frequency counts cross-tabulation frequencies for
Especially for categorical, discrete & class fields
• Frequencies
• help us understanding the variable by looking at the values it’s taking
and data count at each value.
• They also helps us in analyzing the relationships between variables
by looking at the cross tab frequencies or by looking at association
Checkpoints for looking frequency table
1. Are values as expected?
2. Variable understanding : Distinct values of a particular variable,
missing percentages
3. Are there any extreme values or outliers?
4. Any possibility of creating a new variable having small number of
distinct category by clubbing certain categories with others.
DataAnalysisCourse
VenkatReddy
13
Proc Freq in SAS
• SAS code: Proc FREQ data =<dataset > <options> ;
TABLES requests < / options > ; // Gives Frequency Count or Cross Tab
BY <varl> ; // Grouping output based on varl
WEIGHT variable < / option > ; //Specifying Weight (if applicable)
OUTPUT < OUT=SAS-data-set > options ; //Output results to another data
set run;
• Useful options :
• Order=Freq - sorts by descending frequency count (default is the unformatted
value). Ex: proc freq data=test order=freq; tables X1-X5; run;
• Nocol/Norow/Nopercent - suppresses printing of column, row and cell
percentages respectively of a cross tab. Ex : proc freq data=test; tables
AGE*bad/nocol norow nopercent missing; run;
• Missing- interprets missing values as non-missing and includes them in % and
statistics calculations ex : proc freq data=test; tables CHANNEL* BAD
/missing; run;
• Chisq - performs several chi-square tests. Ex: proc freq data=test; tables
channel*bad/chisq; run;
DataAnalysisCourse
VenkatReddy
14
Lab: Frequencies
• Find the frequencies of all class variables in the data
• Are there any variables with missing values?
• Are there any default values?
• Can you identify the variables with outliers?
DataAnalysisCourse
VenkatReddy
15
Descriptive Statistics for continuous
fields
• Distribution of numeric variables by calculating
• N – Count of non missing observations
• Nmiss – Count of Missing observations
• Min, Max, Median, Mean
• Quartile numbers & percentiles– P1, p5,p10,q1(p25),q3(p75),
p90,p99
• Stddev
• Var
• Skewness
• Kurtosis
DataAnalysisCourse
VenkatReddy
16
Descriptive Statistics Checkpoints
• Are variable distribution as expected.
• What is the central tendency of the variable? Mean, Median and
Mode across each variable
• Is the concentration of variables as expected ? What are quartiles?
• Indicates variables which are unary I.e stddev=0 ; the variables
which are useless for the current objective.
• Are there any outliers / extreme values for the variable?
• Are outlier values as expected or they have abnormally high values -
for ex for Age if max and p99 values are 10000. Then should
investigate if it’s the default value or there is some error in data
• What is the % of missing value associated with the variable?
DataAnalysisCourse
VenkatReddy
17
Proc Univariate on continuous variables
• SAS Code : PROC UNIVARIATE data=<dataset>;
VAR variable(s); run;
• Useful options :
• PROC UNIVARIATE data=<dataset> plot normal;
HISTOGRAM <variable(s)> </ option(s)>;
By variable;
VAR variable(s); run;
• Normal option produces the tests of normality ;
• Plot option produces the 3 plots of data(stem and leaf plot, box plot,
normal probability plot
• By option is used for giving outputs separated by categories
• Histogram option gives the distribution of variable in a histogram
DataAnalysisCourse
VenkatReddy
18
Proc Means in SAS
• Proc means data=<data set> < options>;
Var <variable list >;
Run;
• If variable list is not mentioned it gives results across all numeric
variables
• If options are not specified by default it gives stats like – n , min,
max, mean and stddev.
• Useful Options :
• By : Calculates statistics based on grouping across specified
variable;
Proc means data=check n nmiss min max;
var age ;
class channel;
run;
DataAnalysisCourse
VenkatReddy
19
General Checks
• Mean=Median?
• Counted proportion data. If data consists of counted
proportions, e.g. number of individuals responding out of
total number of individuals,
• Data Sufficiency: Data Sufficiency involves ensuring that the
data has the required attributes to make the prediction as
stated by objective
• Eg1: To build a model to predict fraud, the given data doesn’t have
any key for identifying fraud accounts or those identifiers are erased
than there are no accounts which we can identify as ‘bad’ and build a
model to predict the same.
• Eg2: If we are building a response model specifically for internet
channel for a “airline card”. Then data should have a identifier for
‘channel of acquisition’ to identify the right data base on which to
build the model
DataAnalysisCourse
VenkatReddy
20
Lab: Data exploration & validation
• Find N, Average, sd, minimum & maximum
• Is N same for all the variables?
• Any variables with unusual min & max?
• Identify list of suspicious variables
• Find below statistics for all the doubtful variables
• N,Mean,Median,Mode
• Std Deviation
• Skewness
• Variance
• Kurtosis
• Interquartile Range
• Quantiles 100% Max, 99%,,95%,,90%, 75% Q3,50% Median,25%
Q1,10%,5%,1%,0% Min
• See the variable definitions and possible values
• Identify variables with missing values, default values & outliers
DataAnalysisCourse
VenkatReddy
21
Now what…?
• Some variables contain outliers
• Some variables have default values
• Some variables have missing values
• RevolvingUtilizationOfUnsecuredL
• NumberOfTime30_59DaysPastDueNotW
• Montly income has missing values
• Shall we delete them and go ahead with our analysis?
DataAnalysisCourse
VenkatReddy
22
Missing Values
• Data is not always available E.g., many tuples have no recorded
value for several attributes, such as customer income in sales
data
• Missing data may be due to
• Equipment malfunction
• Inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of
entry
• Not register history or changes of the data
• Missing data may need to be inferred.
• Missing data - values, attributes, entire records, entire sections
• Missing values and defaults are indistinguishable
DataAnalysisCourse
VenkatReddy
23
Missing Value Imputation1
• Standalone imputation
• Mean, median, other point estimates
• Convenient, easy to implement
• Assume: Distribution of the missing values is the same as the
non-missing values.
• Does not take into account inter-relationships
• Eg: The average of available values is 11.4. Can we replace the
missing value in this table by 11.4 ?
DataAnalysisCourse
VenkatReddy
24
X1
11.0
11.1
11.9
10.9
10.8
.
11.5
11.6
11.6
11.4
11
12
11.8
11.4
11.9
Missing Value Imputation2
• Use attribute relationships
• Better imputation
• Two techniques
• Propensity score (nonparametric). Useful for
discrete variables
• Regression (parametric)
• There are two missing values in x2. What are the most
appropriate replacements
DataAnalysisCourse
VenkatReddy
25
X1 X2
-4 -12
2 6
-6 -18
8 24
-1
-4 -12
-5 -15
4 12
-4 -12
-5 -15
-2
4 12
10 30
-10 -30
-3 -9
Missing Value Imputation3
• There are two missing values in x2. Find the
most appropriate replacements
DataAnalysisCourse
VenkatReddy
26
X1 X2
4 1
5 1
4 1
3 1
3 1
4 1
5 1
3
31 0
39 0
32 0
37 0
32 0
32 0
32
Missing Value Imputation4
• What if more than 50% are missing?
• It doesn’t make sense to carry out the analysis on 205 or 30%
of the whole data and give inferences on overall data
• The best imputation is ignore the actual values and take
available or not available info
DataAnalysisCourse
VenkatReddy
27
Default Values Treatment
• Special or default values are values like 999 or 999999 which
fall outside the normal range of data .
• Example:
• Number of cards= 99999
• For instance a no. of bankcards variable usually has values from 0
to 100 but 99999 values in the data represent the population
which does not have any trade lines. Including them in the
regression as 99999 would skew the regression results, hence we
need to treat them accordingly.
• Special or default values also should be treated as we are
treating the missing values depending upon the no. of
categories and the % of default value.
DataAnalysisCourse
VenkatReddy
28
Outlier Treatment
• They can also be taken care of by capping or flooring
them to a realistic value / where the trend is being
maintained ( especially if the % of default value is very
less).
• Sometimes outlier finding it self is the aim of the
analysis
• Flooring
• Capping
• Treat as a separate segment
• We can cap the data at 40, anything above 40 is 40
DataAnalysisCourse
VenkatReddy
29
Player Age
26
39
37
23
24
27
29
35
29
30
21
25
58
60
39
26
32
21
24
20
35
25
20
MissingValue & Outlier Treatment
DataAnalysisCourse
VenkatReddy
30
Treatment
% of
Missing
No. of
categories
Missing
Treatment
Variable
<=25
<=50%
Impute with similar/closest bad rate group
If bad rate is very different from all other
categories, then assign a special value and
include in the regression analysis
>50% Create a dummy / indicator variable with missing
(as 1) vs. non missing category
>25
<=10% Impute with the mean / median value
>10% &
<=50%
Assign a special value to the missing category in
the original variable and create an indicator
variable with missing value as 1 and others as 0
>50%
Create a dummy / indicator variable with missing
value as 1 and others as 0
Lab: Data Cleaning step by step
• Var-1: Change the variable name to sr_no
• SeriousDlqin2yrs : Only training data has data objective variable test
data doesn’t have objective variable in it. Subset training data & test
from overall data
• Age: If age <21 make it 21
DataAnalysisCourse
VenkatReddy
31
Lab :Data Cleaning step by step
• RevolvingUtilizationOfUnsecuredL: What type of variable is this? What are the possible
values?
• Replace anything more than 1 with_____?
DataAnalysisCourse
VenkatReddy
32
Treatment% of Missing
No. of
categories
Missing
Treatment
Utilizatio
n pct
<=25
<=50%
Impute with similar/closest bad rate group
If bad rate is very different from all other categories,
then assign a special value and include in the
regression analysis
>50% Create a dummy / indicator variable with missing (as
1) vs. non missing category
>25
<=10% Impute with the mean / median value
>10% &
<=50%
Assign a special value to the missing category in the
original variable and create an indicator variable with
missing value as 1 and others as 0
>50%
Create a dummy / indicator variable with missing
value as 1 and others as 0
Lab : Data Cleaning step by step
NumberOfTime30_59DaysPastDueNotW
• Find bad rate in each category of this variable
• Replace 96 with _____? Replace 98 with_____?
DataAnalysisCourse
VenkatReddy
33
Treatment% of Missing
No. of
categories
Missing
Treatment
30_59Days
PastDue
<=25
<=50%
Impute with similar/closest bad rate
group
If bad rate is very different from all other categories,
then assign a special value and include in the
regression analysis
>50% Create a dummy / indicator variable with missing (as
1) vs. non missing category
>25
<=10% Impute with the mean / median value
>10% &
<=50%
Assign a special value to the missing category in the
original variable and create an indicator variable
with missing value as 1 and others as 0
>50%
Create a dummy / indicator variable with missing
value as 1 and others as 0
Lab :Data Cleaning step by step
• Monthly Income
DataAnalysisCourse
VenkatReddy
34
Treatment% of Missing
No. of
categories
Missing
Treatment
Monthly
Income
<=25
<=50%
Impute with similar/closest bad rate group
If bad rate is very different from all other categories,
then assign a special value and include in the
regression analysis
>50% Create a dummy / indicator variable with missing
(as 1) vs. non missing category
>25
<=10% Impute with the mean / median value
>10% &
<=50%
Assign a special value to the missing category in the
original variable and create an indicator variable
with missing value as 1 and others as 0
>50%
Create a dummy / indicator variable with missing
value as 1 and others as 0
Lab :Data Cleaning step by step
• Debt Ratio: Similar Imputation
• NumberOfOpenCreditLinesAndLoans : No clear evidence
• NumberOfTimes90DaysLate: Imputation similar to
NumberOfTime30_59DaysPastDueNotW
• NumberRealEstateLoansOrLines: : No clear evidence
• NumberOfTime60_89DaysPastDueNotW: Imputation similar
to NumberOfTime30_59DaysPastDueNotW
• NumberOfDependents: Impute with equal bad rate
DataAnalysisCourse
VenkatReddy
35
Variables & Treatment
DataAnalysisCourse
VenkatReddy
36
Old Var Type Treatment New Var
VAR1 Num Nothing
SeriousDlqin2yrs Num Nothing
RevolvingUtilizationOfUnsecuredL Num Impute with the mean Util
age Num flooring age1
NumberOfTime30_59DaysPastDueNotW Num Impute with the mean NumberOfTime30_59Da
ysPastDue1
DebtRatio Num Impute with the median DebtRatio1
MonthlyIncome Char Convert to num & create
a dummy var
ind_MonthlyIncome,
MonthlyIncome1
NumberOfOpenCreditLinesAndLoans Num Impute with median num_open_lines
NumberOfTimes90DaysLate Num Imputing & capping delq_90
NumberRealEstateLoansOrLines Num Capping num_loans
NumberOfTime60_89DaysPastDueNotW Num Capping & Imputing delq_60to89
NumberOfDependents Char
obs_type Char Subset training data Obs_type

Más contenido relacionado

La actualidad más candente

Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Simplilearn
 
Data structures and algorithm analysis in java
Data structures and algorithm analysis in javaData structures and algorithm analysis in java
Data structures and algorithm analysis in javaMuhammad Aleem Siddiqui
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...ranjit banshpal
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier ananth
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with RWei Zhong Toh
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahoutGaurav Kasliwal
 
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiAutomatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiSri Ambati
 
Decision Trees - The Machine Learning Magic Unveiled
Decision Trees - The Machine Learning Magic UnveiledDecision Trees - The Machine Learning Magic Unveiled
Decision Trees - The Machine Learning Magic UnveiledLuca Zavarella
 
Market Basket Analysis in SQL Server Machine Learning Services
Market Basket Analysis in SQL Server Machine Learning ServicesMarket Basket Analysis in SQL Server Machine Learning Services
Market Basket Analysis in SQL Server Machine Learning ServicesLuca Zavarella
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
Machine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionMachine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionAndrew Ferlitsch
 

La actualidad más candente (20)

Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
 
Data structures and algorithm analysis in java
Data structures and algorithm analysis in javaData structures and algorithm analysis in java
Data structures and algorithm analysis in java
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Data Mining
Data MiningData Mining
Data Mining
 
Data1
Data1Data1
Data1
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiAutomatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
 
Decision Trees - The Machine Learning Magic Unveiled
Decision Trees - The Machine Learning Magic UnveiledDecision Trees - The Machine Learning Magic Unveiled
Decision Trees - The Machine Learning Magic Unveiled
 
L3. Decision Trees
L3. Decision TreesL3. Decision Trees
L3. Decision Trees
 
Market Basket Analysis in SQL Server Machine Learning Services
Market Basket Analysis in SQL Server Machine Learning ServicesMarket Basket Analysis in SQL Server Machine Learning Services
Market Basket Analysis in SQL Server Machine Learning Services
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
An introduction to R
An introduction to RAn introduction to R
An introduction to R
 
Machine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionMachine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable Conversion
 

Destacado

Getting started with Tableau
Getting started with TableauGetting started with Tableau
Getting started with TableauParth Acharya
 
Tableau @ Facebook - Summer 2014
Tableau @ Facebook - Summer 2014Tableau @ Facebook - Summer 2014
Tableau @ Facebook - Summer 2014Andy Kriebel
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualizationlesterathayde
 

Destacado (6)

Getting started with Tableau
Getting started with TableauGetting started with Tableau
Getting started with Tableau
 
Tableau @ Facebook - Summer 2014
Tableau @ Facebook - Summer 2014Tableau @ Facebook - Summer 2014
Tableau @ Facebook - Summer 2014
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
Data Analyst - Interview Guide
Data Analyst - Interview GuideData Analyst - Interview Guide
Data Analyst - Interview Guide
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualization
 

Similar a Data exploration validation and sanitization

Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
1. chapter i(pasw)
1. chapter i(pasw)1. chapter i(pasw)
1. chapter i(pasw)Chhom Karath
 
Introduction - Using Stata
Introduction - Using StataIntroduction - Using Stata
Introduction - Using StataRyan Herzog
 
1.1 introduction to Data Structures.ppt
1.1 introduction to Data Structures.ppt1.1 introduction to Data Structures.ppt
1.1 introduction to Data Structures.pptAshok280385
 
RSS 2012 Data Entry SPSS
RSS 2012 Data Entry SPSSRSS 2012 Data Entry SPSS
RSS 2012 Data Entry SPSSWesam Abuznadah
 
Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016George Roth
 
Data analysis using spss
Data analysis using spssData analysis using spss
Data analysis using spssSyed Faisal
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISBabasID2
 
Lec 3 HRA SPSS
Lec 3 HRA SPSSLec 3 HRA SPSS
Lec 3 HRA SPSSpal83111
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwaresDr.ammara khakwani
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and ProcessingMehul Gondaliya
 

Similar a Data exploration validation and sanitization (20)

Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
1. chapter i(pasw)
1. chapter i(pasw)1. chapter i(pasw)
1. chapter i(pasw)
 
EDA
EDAEDA
EDA
 
Introduction - Using Stata
Introduction - Using StataIntroduction - Using Stata
Introduction - Using Stata
 
1.1 introduction to Data Structures.ppt
1.1 introduction to Data Structures.ppt1.1 introduction to Data Structures.ppt
1.1 introduction to Data Structures.ppt
 
RSS 2012 Data Entry SPSS
RSS 2012 Data Entry SPSSRSS 2012 Data Entry SPSS
RSS 2012 Data Entry SPSS
 
Introduction
IntroductionIntroduction
Introduction
 
Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016
 
Data analysis using spss
Data analysis using spssData analysis using spss
Data analysis using spss
 
trs-3.ppt
trs-3.ppttrs-3.ppt
trs-3.ppt
 
trs-3.ppt
trs-3.ppttrs-3.ppt
trs-3.ppt
 
trs-3.ppt
trs-3.ppttrs-3.ppt
trs-3.ppt
 
trs-3.ppt
trs-3.ppttrs-3.ppt
trs-3.ppt
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSIS
 
Lec 3 HRA SPSS
Lec 3 HRA SPSSLec 3 HRA SPSS
Lec 3 HRA SPSS
 
4 module 3 --
4 module 3 --4 module 3 --
4 module 3 --
 
EDA by Sastry.pptx
EDA by Sastry.pptxEDA by Sastry.pptx
EDA by Sastry.pptx
 
Spss beginners
Spss beginnersSpss beginners
Spss beginners
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwares
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and Processing
 

Más de Venkata Reddy Konasani

Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Venkata Reddy Konasani
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniquesVenkata Reddy Konasani
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Venkata Reddy Konasani
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced featuresVenkata Reddy Konasani
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1Venkata Reddy Konasani
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval Venkata Reddy Konasani
 

Más de Venkata Reddy Konasani (20)

Transformers 101
Transformers 101 Transformers 101
Transformers 101
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Decision tree
Decision treeDecision tree
Decision tree
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
ARIMA
ARIMA ARIMA
ARIMA
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1
 
Big data Introduction by Mohan
Big data Introduction by MohanBig data Introduction by Mohan
Big data Introduction by Mohan
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval
 
Testing of hypothesis
Testing of hypothesisTesting of hypothesis
Testing of hypothesis
 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
 

Último

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxleah joy valeriano
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 

Último (20)

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 

Data exploration validation and sanitization

  • 1. Data Analysis Course Preparing data for analysis Venkat Reddy
  • 2. Contents • Need of data exploration • Data Exploration • Data Validation • Data Sanitization • Missing Value Treatment • Outlier Treatment Identification & Treatment DataAnalysisCourse VenkatReddy 2
  • 3. Main steps in statistical data analysis DataAnalysisCourse VenkatReddy 3
  • 4. Remember… Data in the real world is dirty Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. e.g., occupation=“ ” • Incomplete data may come from • “Not applicable” data value when collected. • Different considerations between the time when the data was collected and when it is analyzed. • Human/hardware/software problems Noisy: Containing errors or outliers. e.g., Salary=“-10” • Noisy data (incorrect values) may come from • Faulty data collection instruments • Human or computer error at data entry • Errors in data transmission Inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997”, Was rating “1,2,3”, now rating “A, B, C”,e.g., discrepancy between duplicate records • Inconsistent data may come from • Different data sources • Functional dependency violation (e.g., modify some linked data) DataAnalysisCourse VenkatReddy 4
  • 5. Missing values and Outliers • How to identify the missing values? • What are outliers? • Sometimes outlier finding it self is the aim of the analysis DataAnalysisCourse VenkatReddy 5
  • 6. Data & Objective • Loans data: Historical data are provided on 250,000 borrowers. • The objective is to build a model that borrowers can use to help make the best financial decisions. DataAnalysisCourse VenkatReddy 6 Variable Name Description Type SeriousDlqin2yrs Person experienced 90 days past due delinquency or worse Y/N RevolvingUtilizationOfUnsecure dLines Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits percentage age Age of borrower in years integer NumberOfTime30- 59DaysPastDueNotWorse Number of times borrower has been 30-59 days past due but no worse in the last 2 years. integer DebtRatio Monthly debt payments, alimony,living costs divided by monthy gross income percentage MonthlyIncome Monthly income real NumberOfOpenCreditLinesAndL oans Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) integer NumberOfTimes90DaysLate Number of times borrower has been 90 days or more past due. integer NumberRealEstateLoansOrLines Number of mortgage and real estate loans including home equity lines of credit integer NumberOfTime60- 89DaysPastDueNotWorse Number of times borrower has been 60-89 days past due but no worse in the last 2 years. integer NumberOfDependents Number of dependents in family excluding themselves (spouse, children etc.) integer
  • 7. Basic contents of the data • What are total number of observations • What are total number of fields • Each field name, Field type, Length of field • Format of field, Label Basic Contents –Check points • Are all variables as expected (variables names) • Are there some variables which are unexpected say q9 r10? • Are the data types and length across variables correct • For known variables is the data type as expected (For example if age is in date format something is suspicious) • Have labels been provided and are sensible If anything suspicious we can further investigate it and correct accordingly DataAnalysisCourse VenkatReddy 7
  • 8. Lab: Basic contents of the data • Import Data_explore.csv into SAS • What are basic contents of the data • Verify the check list • Any suspicious variables? • What is var1? • Are all the variable names correct? DataAnalysisCourse VenkatReddy 8
  • 9. Proc Contents-SAS • SAS code : proc contents data=<<data name>>; run; • Useful options : • Short – Outputs the list of variables in a row by row format. Code : proc contents data=test short; run; • Out=filename - Creates a data set wherein each observation is a variable DataAnalysisCourse VenkatReddy 9
  • 10. Snapshot of the data Data Snapshot, if possible • Printing the first few observations all fields in the data set .It helps in better understanding of the variable by looking at it’s assigned values. Checkpoints for data snapshot output: 1. Do we have any unique identifier? Is the unique identifier getting repeated in different records? 2. Do the text variables have meaningful data?(If text variables have absurd data as ‘&^%*HF’ then either the variable is meaningless or the variable has become corrupt or wasn’t properly created.) 3. Are there some coded values in the data?(if for a known variable say State we have category codes like 1-52 then we need definition of how they are coded.) 4. Do all the variables appear to have data? (In case variables are not populated with non missing meaningful value it would show in print. We can further investigates using means statistics.) DataAnalysisCourse VenkatReddy 10
  • 11. Proc print in SAS SAS code : proc print data=<<data set>>; run; • Useful options : proc print data=<<data set>> label noobs heading=vertical; var <<variable-list>>; by var1; run; • Label:The label option uses variable labels as column headings rather than variable names (the default). • Obs : Restricts the number of observations in the output • Nobs: It omits the OBS column of output. • Heading=vertical: It prints the column headings vertically. This is useful when the names are long but the values of the variable are short. • Var: Specifies the variables to be listed and the order in which they will appear. • By: By statement produces output grouped by values of the mentioned variables DataAnalysisCourse VenkatReddy 11
  • 12. Lab: Data exploration & validation • Print the first 10 observations • Do we have any unique identifier? • Do the text variables have meaningful data? • Are there some coded values in the data? • Do all the variables appear to have data DataAnalysisCourse VenkatReddy 12
  • 13. Categorical field frequencies • Calculate frequency counts cross-tabulation frequencies for Especially for categorical, discrete & class fields • Frequencies • help us understanding the variable by looking at the values it’s taking and data count at each value. • They also helps us in analyzing the relationships between variables by looking at the cross tab frequencies or by looking at association Checkpoints for looking frequency table 1. Are values as expected? 2. Variable understanding : Distinct values of a particular variable, missing percentages 3. Are there any extreme values or outliers? 4. Any possibility of creating a new variable having small number of distinct category by clubbing certain categories with others. DataAnalysisCourse VenkatReddy 13
  • 14. Proc Freq in SAS • SAS code: Proc FREQ data =<dataset > <options> ; TABLES requests < / options > ; // Gives Frequency Count or Cross Tab BY <varl> ; // Grouping output based on varl WEIGHT variable < / option > ; //Specifying Weight (if applicable) OUTPUT < OUT=SAS-data-set > options ; //Output results to another data set run; • Useful options : • Order=Freq - sorts by descending frequency count (default is the unformatted value). Ex: proc freq data=test order=freq; tables X1-X5; run; • Nocol/Norow/Nopercent - suppresses printing of column, row and cell percentages respectively of a cross tab. Ex : proc freq data=test; tables AGE*bad/nocol norow nopercent missing; run; • Missing- interprets missing values as non-missing and includes them in % and statistics calculations ex : proc freq data=test; tables CHANNEL* BAD /missing; run; • Chisq - performs several chi-square tests. Ex: proc freq data=test; tables channel*bad/chisq; run; DataAnalysisCourse VenkatReddy 14
  • 15. Lab: Frequencies • Find the frequencies of all class variables in the data • Are there any variables with missing values? • Are there any default values? • Can you identify the variables with outliers? DataAnalysisCourse VenkatReddy 15
  • 16. Descriptive Statistics for continuous fields • Distribution of numeric variables by calculating • N – Count of non missing observations • Nmiss – Count of Missing observations • Min, Max, Median, Mean • Quartile numbers & percentiles– P1, p5,p10,q1(p25),q3(p75), p90,p99 • Stddev • Var • Skewness • Kurtosis DataAnalysisCourse VenkatReddy 16
  • 17. Descriptive Statistics Checkpoints • Are variable distribution as expected. • What is the central tendency of the variable? Mean, Median and Mode across each variable • Is the concentration of variables as expected ? What are quartiles? • Indicates variables which are unary I.e stddev=0 ; the variables which are useless for the current objective. • Are there any outliers / extreme values for the variable? • Are outlier values as expected or they have abnormally high values - for ex for Age if max and p99 values are 10000. Then should investigate if it’s the default value or there is some error in data • What is the % of missing value associated with the variable? DataAnalysisCourse VenkatReddy 17
  • 18. Proc Univariate on continuous variables • SAS Code : PROC UNIVARIATE data=<dataset>; VAR variable(s); run; • Useful options : • PROC UNIVARIATE data=<dataset> plot normal; HISTOGRAM <variable(s)> </ option(s)>; By variable; VAR variable(s); run; • Normal option produces the tests of normality ; • Plot option produces the 3 plots of data(stem and leaf plot, box plot, normal probability plot • By option is used for giving outputs separated by categories • Histogram option gives the distribution of variable in a histogram DataAnalysisCourse VenkatReddy 18
  • 19. Proc Means in SAS • Proc means data=<data set> < options>; Var <variable list >; Run; • If variable list is not mentioned it gives results across all numeric variables • If options are not specified by default it gives stats like – n , min, max, mean and stddev. • Useful Options : • By : Calculates statistics based on grouping across specified variable; Proc means data=check n nmiss min max; var age ; class channel; run; DataAnalysisCourse VenkatReddy 19
  • 20. General Checks • Mean=Median? • Counted proportion data. If data consists of counted proportions, e.g. number of individuals responding out of total number of individuals, • Data Sufficiency: Data Sufficiency involves ensuring that the data has the required attributes to make the prediction as stated by objective • Eg1: To build a model to predict fraud, the given data doesn’t have any key for identifying fraud accounts or those identifiers are erased than there are no accounts which we can identify as ‘bad’ and build a model to predict the same. • Eg2: If we are building a response model specifically for internet channel for a “airline card”. Then data should have a identifier for ‘channel of acquisition’ to identify the right data base on which to build the model DataAnalysisCourse VenkatReddy 20
  • 21. Lab: Data exploration & validation • Find N, Average, sd, minimum & maximum • Is N same for all the variables? • Any variables with unusual min & max? • Identify list of suspicious variables • Find below statistics for all the doubtful variables • N,Mean,Median,Mode • Std Deviation • Skewness • Variance • Kurtosis • Interquartile Range • Quantiles 100% Max, 99%,,95%,,90%, 75% Q3,50% Median,25% Q1,10%,5%,1%,0% Min • See the variable definitions and possible values • Identify variables with missing values, default values & outliers DataAnalysisCourse VenkatReddy 21
  • 22. Now what…? • Some variables contain outliers • Some variables have default values • Some variables have missing values • RevolvingUtilizationOfUnsecuredL • NumberOfTime30_59DaysPastDueNotW • Montly income has missing values • Shall we delete them and go ahead with our analysis? DataAnalysisCourse VenkatReddy 22
  • 23. Missing Values • Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • Equipment malfunction • Inconsistent with other recorded data and thus deleted • Data not entered due to misunderstanding • Certain data may not be considered important at the time of entry • Not register history or changes of the data • Missing data may need to be inferred. • Missing data - values, attributes, entire records, entire sections • Missing values and defaults are indistinguishable DataAnalysisCourse VenkatReddy 23
  • 24. Missing Value Imputation1 • Standalone imputation • Mean, median, other point estimates • Convenient, easy to implement • Assume: Distribution of the missing values is the same as the non-missing values. • Does not take into account inter-relationships • Eg: The average of available values is 11.4. Can we replace the missing value in this table by 11.4 ? DataAnalysisCourse VenkatReddy 24 X1 11.0 11.1 11.9 10.9 10.8 . 11.5 11.6 11.6 11.4 11 12 11.8 11.4 11.9
  • 25. Missing Value Imputation2 • Use attribute relationships • Better imputation • Two techniques • Propensity score (nonparametric). Useful for discrete variables • Regression (parametric) • There are two missing values in x2. What are the most appropriate replacements DataAnalysisCourse VenkatReddy 25 X1 X2 -4 -12 2 6 -6 -18 8 24 -1 -4 -12 -5 -15 4 12 -4 -12 -5 -15 -2 4 12 10 30 -10 -30 -3 -9
  • 26. Missing Value Imputation3 • There are two missing values in x2. Find the most appropriate replacements DataAnalysisCourse VenkatReddy 26 X1 X2 4 1 5 1 4 1 3 1 3 1 4 1 5 1 3 31 0 39 0 32 0 37 0 32 0 32 0 32
  • 27. Missing Value Imputation4 • What if more than 50% are missing? • It doesn’t make sense to carry out the analysis on 205 or 30% of the whole data and give inferences on overall data • The best imputation is ignore the actual values and take available or not available info DataAnalysisCourse VenkatReddy 27
  • 28. Default Values Treatment • Special or default values are values like 999 or 999999 which fall outside the normal range of data . • Example: • Number of cards= 99999 • For instance a no. of bankcards variable usually has values from 0 to 100 but 99999 values in the data represent the population which does not have any trade lines. Including them in the regression as 99999 would skew the regression results, hence we need to treat them accordingly. • Special or default values also should be treated as we are treating the missing values depending upon the no. of categories and the % of default value. DataAnalysisCourse VenkatReddy 28
  • 29. Outlier Treatment • They can also be taken care of by capping or flooring them to a realistic value / where the trend is being maintained ( especially if the % of default value is very less). • Sometimes outlier finding it self is the aim of the analysis • Flooring • Capping • Treat as a separate segment • We can cap the data at 40, anything above 40 is 40 DataAnalysisCourse VenkatReddy 29 Player Age 26 39 37 23 24 27 29 35 29 30 21 25 58 60 39 26 32 21 24 20 35 25 20
  • 30. MissingValue & Outlier Treatment DataAnalysisCourse VenkatReddy 30 Treatment % of Missing No. of categories Missing Treatment Variable <=25 <=50% Impute with similar/closest bad rate group If bad rate is very different from all other categories, then assign a special value and include in the regression analysis >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category >25 <=10% Impute with the mean / median value >10% & <=50% Assign a special value to the missing category in the original variable and create an indicator variable with missing value as 1 and others as 0 >50% Create a dummy / indicator variable with missing value as 1 and others as 0
  • 31. Lab: Data Cleaning step by step • Var-1: Change the variable name to sr_no • SeriousDlqin2yrs : Only training data has data objective variable test data doesn’t have objective variable in it. Subset training data & test from overall data • Age: If age <21 make it 21 DataAnalysisCourse VenkatReddy 31
  • 32. Lab :Data Cleaning step by step • RevolvingUtilizationOfUnsecuredL: What type of variable is this? What are the possible values? • Replace anything more than 1 with_____? DataAnalysisCourse VenkatReddy 32 Treatment% of Missing No. of categories Missing Treatment Utilizatio n pct <=25 <=50% Impute with similar/closest bad rate group If bad rate is very different from all other categories, then assign a special value and include in the regression analysis >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category >25 <=10% Impute with the mean / median value >10% & <=50% Assign a special value to the missing category in the original variable and create an indicator variable with missing value as 1 and others as 0 >50% Create a dummy / indicator variable with missing value as 1 and others as 0
  • 33. Lab : Data Cleaning step by step NumberOfTime30_59DaysPastDueNotW • Find bad rate in each category of this variable • Replace 96 with _____? Replace 98 with_____? DataAnalysisCourse VenkatReddy 33 Treatment% of Missing No. of categories Missing Treatment 30_59Days PastDue <=25 <=50% Impute with similar/closest bad rate group If bad rate is very different from all other categories, then assign a special value and include in the regression analysis >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category >25 <=10% Impute with the mean / median value >10% & <=50% Assign a special value to the missing category in the original variable and create an indicator variable with missing value as 1 and others as 0 >50% Create a dummy / indicator variable with missing value as 1 and others as 0
  • 34. Lab :Data Cleaning step by step • Monthly Income DataAnalysisCourse VenkatReddy 34 Treatment% of Missing No. of categories Missing Treatment Monthly Income <=25 <=50% Impute with similar/closest bad rate group If bad rate is very different from all other categories, then assign a special value and include in the regression analysis >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category >25 <=10% Impute with the mean / median value >10% & <=50% Assign a special value to the missing category in the original variable and create an indicator variable with missing value as 1 and others as 0 >50% Create a dummy / indicator variable with missing value as 1 and others as 0
  • 35. Lab :Data Cleaning step by step • Debt Ratio: Similar Imputation • NumberOfOpenCreditLinesAndLoans : No clear evidence • NumberOfTimes90DaysLate: Imputation similar to NumberOfTime30_59DaysPastDueNotW • NumberRealEstateLoansOrLines: : No clear evidence • NumberOfTime60_89DaysPastDueNotW: Imputation similar to NumberOfTime30_59DaysPastDueNotW • NumberOfDependents: Impute with equal bad rate DataAnalysisCourse VenkatReddy 35
  • 36. Variables & Treatment DataAnalysisCourse VenkatReddy 36 Old Var Type Treatment New Var VAR1 Num Nothing SeriousDlqin2yrs Num Nothing RevolvingUtilizationOfUnsecuredL Num Impute with the mean Util age Num flooring age1 NumberOfTime30_59DaysPastDueNotW Num Impute with the mean NumberOfTime30_59Da ysPastDue1 DebtRatio Num Impute with the median DebtRatio1 MonthlyIncome Char Convert to num & create a dummy var ind_MonthlyIncome, MonthlyIncome1 NumberOfOpenCreditLinesAndLoans Num Impute with median num_open_lines NumberOfTimes90DaysLate Num Imputing & capping delq_90 NumberRealEstateLoansOrLines Num Capping num_loans NumberOfTime60_89DaysPastDueNotW Num Capping & Imputing delq_60to89 NumberOfDependents Char obs_type Char Subset training data Obs_type