preprocessing techniques.pptx

Table of Contents
 Feature Engineering
 Feature Scaling
 Train-Test Split
 Encoding Techniques
 Imbalanced Data

Feature Engineering
 When a dataset has a large number of columns, it runs into the curse of dimensionality
(high dimensionality), and the model tends to perform poorly.
 In that case we remove the columns that are not useful:
 1. columns with more than 70% missing values
 2. columns with low variance (the values are almost constant, so they carry little information)
 3. columns that are highly correlated with other columns (they repeat the same information)
 Finally, we can reduce the remaining dimensions with PCA (principal component analysis),
an unsupervised technique that does not use the target labels,
 or with LDA (linear discriminant analysis), a supervised technique that uses the class labels,
for supervised machine learning models.
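
The three column-dropping rules above can be sketched in plain Python. This is only an illustration under stated assumptions: the table is a dict of column lists, `None` stands in for a missing value, and the helper names (`missing_fraction`, `pearson`, `select_columns`) and the variance and correlation thresholds are hypothetical choices, not values from the slides. Only the 70% missing-value cutoff comes from the deck.

```python
from statistics import pvariance


def missing_fraction(values):
    """Share of entries that are None (None stands in for a missing value)."""
    return sum(v is None for v in values) / len(values)


def pearson(a, b):
    """Pearson correlation over the rows where both columns have a value."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    denom = (var_a * var_b) ** 0.5
    return cov / denom if denom else 0.0


def select_columns(table, max_missing=0.7, min_variance=1e-3, max_corr=0.95):
    """Return the names of the columns that survive all three filters."""
    kept = []
    for name, values in table.items():
        if missing_fraction(values) > max_missing:
            continue  # rule 1: more than 70% of the values are missing
        present = [v for v in values if v is not None]
        if pvariance(present) < min_variance:
            continue  # rule 2: the column is nearly constant
        if any(abs(pearson(values, table[k])) > max_corr for k in kept):
            continue  # rule 3: almost a copy of a column we already kept
        kept.append(name)
    return kept


# Made-up example data, for illustration only.
table = {
    "age": [25, 32, 47, 51, 38],
    "age_months": [300, 384, 564, 612, 456],  # age * 12, so fully correlated
    "constant": [1, 1, 1, 1, 1],
    "mostly_missing": [None, None, None, None, 3.0],
    "income": [30, 52, 41, 90, 38],
}
selected = select_columns(table)
print(selected)
```

In this sketch, the first column of a correlated pair is kept and the later one is dropped; which one to keep in practice is a judgment call.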

Feature Scaling
MinMaxScaler (min-max scaling):
x_norm^(i) = (x^(i) - x_min) / (x_max - x_min)
StandardScaler (standardization):
x_std^(i) = (x^(i) - μ_x) / σ_x
 Min-max scaling rescales the data to a fixed range of values, usually [0, 1].
 In standardization, we center each feature column at mean 0 with standard deviation 1,
so that it has the same mean and spread as a standard normal distribution. The shape of the
distribution itself does not change: a skewed feature stays skewed.
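
The two formulas above can be checked with a few lines of plain Python. This is a minimal sketch, not the scikit-learn implementation: the function names `min_max_scale` and `standardize` are hypothetical, the population standard deviation is used for σ, and a column whose values are all identical would cause a division by zero here.

```python
from statistics import mean, pstdev


def min_max_scale(xs):
    """x_norm = (x - x_min) / (x_max - x_min), mapping the column onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """x_std = (x - mean) / std, giving mean 0 and standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(values)
standardized = standardize(values)
print(scaled)
print(standardized)
```

After scaling, the smallest value maps to 0 and the largest to 1. After standardization, the column has mean 0 and standard deviation 1, but the spacing pattern of the values is unchanged.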

Imbalanced Data
The problem of imbalanced classes arises when one class (or set of classes)
has far more samples than another.
Examples of imbalanced classes:
Disease detection
Fraud detection
Spam filtering
Earthquake prediction

Different ways to fix imbalanced data:
 Undersampling methods
 Oversampling methods
 Synthetic Minority Oversampling Technique (SMOTE)
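
The three approaches above can be sketched as follows. This is a simplified illustration with hypothetical function names. Random oversampling and undersampling are shown in full. For SMOTE only the core interpolation step is shown: the full technique picks the second point from the k nearest minority-class neighbours, which this sketch leaves out.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def interpolate_point(a, b, rng):
    """SMOTE-style synthetic sample: a random point on the segment from a to b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


# Made-up data: 8 samples of class 0 and 2 samples of class 1.
rows = [[i, i * 2] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
synthetic = interpolate_point(rows[8], rows[9], random.Random(0))
print(Counter(over_labels), Counter(under_labels), synthetic)
```

Resampling is normally applied to the training data only, so that the test data still reflects the real class distribution.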

Encoding techniques
Categorical data usually comes in three types:
1. Nominal: a variable that represents names, numbers, or places that have no order
2. Ordinal: a variable that represents categories with some order
3. Binary: a variable with only two values, such as 0 or 1, yes or no, True or False

1. Binary: either/or.
Examples:
 Yes, No
 True, False
2. Ordinal: specific ordered groups.
Examples:
 low, medium, high
 cold, hot, lava hot
3. Nominal: unordered groups.
Examples:
 cat, dog, tiger
 pizza, burger, coke

Encoding Techniques
 Encoding converts all categories in a variable into numerical values.
Encoding Types
1. One-Hot Encoding
2. Label Encoding
3. Frequency Encoding
4. Mean Encoding
5. Top 10 Mean Frequency Encoding
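
Four of the encodings listed above can be sketched in plain Python. The function names are hypothetical and the data is made up. The "Top 10" variant is left out because the slides do not define it. For mean encoding, the category means should be computed on the training data only, otherwise the target leaks into the features.

```python
from collections import Counter, defaultdict


def label_encode(values):
    """Map each category to an integer (alphabetical order in this sketch)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """One 0/1 column per category; exactly one column is 1 in each row."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values], categories


def frequency_encode(values):
    """Replace each category with the share of rows that have it."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def mean_encode(values, targets):
    """Replace each category with the mean target value of that category."""
    sums, counts = defaultdict(float), Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


animals = ["cat", "dog", "cat", "tiger", "dog", "cat"]
target = [1, 0, 1, 0, 1, 0]

codes, mapping = label_encode(animals)
one_hot, categories = one_hot_encode(animals)
freq = frequency_encode(animals)
means = mean_encode(animals, target)
print(codes, one_hot[0], freq[0], means[0])
```

Label encoding implies an order between the integers, so it suits ordinal data best. One-hot encoding avoids that implied order and is the usual choice for nominal data.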

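Splitting Data
Train-test split means splitting the data into training data and test data.
1. The training data is used to train the model.
2. The test data is used to evaluate the model's performance.
The split produces four parts:
1. x_train: features for training
2. x_test: features for testing
3. y_train: target values for training
4. y_test: target values for testing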
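
A minimal sketch of such a split in plain Python is shown below. The function name `split_train_test`, the 25% test share, and the seed are illustrative assumptions. The four returned parts correspond to the names listed above.

```python
import random


def split_train_test(x, y, test_size=0.25, seed=42):
    """Shuffle the row indices once, then cut them into a test and a train part."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up data: 20 rows with one feature each, and a label derived from it.
x = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]

x_train, x_test, y_train, y_test = split_train_test(x, y)
print(len(x_train), len(x_test))
```

Shuffling the indices rather than the rows keeps every feature row paired with its own target value, and the fixed seed makes the split reproducible.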
