2. Table of Contents
Feature Engineering
Feature Scaling
Train test split
Encoding techniques
Imbalanced data
3. Feature Engineering
Sometimes a dataset has a large number of columns. This leads to the curse of
dimensionality (high dimensionality), and at that point the model will not perform well.
In this case we remove the columns that are not useful:
Columns with more than 70% missing values
Columns with low variance (the values barely change, so the spread of the data is very low)
Columns that are highly correlated with other columns
Finally, we can reduce the remaining dimensions with PCA (Principal Component Analysis)
for unsupervised machine learning algorithms (it does not use the labels),
or LDA (Linear Discriminant Analysis) for supervised machine learning models (it uses the class labels).
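The three column filters above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy table, the column names, the variance cut-off `MIN_VARIANCE`, and the correlation cut-off `MAX_CORRELATION` are all hypothetical; only the 70% missing-value limit comes from the slide.

```python
from math import sqrt
from statistics import pvariance

# Hypothetical toy table: column name -> values (None marks a missing value).
columns = {
    "age":      [25, 32, 47, 51, 38],
    "income":   [30, 41, 58, 62, 45],   # moves almost in lockstep with "age"
    "rooms":    [3, 1, 4, 1, 5],
    "constant": [1, 1, 1, 1, 1],
    "sparse":   [None, None, None, None, 7],
}

MISSING_LIMIT = 0.70     # from the slide: drop columns with more than 70% missing
MIN_VARIANCE = 1e-9      # assumed cut-off for "almost no spread"
MAX_CORRELATION = 0.95   # assumed cut-off for "highly correlated"


def missing_fraction(values):
    """Share of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present.

    Assumes both columns have non-zero variance on those rows.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


def select_columns(table):
    """Return the names of the columns that survive all three filters."""
    kept = []
    for name, values in table.items():
        if missing_fraction(values) > MISSING_LIMIT:
            continue  # too many missing values
        present = [v for v in values if v is not None]
        if pvariance(present) <= MIN_VARIANCE:
            continue  # nearly constant, carries little information
        if any(abs(pearson(values, table[k])) > MAX_CORRELATION for k in kept):
            continue  # duplicates information already in a kept column
        kept.append(name)
    return kept


kept_columns = select_columns(columns)
print(kept_columns)  # ['age', 'rooms']
```

Here "sparse" fails the missing-value filter, "constant" fails the variance filter, and "income" is dropped because it is almost perfectly correlated with "age", which was kept first.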
4. Feature Scaling
MinMaxScaler (normalization):
x_norm(i) = (x(i) - x_min) / (x_max - x_min)
StandardScaler (standardization):
x_std(i) = (x(i) - μ_x) / σ_x
Min-max scaling rescales the data to a limited range of values, typically 0 to 1.
In standardization, we center each feature column at mean 0 with standard deviation 1.
This does not change the shape of the distribution, so a column becomes standard normal
only if it was normally distributed to begin with.
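The two formulas can be checked on a small column of numbers. The sketch below uses only the Python standard library; the function names and the sample values are made up for illustration, and it assumes the column is not constant (otherwise both formulas divide by zero). It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses by default.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """x_norm(i) = (x(i) - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """x_std(i) = (x(i) - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


feature = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature column

normalized = min_max_scale(feature)
standardized = standardize(feature)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(round(mean(standardized), 6), round(pstdev(standardized), 6))
```

After min-max scaling the smallest value maps to 0 and the largest to 1; after standardization the column has mean 0 and standard deviation 1, but the relative spacing of the values is unchanged.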
5. Imbalanced Data
The problem of imbalanced classes arises when one class (or set of classes)
has far more samples than another
Examples of imbalanced classes:
Disease detection
Fraud detection
Spam filtering
Earthquake prediction
6. Different ways to fix imbalanced data:
Undersampling methods
Oversampling methods
Synthetic Minority Oversampling Technique (SMOTE)
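The first two approaches can be sketched with random resampling. This is a minimal illustration, not a reference implementation: the function names, the toy rows, and the fixed seed are assumptions. SMOTE is not implemented here; instead of duplicating minority rows as below, it creates new synthetic rows by interpolating between a minority sample and its nearest minority neighbours.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical data: 8 rows of class 0 and 2 rows of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)

print(Counter(over_labels))   # both classes now have 8 rows
print(Counter(under_labels))  # both classes now have 2 rows
```

Resampling is normally applied to the training data only, so that the test data keeps the real class proportions.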
8. Encoding techniques
Data usually comes in one of three types:
1. Nominal: a variable that represents names, numbers, or places with no order
2. Ordinal: a variable that represents data with some order
3. Binary: data with only two values, such as 0 or 1, yes or no, True or False
10. Data types with examples
1. Binary: either/or.
   Examples: yes/no; true/false
2. Ordinal: specific ordered groups.
   Examples: low, medium, high; cold, hot, lava hot
3. Nominal: unordered groups.
   Examples: cat, dog, tiger; pizza, burger, coke
11. Encoding techniques
Encoding converts all categories in a variable into numerical values
Encoding Types
1. One-hot Encoding
2. Label Encoding
3. Frequency Encoding
4. Mean Encoding
5. Top 10 Mean Frequency Encoding
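The first four encodings can be compared on one small column. The sketch below is plain Python with made-up data: the `colors` column, the `target` values, and all variable names are hypothetical, and the last type in the list is not shown. Label codes are assigned in alphabetical order, which is an arbitrary choice.

```python
from collections import Counter, defaultdict

# Hypothetical categorical column and a binary target for mean encoding.
colors = ["red", "green", "red", "blue", "red", "green"]
target = [1, 0, 1, 0, 0, 1]

categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: each category gets an integer code.
label_map = {cat: idx for idx, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Frequency encoding: each category is replaced by its share of the rows.
counts = Counter(colors)
frequency_encoded = [counts[c] / len(colors) for c in colors]

# Mean encoding: each category is replaced by the mean target value of its rows.
target_sums = defaultdict(float)
for c, t in zip(colors, target):
    target_sums[c] += t
mean_map = {cat: target_sums[cat] / counts[cat] for cat in categories}
mean_encoded = [mean_map[c] for c in colors]

print(label_encoded)  # [2, 1, 2, 0, 2, 1]
print(one_hot[0])     # [0, 0, 1]  ("red")
print(frequency_encoded[0], mean_encoded[0])
```

Label encoding implies an order between the codes, so it suits ordinal data better than nominal data. Mean encoding uses the target, so the mapping should be computed from the training data only to avoid leaking test information.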
12. Splitting Data
Train test split is nothing but splitting the data into train data and test data
1. Train data is used for training the model
2. Test data is used for testing the model performance
The split produces four sets:
1. x_train
2. x_test
3. y_train
4. y_test
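How the four sets relate to each other can be shown with a small hand-written split. This is a sketch only: the helper name `train_test_split_sketch`, the 25% test size, and the seed are assumptions. scikit-learn's `train_test_split` returns its results in the same order (x_train, x_test, y_train, y_test).

```python
import random


def train_test_split_sketch(x, y, test_size=0.25, seed=42):
    """Shuffle row indices once, then cut them into a test part and a train part."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # The same indices are used for x and y, so features and labels stay paired.
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical data: 8 rows with 2 features each; the label is the parity of the first feature.
x = [[i, i * 2] for i in range(8)]
y = [i % 2 for i in range(8)]

x_train, x_test, y_train, y_test = train_test_split_sketch(x, y)

print(len(x_train), len(x_test))  # 6 2
```

The model is fitted on x_train and y_train only; x_test and y_test are held back and used once to measure how well the model performs on data it has not seen.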