preprocessing techniques.pptx


  1. Data preprocessing concepts
  2. Table of Contents
     - Feature Engineering
     - Feature Scaling
     - Train-test split
     - Encoding techniques
     - Imbalanced data
  3. Feature Engineering
     - A dataset with too many columns suffers from the curse of dimensionality (high dimensionality), and the model does not perform well.
     - In that case we drop the columns that are not useful:
       - columns with more than 70% missing values
       - columns with low variance (the values barely change, so the column carries little information)
       - columns that are highly correlated with other columns
     - Finally, we can reduce the dimensionality further with PCA (principal component analysis), an unsupervised technique that does not use the labels, or with LDA (linear discriminant analysis), a supervised technique that uses the class labels.
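
The first two column filters on this slide can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the helper name `select_columns`, the toy `table`, and the `min_variance` cutoff are assumptions made for the example, and only the 70% missing-value threshold comes from the slide. The correlation filter, PCA, and LDA are not covered here.

```python
import statistics

def select_columns(table, max_missing=0.70, min_variance=1e-8):
    """Return the names of columns that pass both filters.

    table: dict mapping column name -> list of numbers (None = missing).
    max_missing: drop a column when more than this fraction is missing.
    min_variance: hypothetical cutoff; drop (near-)constant columns.
    """
    kept = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > max_missing:
            continue  # mostly missing, little to learn from
        if len(present) < 2 or statistics.pvariance(present) <= min_variance:
            continue  # values barely vary
        kept.append(name)
    return kept

# Toy data invented for the example.
table = {
    "age": [25, 32, 47, 51, 38],
    "mostly_missing": [None, None, None, None, 1.0],  # 80% missing
    "constant": [1, 1, 1, 1, 1],                      # zero variance
}
kept = select_columns(table)
print(kept)
```

Variance depends on the scale of each column, so a single variance cutoff is only meaningful when the columns are on comparable scales.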
  4. Feature Scaling
     - MinMaxScaler (min-max scaling): $x^{(i)}_{norm} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$
     - StandardScaler (standardization): $x^{(i)}_{std} = \frac{x^{(i)} - \mu_x}{\sigma_x}$
     - Min-max scaling rescales the data to a bounded range, [0, 1] with the formula above.
     - Standardization centers each feature column at mean 0 with standard deviation 1. It does not change the shape of the distribution, so a skewed feature stays skewed after scaling.
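
The two formulas on this slide can be checked with a few lines of plain Python. This is a sketch for a single feature column. The function names and the sample `feature` values are made up for the example, and the population standard deviation is used for $\sigma_x$, which is also what scikit-learn's `StandardScaler` uses.

```python
import statistics

def min_max_scale(values):
    """x_norm = (x - x_min) / (x_max - x_min); assumes x_max > x_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """x_std = (x - mean) / std; assumes the column is not constant."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

feature = [2.0, 4.0, 6.0, 8.0]  # toy column invented for the example
normed = min_max_scale(feature)
stded = standardize(feature)

print(min(normed), max(normed))  # smallest and largest scaled value
print(statistics.fmean(stded), statistics.pstdev(stded))
```

Both helpers divide by zero on a constant column, so such columns need to be dropped or handled separately before scaling.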
  5. Imbalanced Data
     - The problem of imbalanced classes arises when one class (or set of classes) has far more samples than another.
     - Examples of imbalanced classes:
       - Disease detection
       - Fraud detection
       - Spam filtering
       - Earthquake prediction
  6. Different ways to fix imbalanced data:
     - Undersampling methods
     - Oversampling methods
     - Synthetic Minority Oversampling Technique (SMOTE)
  7. Creating synthetic samples
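
The idea behind SMOTE-style synthetic samples can be sketched as follows: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line segment between them. This is a simplified illustration under stated assumptions, not the reference algorithm or any library's implementation. The function name `smote_sketch`, its parameters, and the toy points are invented for the example.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority-class points (two features each), invented for the example.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because each new point lies between two existing minority samples, the synthetic data stays inside the region the minority class already occupies. In practice the imbalanced-learn package provides a `SMOTE` class for this.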
  8. Encoding techniques
     Categorical data usually comes in three types:
     1. Nominal: the variable represents names, numbers, or places with no inherent order.
     2. Ordinal: the variable represents data with some order.
     3. Binary: the data takes one of two values, such as 0 or 1, yes or no, True or False.
  9. Examples of each type
     - Binary: either/or. Examples: (1) yes, no; (2) True, False.
     - Ordinal: specific ordered groups. Examples: (1) low, medium, high; (2) cold, hot, lava hot.
     - Nominal: unordered groups. Examples: (1) cat, dog, tiger; (2) pizza, burger, coke.
  10. Encoding techniques
      Encoding converts all the categories in a variable into numerical values.
      Encoding types:
      1. One-hot encoding
      2. Label encoding
      3. Frequency encoding
      4. Mean encoding
      5. Top 10 Mean Frequency Encoding
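
The first four encoding types listed on this slide can be illustrated on one small column. This is a minimal sketch in plain Python. The `colors` column and the binary `target` are toy data invented for the example, and the variable names are assumptions rather than any library's API.

```python
from collections import Counter, defaultdict

# Toy categorical column and binary target, invented for the example.
colors = ["red", "blue", "red", "green", "red", "blue"]
target = [1, 0, 1, 0, 0, 1]

categories = sorted(set(colors))  # fixed category order for the encoders

# Label encoding: each category becomes an integer code.
label_map = {c: i for i, c in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one 0/1 column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Frequency encoding: each category becomes its relative frequency.
counts = Counter(colors)
freq_encoded = [counts[c] / len(colors) for c in colors]

# Mean encoding: each category becomes the mean target of its rows.
sums = defaultdict(float)
for c, t in zip(colors, target):
    sums[c] += t
mean_map = {c: sums[c] / counts[c] for c in categories}
mean_encoded = [mean_map[c] for c in colors]

print(label_encoded)
print(one_hot[0])
print(freq_encoded[0], mean_map)
```

Label encoding imposes an order on the codes, so it suits ordinal variables better than nominal ones. Mean encoding uses the target, so the mapping should be computed on the training data only to avoid leaking test labels.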
  11. Splitting Data
      A train-test split divides the data into train data and test data:
      1. The train data is used to train the model.
      2. The test data is used to test the model's performance.
      The split produces four outputs:
      1. x_train
      2. x_test
      3. y_train
      4. y_test
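
A shuffled train-test split that produces the four outputs named on this slide can be sketched in plain Python. The helper `train_test_split_sketch`, its `test_size` and `seed` parameters, and the toy data are assumptions made for the example. scikit-learn's `train_test_split` returns its results in the same order.

```python
import random

def train_test_split_sketch(x, y, test_size=0.25, seed=42):
    """Shuffle row indices once, then cut them into test and train parts
    so that features and labels stay aligned."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data: 8 rows with 2 features each; the label is the parity of row id.
x = [[i, i * 2] for i in range(8)]
y = [i % 2 for i in range(8)]

x_train, x_test, y_train, y_test = train_test_split_sketch(x, y)
print(len(x_train), len(x_test))
```

Fixing the seed makes the split reproducible, which keeps model comparisons fair across runs.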
