SlideShare una empresa de Scribd logo
1 de 20
Data Preprocessing
Dataset
Country Age Salary Purchased
France 44 72000 No
Spain 27 48000 Yes
Germany 30 54000 No
Spain 38 61000 No
Germany 40 Yes
France 35 58000 Yes
Spain 52000 No
France 48 79000 Yes
Germany 50 83000 No
France 37 67000 Yes
Python
File Reading from directory in python
• from tkinter import *
• from tkinter.filedialog import askopenfilename
• root = Tk()
• root.withdraw()
• root.update()
• file_path = askopenfilename()
• root.destroy()
Importing the
libraries
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd
Importing the
dataset
• dataset = pd.read_csv('Data.csv')
• X = dataset.iloc[:, :-1].values
• y = dataset.iloc[:, 3].values
missing data
• from sklearn.preprocessing import Imputer
• imputer = Imputer(missing_values = 'NaN',
strategy = 'mean', axis = 0)
• imputer = imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
Encoding
categorical
data
• from sklearn.preprocessing import
LabelEncoder, OneHotEncoder
• labelencoder_X = LabelEncoder()
• X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
• onehotencoder =
OneHotEncoder(categorical_features = [0])
• X = onehotencoder.fit_transform(X).toarray()
Encoding the
Dependent
Variable
• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
Splitting into
Training set
and Test set
• from sklearn.cross_validation import
train_test_split
• X_train, X_test, y_train, y_test =
train_test_split(X, y, test_size = 0.2,
random_state = 42)
Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
NOTE : Apply feature scaling after splitting the data and it is
because the following
• Split it, then scale. Imagine it this way: you have no idea
what real-world data looks like, so you couldn't scale the
training data to it. Your test data is the surrogate for real-
world data, so you should treat it the same way.
• To reiterate: Split, scale your training data, then use the
scaling from your training data on the testing data.
Checking
NULL
• dataset.isnull()
• dataset.isnull().sum()
• Note : dataset is a dataframe
# Data Preprocessing Python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
R
R : Importing the
dataset
dataset = read.csv('Data.csv')
R : missing
data
• dataset$Age = ifelse(is.na(dataset$Age),
ave(dataset$Age, FUN = function(x) mean(x,
na.rm = TRUE)),
dataset$Age)
• dataset$Salary = ifelse(is.na(dataset$Salary),
ave(dataset$Salary, FUN = function(x) mean(x,
na.rm = TRUE)),
dataset$Salary)
Encoding
categorical
data
• dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany’),
labels = c(1, 2, 3))
• dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes’),
labels = c(0, 1))
R : Splitting Training
set and Test set
• PACKAGES :
• install.packages('caTools')
• library(caTools)
• set.seed(123)
split =
sample.split(dataset$DependentVariable,
SplitRatio = 0.8)
training_set = subset(dataset, split ==
TRUE)
test_set = subset(dataset, split == FALSE)
R: Feature Scaling
training_set = scale(training_set)
test_set = scale(test_set)
NOTE : we cant apply the feature scaling to
categorical data in R like python. Here we
have to apply feature selection to only non
categorical features. So our code becomes :
training_set[, 2:3] = scale(training_set [, 2:3])
test_set = scale(test_set [, 2:3])
# Data Preprocessing R
# Importing the dataset
dataset = read.csv('Data.csv')
# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Salary)
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$DependentVariable, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set = scale(training_set)
test_set = scale(test_set)

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Regression kriging
Regression krigingRegression kriging
Regression kriging
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Cubist
CubistCubist
Cubist
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Xgboost
XgboostXgboost
Xgboost
 
5. R basics
5. R basics5. R basics
5. R basics
 
10. Getting Spatial
10. Getting Spatial10. Getting Spatial
10. Getting Spatial
 
Gradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboostGradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboost
 
9 python data structure-2
9 python data structure-29 python data structure-2
9 python data structure-2
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Array 31.8.2020 updated
Array 31.8.2020 updatedArray 31.8.2020 updated
Array 31.8.2020 updated
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with R
 
Unit 2 dsa LINEAR DATA STRUCTURE
Unit 2 dsa LINEAR DATA STRUCTUREUnit 2 dsa LINEAR DATA STRUCTURE
Unit 2 dsa LINEAR DATA STRUCTURE
 
R basics
R basicsR basics
R basics
 
Optimization toolbox presentation
Optimization toolbox presentationOptimization toolbox presentation
Optimization toolbox presentation
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
 
Second chapter-java
Second chapter-javaSecond chapter-java
Second chapter-java
 

Similar a Data preprocessing for Machine Learning with R and Python

Julie Michelman - Pandas, Pipelines, and Custom Transformers
Julie Michelman - Pandas, Pipelines, and Custom TransformersJulie Michelman - Pandas, Pipelines, and Custom Transformers
Julie Michelman - Pandas, Pipelines, and Custom Transformers
PyData
 

Similar a Data preprocessing for Machine Learning with R and Python (20)

Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
 
knn classification
knn classificationknn classification
knn classification
 
Scikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_python
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-Python
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
Introduction to deep learning using python
Introduction to deep learning using pythonIntroduction to deep learning using python
Introduction to deep learning using python
 
multiple linear regression
multiple linear regressionmultiple linear regression
multiple linear regression
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]
 
ML .pptx
ML .pptxML .pptx
ML .pptx
 
Pythonで機械学習入門以前
Pythonで機械学習入門以前Pythonで機械学習入門以前
Pythonで機械学習入門以前
 
Pythonbrasil - 2018 - Acelerando Soluções com GPU
Pythonbrasil - 2018 - Acelerando Soluções com GPUPythonbrasil - 2018 - Acelerando Soluções com GPU
Pythonbrasil - 2018 - Acelerando Soluções com GPU
 
Julie Michelman - Pandas, Pipelines, and Custom Transformers
Julie Michelman - Pandas, Pipelines, and Custom TransformersJulie Michelman - Pandas, Pipelines, and Custom Transformers
Julie Michelman - Pandas, Pipelines, and Custom Transformers
 
simple linear regression
simple linear regressionsimple linear regression
simple linear regression
 
Statistics in Data Science with Python
Statistics in Data Science with PythonStatistics in Data Science with Python
Statistics in Data Science with Python
 
Naïve Bayes.pptx
Naïve Bayes.pptxNaïve Bayes.pptx
Naïve Bayes.pptx
 
ML with python.pdf
ML with python.pdfML with python.pdf
ML with python.pdf
 
AI02_Python (cont.).pptx
AI02_Python (cont.).pptxAI02_Python (cont.).pptx
AI02_Python (cont.).pptx
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Multi Linesr Regresion.pptx
Multi Linesr Regresion.pptxMulti Linesr Regresion.pptx
Multi Linesr Regresion.pptx
 

Más de Akhilesh Joshi

Más de Akhilesh Joshi (11)

random forest regression
random forest regressionrandom forest regression
random forest regression
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
 
support vector regression
support vector regressionsupport vector regression
support vector regression
 
R square vs adjusted r square
R square vs adjusted r squareR square vs adjusted r square
R square vs adjusted r square
 
Design patterns
Design patternsDesign patterns
Design patterns
 
Bastion Host : Amazon Web Services
Bastion Host : Amazon Web ServicesBastion Host : Amazon Web Services
Bastion Host : Amazon Web Services
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduce
 
Google knowledge graph
Google knowledge graphGoogle knowledge graph
Google knowledge graph
 
Machine learning (domingo's paper)
Machine learning (domingo's paper)Machine learning (domingo's paper)
Machine learning (domingo's paper)
 
SoLoMo - Future of Marketing
SoLoMo - Future of MarketingSoLoMo - Future of Marketing
SoLoMo - Future of Marketing
 
Webcrawler
WebcrawlerWebcrawler
Webcrawler
 

Último

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Último (20)

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Data preprocessing for Machine Learning with R and Python

  • 2. Dataset Country Age Salary Purchased France 44 72000 No Spain 27 48000 Yes Germany 30 54000 No Spain 38 61000 No Germany 40 Yes France 35 58000 Yes Spain 52000 No France 48 79000 Yes Germany 50 83000 No France 37 67000 Yes
  • 4. File Reading from directory in python • from tkinter import * • from tkinter.filedialog import askopenfilename • root = Tk() • root.withdraw() • root.update() • file_path = askopenfilename() • root.destroy()
  • 5. Importing the libraries • import numpy as np • import matplotlib.pyplot as plt • import pandas as pd
  • 6. Importing the dataset • dataset = pd.read_csv('Data.csv') • X = dataset.iloc[:, :-1].values • y = dataset.iloc[:, 3].values
  • 7. missing data • from sklearn.preprocessing import Imputer • imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) • imputer = imputer.fit(X[:, 1:3]) • X[:, 1:3] = imputer.transform(X[:, 1:3])
  • 8. Encoding categorical data • from sklearn.preprocessing import LabelEncoder, OneHotEncoder • labelencoder_X = LabelEncoder() • X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) • onehotencoder = OneHotEncoder(categorical_features = [0]) • X = onehotencoder.fit_transform(X).toarray()
  • 9. Encoding the Dependent Variable • labelencoder_y = LabelEncoder() • y = labelencoder_y.fit_transform(y)
  • 10. Splitting into Training set and Test set • from sklearn.cross_validation import train_test_split • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
  • 11. Feature Scaling from sklearn.preprocessing import StandardScaler sc_X = StandardScaler() X_train = sc_X.fit_transform(X_train) X_test = sc_X.transform(X_test) NOTE : Apply feature scaling after splitting the data and it is because the following • Split it, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real- world data, so you should treat it the same way. • To reiterate: Split, scale your training data, then use the scaling from your training data on the testing data.
  • 13. # Data Preprocessing Python # Importing the libraries import numpy as np import matplotlib.pyplot as plt import pandas as pd # Importing the dataset dataset = pd.read_csv('Data.csv') X = dataset.iloc[:, :-1].values y = dataset.iloc[:, 3].values # Taking care of missing data from sklearn.preprocessing import Imputer imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) imputer = imputer.fit(X[:, 1:3]) X[:, 1:3] = imputer.transform(X[:, 1:3]) # Splitting the dataset into the Training set and Test set from sklearn.cross_validation import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0) # Feature Scaling from sklearn.preprocessing import StandardScaler sc_X = StandardScaler() X_train = sc_X.fit_transform(X_train) X_test = sc_X.transform(X_test) sc_y = StandardScaler() y_train = sc_y.fit_transform(y_train)
  • 14. R
  • 15. R : Importing the dataset dataset = read.csv('Data.csv')
  • 16. R : missing data • dataset$Age = ifelse(is.na(dataset$Age), ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Age) • dataset$Salary = ifelse(is.na(dataset$Salary), ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Salary)
  • 17. Encoding categorical data • dataset$Country = factor(dataset$Country, levels = c('France', 'Spain', 'Germany’), labels = c(1, 2, 3)) • dataset$Purchased = factor(dataset$Purchased, levels = c('No', 'Yes’), labels = c(0, 1))
  • 18. R : Splitting Training set and Test set • PACKAGES : • install.packages('caTools') • library(caTools) • set.seed(123) split = sample.split(dataset$DependentVariable, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE)
  • 19. R: Feature Scaling training_set = scale(training_set) test_set = scale(test_set) NOTE : we cant apply the feature scaling to categorical data in R like python. Here we have to apply feature selection to only non categorical features. So our code becomes : training_set[, 2:3] = scale(training_set [, 2:3]) test_set = scale(test_set [, 2:3])
  • 20. # Data Preprocessing R # Importing the dataset dataset = read.csv('Data.csv') # Taking care of missing data dataset$Age = ifelse(is.na(dataset$Age), ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Age) dataset$Salary = ifelse(is.na(dataset$Salary), ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Salary) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$DependentVariable, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Feature Scaling training_set = scale(training_set) test_set = scale(test_set)