This Edureka Linear Regression tutorial will help you understand all the basics of the linear regression machine learning algorithm, along with examples. This tutorial is ideal both for beginners and for professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression Examples
5) Linear Regression Use Cases
6) Demo in R: Real Estate Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
www.edureka.co/data-science | Edureka’s Data Science Certification Training
What Will You Learn Today?
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression - Example
5) Linear Regression - Use Cases
6) Demo in R: Real Estate Use Case
Introduction To Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.

[Diagram: Training Data → Learning Algorithm → Build Model → Perform, with Feedback looping back into training]
Machine Learning - Example
Facebook's News Feed uses machine learning to personalize each member's feed. When you upload photos to Facebook, the service automatically highlights faces and suggests friends to tag.

Facebook also uses AI (Artificial Intelligence) to personalize:
• Newsfeeds
• Advertisements
• Trending news
• Friend recommendations
What Is Regression?
Regression analysis is a predictive modelling technique. It estimates the relationship between a dependent (target) variable and an independent variable (predictor).

[Plot: for an input value of 7.00 on the x-axis, the fitted line gives a predicted outcome of 123.9 on the y-axis]
Types Of Regression
Linear Regression
• When there is a linear relationship between the independent and dependent variables.

Logistic Regression
• When the dependent variable is binary (0/1, True/False, Yes/No) in nature.

Polynomial Regression
• When the power of the independent variable is more than 1.
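The first and third types can be illustrated with a short R sketch on simulated data (the numbers below are made up for illustration): a straight line captures a linear trend, while a quadratic term is needed once the power of the independent variable exceeds 1.

```r
set.seed(42)
x <- seq(1, 10, by = 0.5)

# Linear relationship: y grows proportionally with x
y_lin   <- 3 + 2 * x + rnorm(length(x), sd = 0.5)
fit_lin <- lm(y_lin ~ x)

# Polynomial relationship: y depends on x^2, so a degree-2 fit is needed
y_quad   <- 1 + 0.5 * x^2 + rnorm(length(x), sd = 0.5)
fit_quad <- lm(y_quad ~ poly(x, 2, raw = TRUE))

coef(fit_lin)                 # slope recovered close to the true value 2
summary(fit_quad)$r.squared   # near 1: the quadratic captures the curve
```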
Linear Regression - Introduction
The linear regression model assumes a linear relationship between the input variables and the outcome variable. This relationship can be expressed as:

y = β0 + β1x + ε

Where:
• y = outcome variable
• x = input variable
• β0 = intercept
• β1 = slope of the line
• ε = random error
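As a quick sanity check of this equation, here is a minimal R sketch (the values β0 = 4, β1 = 1.5 and the noise level are made up): we simulate data from y = β0 + β1x + ε and let lm() recover the intercept and slope.

```r
set.seed(1)
n   <- 200
x   <- runif(n, 0, 10)    # input variable
eps <- rnorm(n, sd = 1)   # random error term
y   <- 4 + 1.5 * x + eps  # true beta0 = 4, beta1 = 1.5

fit <- lm(y ~ x)
coef(fit)  # estimated intercept (beta0) and slope (beta1)
```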
Model Description
Scatterplot of weight vs. height

We have a dataset of 10 students. We will use it to draw a scatterplot of weight against height:

[Scatterplot: heights from 62 to 76 on the x-axis; weights of 127, 121, 142, 157, 162, 156, 169, 165, 181 and 208 on the y-axis]
Model Description
Scatterplot of weight vs. height

Now the natural question arises: "what is the best fitting line?"

The prediction error (or residual) is:

eᵢ = yᵢ − ŷᵢ

Where:
• yᵢ is the observed value for unit i (i.e., student i)
• ŷᵢ is the predicted response (or fitted value) for unit i

The goal is to minimize the sum of the squared prediction errors (the least squares criterion):

Q = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

[Scatterplot with two candidate lines: W1 = -266.5 + 6.1h and W2 = -331.2 + 7.1h]
Model Description
Comparing the two candidate lines:
• W1 = -266.5 + 6.1h has a least squared error of 597.4
• W2 = -331.2 + 7.1h has a least squared error of 766.5
Model Description
The solid line, represented by w = -266.53 + 6.1376h, is the best fit line, as the least squared error is minimum for it.

[Scatterplot of weight vs. height with the best fit line W1 = -266.5 + 6.1h drawn through the points]
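The comparison of the two candidate lines can be reproduced in R. The weights below are the ones shown on the slide; the heights are a plausible reconstruction (only the axis range 62-76 survives in the slide), so treat the numbers as illustrative. The point is the mechanics: compute the sum of squared errors for each line, and note that lm() finds the line that minimises it.

```r
# Weights from the slide; heights reconstructed for illustration
h <- c(63, 64, 66, 69, 69, 71, 71, 72, 73, 75)
w <- c(127, 121, 142, 157, 162, 156, 169, 165, 181, 208)

# The two candidate lines from the slide
pred1 <- -266.5 + 6.1 * h  # W1
pred2 <- -331.2 + 7.1 * h  # W2

sse1 <- sum((w - pred1)^2)  # sum of squared errors for W1
sse2 <- sum((w - pred2)^2)  # sum of squared errors for W2

# lm() minimises the SSE over all possible lines
fit      <- lm(w ~ h)
sse_best <- sum(residuals(fit)^2)
c(sse1 = sse1, sse2 = sse2, sse_best = sse_best)
```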
Linear Regression - Example
Let's take an example. A company is facing a high churn-out rate this year, with salary hike being one of the major reasons. So let us consider the company's data, where we will find out the relationship between these two variables.

Here, the dependent variable is Churn_out_rate and the independent variable is Salary_hike.
Linear Regression - Example
> plot(Salary_hike, Churn_out_rate)

x-axis = Salary_hike
y-axis = Churn_out_rate

Conclusion: From the graph, we can see that as the salary hike increases, the churn-out rate decreases.
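The company's raw data isn't reproduced on the slide, so here is a hedged sketch with simulated figures (hike amounts and churn rates are invented) showing the same negative relationship: the fitted slope of Churn_out_rate on Salary_hike comes out negative.

```r
set.seed(7)
Salary_hike    <- seq(1000, 3000, by = 100)  # hypothetical hike amounts
Churn_out_rate <- 90 - 0.02 * Salary_hike +
                  rnorm(length(Salary_hike), sd = 2)

fit <- lm(Churn_out_rate ~ Salary_hike)
coef(fit)[["Salary_hike"]]  # negative: bigger hike, lower churn

# plot(Salary_hike, Churn_out_rate); abline(fit)  # visual check
```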
Linear Regression - Use Cases
Real Estate
To model residential home prices as a function of the home's living area, bathrooms, number of bedrooms, and lot size.

Medicine
To analyze the effect of a proposed radiation treatment on reducing tumor sizes, based on patient attributes such as age or weight.

Demand forecasting
To predict demand for goods and services. For example, restaurant chains can predict the quantity of food needed depending on the weather.

Marketing
To predict a company's sales based on the previous month's sales and the company's stock prices.
A real estate consultation firm has data comprising the prices of apartments in Boston. Based on this data, the company wants to decide the price of new apartments.
Demo
Pipeline: Data acquisition → Divide dataset → Exploratory analysis → Implement model → Optimize model → Prediction → Model validation

Let's use the inbuilt Boston housing data for linear regression analysis. To load it, we can use the following code:

library(MASS)
data(Boston)

The Boston data looks like this: [table preview of the Boston data frame]
Demo
For a description of the data we can use:
> ?Boston
It will contain details about the data, such as:
• number of rows and columns
• attribute descriptions
Let's move forward to see the description of the attributes.
Demo
Description
The Boston data frame has 506 rows and 14 columns.
Demo
We will divide our entire dataset into two subsets:
• Training dataset → to train the model
• Testing dataset → to validate the model and make predictions

Here we will divide the data in a 7:3 ratio, so that 70% of it forms the training set and the remaining 30% the testing set.
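The split itself is a couple of lines of base R. A minimal sketch, with a simulated stand-in for the Boston data frame so the snippet is self-contained (with the real data you would use Boston in place of df):

```r
set.seed(123)
# Simulated stand-in with the same number of rows as Boston (506)
df <- data.frame(rm = rnorm(506, mean = 6), medv = rnorm(506, mean = 22))

# Sample 70% of the row indices for the training set
train_idx     <- sample(nrow(df), size = floor(0.7 * nrow(df)))
training_data <- df[train_idx, ]
testing_data  <- df[-train_idx, ]

nrow(training_data)  # 354 rows (70%)
nrow(testing_data)   # 152 rows (30%)
```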
Demo
Let's find the relation among all the variables through a scatterplot matrix:

library(lattice)
splom(~Boston[c(1:6, 14)], groups = NULL, data = Boston, axis.line.tck = 0, axis.text.alpha = 0)

Let's check the plots.
Demo
The plot shows a positive linear trend between rm (average number of rooms) and medv (median value of home). There is no relevant relationship between indus (proportion of non-retail business) and medv.
Demo
The plot shows a negative linear trend between lstat (lower status of the population) and medv. There is no relevant relationship between tax (property tax rate) and medv.
Demo
When we have several variables, correlation is an important factor for checking the dependencies among them. Correlation analysis gives us insight into the mutual relationships among variables.

To get the correlation among the different variables of a dataset, use the following code:
> cr <- cor(Boston)
This will give us the correlation values.

For visualizing the same, we can use the corrplot() function:
> library(corrplot)
> corrplot(cr, type = "lower")
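If corrplot isn't installed, the correlation matrix itself is already informative. A self-contained sketch with simulated columns standing in for the Boston variables (the relationships are invented to mimic the ones described on the following slides):

```r
set.seed(99)
n     <- 300
rooms <- rnorm(n, mean = 6, sd = 0.7)       # stand-in for rm
lstat <- 20 - 1.5 * rooms + rnorm(n)        # negatively tied to rooms
medv  <- 10 + 5 * rooms + rnorm(n, sd = 2)  # positively tied to rooms

cr <- cor(data.frame(rm = rooms, lstat = lstat, medv = medv))
round(cr, 2)  # rm/medv strongly positive, lstat/medv negative
```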
Demo
From the plot we can get the visual relationships among the different variables:
• Dark blue signifies a strong positive relationship
• Dark red signifies a strong negative relationship
• The scale varies from red to blue, and the size of each circle varies according to the correlation factor

Example: medv and lstat have a strong negative relationship, while medv and rm have a strong positive relationship.
Demo
Multicollinearity exists when two or more predictors are highly correlated among themselves. When the correlation among the X's is low, OLS has lots of information with which to estimate the coefficients. When the correlation among the X's is high, OLS has very little information, which makes us relatively uncertain about our estimates.
Demo
How can I detect multicollinearity? You can use the VIF (variance inflation factor) for it. Let's see how.
Demo
The variance inflation factor (VIF) measures the increase in the variance (the square of the estimate's standard deviation) of an estimated regression coefficient due to multicollinearity. A VIF of 1 means that a predictor is not correlated with the other predictors.

Here, rad and tax have higher variance inflation factor values, indicating high multicollinearity. nox, indus and dis are moderately correlated.
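VIF has a simple recipe: regress each predictor on all the others and compute 1 / (1 − R²). Packages such as car provide a vif() function, but a hand-rolled base-R version makes the idea concrete (the predictors here are simulated, with x1 and x2 deliberately built to be nearly collinear):

```r
set.seed(5)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1 -> collinear
x3 <- rnorm(n)                 # independent predictor
X  <- data.frame(x1, x2, x3)

# VIF of predictor j: regress it on the remaining predictors
vif_one <- function(j, X) {
  r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(seq_along(X), vif_one, X = X)
names(vifs) <- names(X)
round(vifs, 1)  # x1 and x2 have large VIFs, x3 stays near 1
```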
Demo
Let's check the correlation between rad and tax from the corrgram. rad and tax are highly correlated, at 0.91. We can remove one of the two predictors (rad or tax) to remove the multicollinearity.
Demo
Let's find the equation representing this best fit line:
> summary(model)
As per the summary, the equation representing our regression line is:
medv = -34.671 + 9.102 * rm
Demo
Now let's build a model with the help of the training set using the code below. Here we will be using all the variables excluding tax:
> model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + ptratio + black + lstat, data = training_data)
Demo
A description of the model can be found using the summary() function:
> summary(model)
Some of the important values are:
1. R-squared value
2. P-values
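Both values can be read programmatically from the summary object. A sketch on simulated data (the real model is fit on the Boston training set; here y is driven only by x1, so x2 plays the role of an insignificant predictor):

```r
set.seed(11)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)  # x2 has no real effect on y

s <- summary(lm(y ~ x1 + x2))

s$r.squared                   # R-squared of the fit
s$coefficients[, "Pr(>|t|)"]  # p-value for each coefficient
```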
Demo
Here, R-squared = 0.726. The R-squared value indicates how well the model explains the variation in the outcome: the closer the R-squared value is to 1.0, the better the linear model fits the data.
Demo
P-values are used to determine statistical significance in a hypothesis test:
• High p-values: your data are likely under a true null hypothesis.
• Low p-values: your data are unlikely under a true null hypothesis.

Here, indus and age have relatively high p-values, so they can be dropped from the model.
Demo
Now let's rebuild the model with the help of the training set using the code below. Here we will be excluding indus and age:
> model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = training_data)
Demo
Here, the adjusted R-squared value remained the same despite removing indus and age from the model.
Demo
Now we can use our model to predict the output for our testing dataset. We can use the following code for predicting the output:
> predic <- predict(model, testing_data)
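Putting the split, fit, and predict steps together in one self-contained sketch (simulated data stands in for Boston; the relationship medv = -30 + 9*rm and the noise level are invented):

```r
set.seed(2024)
n  <- 500
df <- data.frame(rm = rnorm(n, mean = 6, sd = 0.7))
df$medv <- -30 + 9 * df$rm + rnorm(n, sd = 3)

idx           <- sample(n, size = floor(0.7 * n))
training_data <- df[idx, ]
testing_data  <- df[-idx, ]

model  <- lm(medv ~ rm, data = training_data)
predic <- predict(model, testing_data)

# Root mean squared error on the held-out 30%
rmse <- sqrt(mean((testing_data$medv - predic)^2))
rmse  # close to the noise standard deviation (about 3)
```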
Demo
For comparing these values we can use plots. Here we plot a line graph where the green line represents the actual prices and the blue line represents the values predicted by the model:

plot(testing_data$medv, type = "l", lty = 1, col = "green")
lines(predic, type = "l", col = "blue")

As we can see from the graph, most of the predicted values overlap the actual values.
Demo
I have this dataset. What will be the estimated cost of an apartment? Here's the code line and the predicted value.
Course Details
Go to www.edureka.co/data-science
Get Edureka Certified in Data Science Today!
What our learners have to say about us!
Shravan Reddy says- “I would like to recommend any one who
wants to be a Data Scientist just one place: Edureka. Explanations
are clean, clear, easy to understand. Their support team works
very well.. I took the Data Science course and I'm going to take
Machine Learning with Mahout and then Big Data and Hadoop”.
Gnana Sekhar says - “Edureka Data science course provided me a very
good mixture of theoretical and practical training. LMS pre recorded
sessions and assignments were very good as there is a lot of
information in them that will help me in my job. Edureka is my
teaching GURU now...Thanks EDUREKA.”
Balu Samaga says - “It was a great experience to undergo and get
certified in the Data Science course from Edureka. Quality of the
training materials, assignments, project, support and other
infrastructures are a top notch.”