This Edureka Linear Regression tutorial will help you understand all the basics of the linear regression machine learning algorithm, along with examples. This tutorial is ideal both for beginners and for professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression Examples
5) Linear Regression Use Cases
6) Demo in R: Real Estate Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
www.edureka.co/data-science | Edureka’s Data Science Certification Training
What Will You Learn Today?
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression - Example
5) Linear Regression - Use Cases
6) Demo in R: Real Estate Use Case
Introduction To Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.

[Diagram: Training Data → Learning Algorithm → Build Model → Perform, with Feedback looping back into training]
Machine Learning - Example
Facebook's News Feed uses machine learning to personalize each member's feed. When you upload photos to Facebook, the service automatically highlights faces and suggests friends to tag.

Facebook also uses AI (Artificial Intelligence) to personalize:
• Newsfeeds
• Advertisements
• Trending news
• Friend recommendations
What Is Regression?
Regression analysis is a predictive modelling technique. It estimates the relationship between a dependent (target) variable and an independent variable (predictor).

[Plot: for an input value of 7.00 on the x-axis, the fitted line gives a predicted outcome of 123.9 on the y-axis]
Types Of Regression
Linear Regression
• When there is a linear relationship between the independent and dependent variables.

Logistic Regression
• When the dependent variable is binary (0/1, True/False, Yes/No) in nature.

Polynomial Regression
• When the power of the independent variable is more than 1.
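The first and third types can be illustrated with a short R sketch on simulated data (the numbers below are made up for illustration): a straight line captures a linear trend, while a quadratic term is needed once the power of the independent variable exceeds 1.

```r
set.seed(42)
x <- seq(1, 10, by = 0.5)

# Linear relationship: y grows proportionally with x
y_lin   <- 3 + 2 * x + rnorm(length(x), sd = 0.5)
fit_lin <- lm(y_lin ~ x)

# Polynomial relationship: y depends on x^2, so a degree-2 fit is needed
y_quad   <- 1 + 0.5 * x^2 + rnorm(length(x), sd = 0.5)
fit_quad <- lm(y_quad ~ poly(x, 2, raw = TRUE))

coef(fit_lin)                 # slope recovered close to the true value 2
summary(fit_quad)$r.squared   # near 1: the quadratic captures the curve
```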
Linear Regression - Introduction
The linear regression model assumes a linear relationship between the input variables and the outcome variable. This relationship can be expressed as:

y = β0 + β1x + ε

Where:
• y = outcome variable
• x = input variable
• β0 = intercept
• β1 = slope of the line
• ε = random error
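As a quick sanity check of this equation, here is a minimal R sketch (the values β0 = 4, β1 = 1.5 and the noise level are made up): we simulate data from y = β0 + β1x + ε and let lm() recover the intercept and slope.

```r
set.seed(1)
n   <- 200
x   <- runif(n, 0, 10)    # input variable
eps <- rnorm(n, sd = 1)   # random error term
y   <- 4 + 1.5 * x + eps  # true beta0 = 4, beta1 = 1.5

fit <- lm(y ~ x)
coef(fit)  # estimated intercept (beta0) and slope (beta1)
```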
Model Description
Scatterplot of weight vs. height

We have a dataset of 10 students. We will use it to draw a scatterplot of weight against height:

[Scatterplot: heights from 62 to 76 on the x-axis; weights of 127, 121, 142, 157, 162, 156, 169, 165, 181 and 208 on the y-axis]
Model Description
Scatterplot of weight vs. height

Now the natural question arises: "what is the best fitting line?"

The prediction error (or residual) is:

eᵢ = yᵢ − ŷᵢ

Where:
• yᵢ is the observed value for unit i (i.e., student i)
• ŷᵢ is the predicted response (or fitted value) for unit i

The goal is to minimize the sum of the squared prediction errors (the least squares criterion):

Q = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

[Scatterplot with two candidate lines: W1 = -266.5 + 6.1h and W2 = -331.2 + 7.1h]
Model Description
Comparing the two candidate lines:
• W1 = -266.5 + 6.1h has a least squared error of 597.4
• W2 = -331.2 + 7.1h has a least squared error of 766.5
Model Description
The solid line, represented by w = -266.53 + 6.1376h, is the best fit line, as the least squared error is minimum for it.

[Scatterplot of weight vs. height with the best fit line W1 = -266.5 + 6.1h drawn through the points]
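The comparison of the two candidate lines can be reproduced in R. The weights below are the ones shown on the slide; the heights are a plausible reconstruction (only the axis range 62-76 survives in the slide), so treat the numbers as illustrative. The point is the mechanics: compute the sum of squared errors for each line, and note that lm() finds the line that minimises it.

```r
# Weights from the slide; heights reconstructed for illustration
h <- c(63, 64, 66, 69, 69, 71, 71, 72, 73, 75)
w <- c(127, 121, 142, 157, 162, 156, 169, 165, 181, 208)

# The two candidate lines from the slide
pred1 <- -266.5 + 6.1 * h  # W1
pred2 <- -331.2 + 7.1 * h  # W2

sse1 <- sum((w - pred1)^2)  # sum of squared errors for W1
sse2 <- sum((w - pred2)^2)  # sum of squared errors for W2

# lm() minimises the SSE over all possible lines
fit      <- lm(w ~ h)
sse_best <- sum(residuals(fit)^2)
c(sse1 = sse1, sse2 = sse2, sse_best = sse_best)
```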
Linear Regression - Example
Let's take an example. A company is facing a high churn-out rate this year, with salary hike being one of the major reasons. So let us consider the company's data, where we will find out the relationship between these two variables.

Here, the dependent variable is Churn_out_rate and the independent variable is Salary_hike.
Linear Regression - Example
> plot(Salary_hike, Churn_out_rate)

x-axis = Salary_hike
y-axis = Churn_out_rate

Conclusion: From the graph, we can see that as the salary hike increases, the churn-out rate decreases.
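The company's raw data isn't reproduced on the slide, so here is a hedged sketch with simulated figures (hike amounts and churn rates are invented) showing the same negative relationship: the fitted slope of Churn_out_rate on Salary_hike comes out negative.

```r
set.seed(7)
Salary_hike    <- seq(1000, 3000, by = 100)  # hypothetical hike amounts
Churn_out_rate <- 90 - 0.02 * Salary_hike +
                  rnorm(length(Salary_hike), sd = 2)

fit <- lm(Churn_out_rate ~ Salary_hike)
coef(fit)[["Salary_hike"]]  # negative: bigger hike, lower churn

# plot(Salary_hike, Churn_out_rate); abline(fit)  # visual check
```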
Linear Regression - Use Cases
Real Estate
To model residential home prices as a function of the home's living area, bathrooms, number of bedrooms, and lot size.

Medicine
To analyze the effect of a proposed radiation treatment on reducing tumor sizes, based on patient attributes such as age or weight.

Demand forecasting
To predict demand for goods and services. For example, restaurant chains can predict the quantity of food needed depending on the weather.

Marketing
To predict a company's sales based on the previous month's sales and the company's stock prices.
A real estate consultation firm has data comprising the prices of apartments in Boston. Based on this data, the company wants to decide the price of new apartments.
Demo
Pipeline: Data acquisition → Divide dataset → Exploratory analysis → Implement model → Optimize model → Prediction → Model validation

Let's use the inbuilt Boston housing data for linear regression analysis. To load it, we can use the following code:

library(MASS)
data(Boston)

The Boston data looks like this: [table preview of the Boston data frame]
Demo
For a description of the data we can use:
> ?Boston
It will contain details about the data, such as:
• number of rows and columns
• attribute descriptions
Let's move forward to see the description of the attributes.
Demo
Description
The Boston data frame has 506 rows and 14 columns.
Demo
We will divide our entire dataset into two subsets:
• Training dataset → to train the model
• Testing dataset → to validate the model and make predictions

Here we will divide the data in a 7:3 ratio, so that 70% of it forms the training set and the remaining 30% the testing set.
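The split itself is a couple of lines of base R. A minimal sketch, with a simulated stand-in for the Boston data frame so the snippet is self-contained (with the real data you would use Boston in place of df):

```r
set.seed(123)
# Simulated stand-in with the same number of rows as Boston (506)
df <- data.frame(rm = rnorm(506, mean = 6), medv = rnorm(506, mean = 22))

# Sample 70% of the row indices for the training set
train_idx     <- sample(nrow(df), size = floor(0.7 * nrow(df)))
training_data <- df[train_idx, ]
testing_data  <- df[-train_idx, ]

nrow(training_data)  # 354 rows (70%)
nrow(testing_data)   # 152 rows (30%)
```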
Demo
Let's find the relation among all the variables through a scatterplot matrix:

library(lattice)
splom(~Boston[c(1:6, 14)], groups = NULL, data = Boston, axis.line.tck = 0, axis.text.alpha = 0)

Let's check the plots.
Demo
The plot shows a positive linear trend between rm (average number of rooms) and medv (median value of home). There is no relevant relationship between indus (proportion of non-retail business) and medv.
Demo
The plot shows a negative linear trend between lstat (lower status of the population) and medv. There is no relevant relationship between tax (property tax rate) and medv.
Demo
When we have several variables, correlation is an important factor for checking the dependencies among them. Correlation analysis gives us insight into the mutual relationships among variables.

To get the correlation among the different variables of a dataset, use the following code:
> cr <- cor(Boston)
This will give us the correlation values.

For visualizing the same, we can use the corrplot() function:
> library(corrplot)
> corrplot(cr, type = "lower")
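If corrplot isn't installed, the correlation matrix itself is already informative. A self-contained sketch with simulated columns standing in for the Boston variables (the relationships are invented to mimic the ones described on the following slides):

```r
set.seed(99)
n     <- 300
rooms <- rnorm(n, mean = 6, sd = 0.7)       # stand-in for rm
lstat <- 20 - 1.5 * rooms + rnorm(n)        # negatively tied to rooms
medv  <- 10 + 5 * rooms + rnorm(n, sd = 2)  # positively tied to rooms

cr <- cor(data.frame(rm = rooms, lstat = lstat, medv = medv))
round(cr, 2)  # rm/medv strongly positive, lstat/medv negative
```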
Demo
From the plot we can get the visual relationships among the different variables:
• Dark blue signifies a strong positive relationship
• Dark red signifies a strong negative relationship
• The scale varies from red to blue, and the size of each circle varies according to the correlation factor

Example: medv and lstat have a strong negative relationship, while medv and rm have a strong positive relationship.
Demo
Multicollinearity exists when two or more predictors are highly correlated among themselves. When the correlation among the X's is low, OLS has lots of information with which to estimate the coefficients. When the correlation among the X's is high, OLS has very little information, which makes us relatively uncertain about our estimates.
Demo
How can I detect multicollinearity? You can use the VIF (variance inflation factor) for it. Let's see how.
Demo
The variance inflation factor (VIF) measures the increase in the variance (the square of the estimate's standard deviation) of an estimated regression coefficient due to multicollinearity. A VIF of 1 means that a predictor is not correlated with the other predictors.

Here, rad and tax have higher variance inflation factor values, indicating high multicollinearity. nox, indus and dis are moderately correlated.
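VIF has a simple recipe: regress each predictor on all the others and compute 1 / (1 − R²). Packages such as car provide a vif() function, but a hand-rolled base-R version makes the idea concrete (the predictors here are simulated, with x1 and x2 deliberately built to be nearly collinear):

```r
set.seed(5)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1 -> collinear
x3 <- rnorm(n)                 # independent predictor
X  <- data.frame(x1, x2, x3)

# VIF of predictor j: regress it on the remaining predictors
vif_one <- function(j, X) {
  r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(seq_along(X), vif_one, X = X)
names(vifs) <- names(X)
round(vifs, 1)  # x1 and x2 have large VIFs, x3 stays near 1
```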
Demo
Let's check the correlation between rad and tax from the corrgram. rad and tax are highly correlated, at 0.91. We can remove one of the two predictors (rad or tax) to remove the multicollinearity.
Demo
Let's find the equation representing this best fit line:
> summary(model)
As per the summary, the equation representing our regression line is:
medv = -34.671 + 9.102 * rm
Demo
Now let's build a model with the help of the training set using the code below. Here we will be using all the variables excluding tax:
> model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + ptratio + black + lstat, data = training_data)
Demo
A description of the model can be found using the summary() function:
> summary(model)
Some of the important values are:
1. R-squared value
2. P-values
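Both values can be read programmatically from the summary object. A sketch on simulated data (the real model is fit on the Boston training set; here y is driven only by x1, so x2 plays the role of an insignificant predictor):

```r
set.seed(11)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)  # x2 has no real effect on y

s <- summary(lm(y ~ x1 + x2))

s$r.squared                   # R-squared of the fit
s$coefficients[, "Pr(>|t|)"]  # p-value for each coefficient
```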
Demo
Here, R-squared = 0.726. The R-squared value indicates how well the model explains the variation in the outcome: the closer the R-squared value is to 1.0, the better the linear model fits the data.
Demo
P-values are used to determine statistical significance in a hypothesis test:
• High p-values: your data are likely under a true null hypothesis.
• Low p-values: your data are unlikely under a true null hypothesis.

Here, indus and age have relatively high p-values, so they can be dropped from the model.
Demo
Now let's rebuild the model with the help of the training set using the code below. Here we will be excluding indus and age:
> model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = training_data)
Demo
Here, the adjusted R-squared value remained the same despite removing indus and age from the model.
Demo
Now we can use our model to predict the output for our testing dataset. We can use the following code for predicting the output:
> predic <- predict(model, testing_data)
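Putting the split, fit, and predict steps together in one self-contained sketch (simulated data stands in for Boston; the relationship medv = -30 + 9*rm and the noise level are invented):

```r
set.seed(2024)
n  <- 500
df <- data.frame(rm = rnorm(n, mean = 6, sd = 0.7))
df$medv <- -30 + 9 * df$rm + rnorm(n, sd = 3)

idx           <- sample(n, size = floor(0.7 * n))
training_data <- df[idx, ]
testing_data  <- df[-idx, ]

model  <- lm(medv ~ rm, data = training_data)
predic <- predict(model, testing_data)

# Root mean squared error on the held-out 30%
rmse <- sqrt(mean((testing_data$medv - predic)^2))
rmse  # close to the noise standard deviation (about 3)
```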
Demo
For comparing these values we can use plots. Here we plot a line graph where the green line represents the actual prices and the blue line represents the values predicted by the model:

plot(testing_data$medv, type = "l", lty = 1, col = "green")
lines(predic, type = "l", col = "blue")

As we can see from the graph, most of the predicted values overlap the actual values.
Demo
I have this dataset. What will be the estimated cost of an apartment? Here's the code line and the predicted value.
Course Details
Go to www.edureka.co/data-science
Get Edureka Certified in Data Science Today!
What our learners have to say about us!
Shravan Reddy says- “I would like to recommend any one who
wants to be a Data Scientist just one place: Edureka. Explanations
are clean, clear, easy to understand. Their support team works
very well.. I took the Data Science course and I'm going to take
Machine Learning with Mahout and then Big Data and Hadoop”.
Gnana Sekhar says - “Edureka Data science course provided me a very
good mixture of theoretical and practical training. LMS pre recorded
sessions and assignments were very good as there is a lot of
information in them that will help me in my job. Edureka is my
teaching GURU now...Thanks EDUREKA.”
Balu Samaga says - “It was a great experience to undergo and get
certified in the Data Science course from Edureka. Quality of the
training materials, assignments, project, support and other
infrastructures are a top notch.”