A basic introduction to machine learning and a kick-start to the model-building process using linear regression. It covers the fundamentals of machine learning, a field of data science, with a focus on the supervised learning method called linear regression. It uses the R language and shows how to interpret the results of a linear regression model. Interpretation of results, tuning, and accuracy metrics such as RMSE (Root Mean Squared Error) are covered.
2. Introducing the Speaker
• Girish Gore : 10+ Years of Experience in Data Analytics / Data Science
• B.E. Computer Science from VIT Pune , M.S. from BITS Pilani
• Spent Time on Data Products Mainly in Companies like
• Cognizant (Innovations Group)
• SAS (Pricing & Revenue Management)
• VuClip (Video Entertainment)
• Shoptimize (E-Commerce)
• Worked in fields like
• Text Mining
• Forecasting and Optimization
• Recommender Systems
4. Understanding Terminologies
Artificial Intelligence
AI involves machines that can perform tasks that are characteristic of human
intelligence.
Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides
systems the ability to automatically learn and improve from experience without
being explicitly programmed.
Deep Learning
Deep Learning is an attempt to mimic the workings of the brain. Deep
Learning is one of many approaches to machine learning
6. Traditional Programming vs Machine Learning
• If programming automates processes, machine learning automates program generation, i.e. it automates automation.
• Data and the desired output are run on the computer to create a program. This program can then be used in traditional programming.
7. What is Machine Learning ?
• Machine Learning is
• study of algorithms that
• improve their performance at a particular task
• with experience ( previous data , output)
• Optimize a performance criterion using example data or past experience
• Role of Computer Science : Efficient Algorithms
• Solve the optimization problem
• Represent and Evaluate the model for inference
8. Why are we here Now !!! Google Trends !!
• Exponential increase in Data generation , accumulation
• Increasing computational power
• Growing progress in available algorithms and Research
• Software becoming too complex to write by hand
9. Common Applications of Machine Learning
• Web search: ranking pages based on what you are most likely to click on.
• Finance: deciding who to send which credit card offers to, evaluating risk on credit offers, and deciding where to invest money.
• E-commerce: predicting customer churn, and whether or not a transaction is fraudulent.
• Robotics: handling uncertainty in new environments, e.g. autonomous, self-driving cars.
• Information extraction: asking questions over databases across the web.
• Social networks: data on relationships and preferences; machine learning extracts value from data.
• Debugging: use in computer science, especially in labor-intensive processes like debugging. Could suggest where the bug could be.
• Gaming, IBM Watson
10. Types Of Machine Learning
• Learning Associations
• Supervised Learning
• Regression
• Classification
• Unsupervised Learning
• Reinforcement Learning
• Semi-supervised Learning
• Training data includes a few desired outputs. Sits between supervised and unsupervised learning
11. Learning Associations
• Market Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X and Y
are products/services.
Example: P ( diaper| beer ) = 0.7
TransactionID   BasketItems
1               Bread, Milk
2               Bread, Diaper, Beer, Eggs
3               Milk, Diaper, Beer, Coke
4               Bread, Milk, Diaper, Beer
5               Bread, Milk, Diaper, Coke
12. Learning Associations
• Support : The probability of a customer buying diaper and beer together among all sales transactions (the higher the support, the better)
• Confidence : Given that a customer picks up diaper, how likely is he/she to buy beer? (The closer to 1, the better)
• Lift : Lift compares our model with a naive model that assumes the items are bought independently, i.e. how much more likely is a customer to buy both together than to buy them separately? (Lift > 1 indicates a positive association)
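The three measures can be computed directly from the five transactions in the table above. This is a minimal sketch in base R for the rule "diaper, then beer"; the helper `has_item` is a hypothetical name introduced only for this illustration.

```r
# Transactions copied from the market-basket table
baskets <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Hypothetical helper: TRUE for each basket that contains `item`
has_item <- function(item) sapply(baskets, function(b) item %in% b)

support    <- mean(has_item("Diaper") & has_item("Beer"))  # P(Diaper and Beer)
confidence <- support / mean(has_item("Diaper"))           # P(Beer | Diaper)
lift       <- confidence / mean(has_item("Beer"))          # confidence / P(Beer)

print(c(support = support, confidence = confidence, lift = lift))
```

For these five baskets this gives support 0.6, confidence 0.75 and lift 1.25, so diaper buyers are more likely than the average customer to buy beer.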
13. Supervised Learning
• Supervised Learning is a Machine Learning task of inferring a generalized function
from labelled training data. Training data includes desired outputs.
Example: Spam Detection , Credit Scoring , Face Detection
• In Supervised Learning for spam detection we have
• Email Contents with Labels marking Spam or Non Spam
• Task is to label newer emails
• Two main types of Supervised Learning problems
• Regression
• Classification
14. Supervised Learning
• Regression Problems
• Maps input data to a continuous prediction variable
• Example: Predicting retail house prices (price as a continuous variable)
• Classification Problems
• Maps input data to a set of predefined classes
• Example: Benign or Malignant Tumours
15. Regression : House Price Prediction
• We have historic data about the size and price of houses for the last year
• Task is to predict the Price of a House given its Size
• Model Derivation:
Price = Slope of Line * Size + Constant
16. Classification : Credit Scoring
We have labelled data of low and high risk customers.
Task is differentiating between low-risk and high-risk customers from their
income and savings.
Model Derivation:
IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
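The rule above can be sketched as a small R function. The threshold values used for θ1 and θ2 here are hypothetical and chosen only for illustration; in practice a classification algorithm learns them from the labelled data.

```r
# Hypothetical thresholds (theta1 for income, theta2 for savings);
# a real model would learn these from labelled customers
classify_risk <- function(income, savings, theta1 = 30000, theta2 = 10000) {
  ifelse(income > theta1 & savings > theta2, "low-risk", "high-risk")
}

# Three made-up customers
risk <- classify_risk(income  = c(50000, 20000, 45000),
                      savings = c(15000, 20000, 5000))
print(risk)
```

Only the first customer exceeds both thresholds, so only that one is labelled low-risk.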
17. Unsupervised Learning
• Training data does not include desired output.
Task is to find hidden structure in unlabeled data
• Common Approaches to Unsupervised Learning
• Clustering or Segmentation (Customer Segmentation)
• Dimensionality Reduction (PCA (Principal Component Analysis), SVD (Singular Value Decomposition))
• Summarization
18. Unsupervised Learning
• Customer Segmentation: Helps marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs.
• The clustering algorithm
forms 3 different groups of
customers to target.
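A clustering step like this could be sketched with k-means from base R. The customer data below is simulated purely for illustration (the column names `annual_spend` and `visits` are assumptions, not from the slides), and k-means is one possible clustering algorithm, not necessarily the one behind the slide's figure.

```r
set.seed(42)

# Simulated (not real) customers drawn from three rough spending groups
customers <- data.frame(
  annual_spend = c(rnorm(30, 200, 30), rnorm(30, 600, 50), rnorm(30, 1200, 80)),
  visits       = c(rnorm(30, 2, 0.5),  rnorm(30, 6, 1),    rnorm(30, 12, 1.5))
)

# Scale the columns so both contribute equally, then ask for 3 clusters
fit <- kmeans(scale(customers), centers = 3, nstart = 10)

# Segment sizes and the average profile of each segment
print(table(fit$cluster))
print(aggregate(customers, by = list(segment = fit$cluster), FUN = mean))
```

Note that no labels are passed to `kmeans`; the segments come only from the structure of the data.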
19. Reinforcement Learning
• Learning from interaction with the environment to achieve a goal.
Rewards from a sequence of actions.
• Every Action has either a
• Reward OR
• Observation
• Examples
• Self Driving Cars
• Recommender Systems
• Research Link (UT Austin)
https://www.cs.utexas.edu/~eladlieb/RLRG.html
22. Linear Regression
• In statistics, linear regression is an approach for modeling the
relationship between a scalar dependent variable y and one or more
explanatory variables (or independent variables) denoted X
• The case of one explanatory variable is called simple linear
regression
• For more than one explanatory
variable, the process is
called multiple linear regression
https://en.wikipedia.org/wiki/Linear_regression
23. From School Book : Linear Equations
Y = mX + b
m = Slope = Change in Y / Change in X
b = Y-intercept
24. Linear Regression : A Common Example
Ohm’s Law:
• In physics, it is observed that the relationship between Voltage (V), Current (I) and Resistance (R) is a linear relationship expressed as
V = I * R
I = V / R
• In a circuit with a given Resistance R, as you increase the Voltage V, the Current I increases proportionately
http://www.electronics-tutorials.ws/dccircuits/dcp_1.html
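As a quick check of the linear relationship, the sketch below generates noise-free currents from I = V / R and fits a straight line to them. The resistance of 100 ohms and the voltage range are assumed values chosen only for illustration.

```r
resistance <- 100                    # ohms (assumed value for illustration)
voltage    <- seq(1, 10, by = 1)     # volts
current    <- voltage / resistance   # amperes, from I = V / R

# Fit current as a linear function of voltage
fit <- lm(current ~ voltage)

# The slope should equal 1 / resistance and the intercept should be ~0
print(coef(fit))
```

With real measurements there would be noise around the line, which is where regression becomes useful.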
25. Sample Monthly Income-Expense Data of a Household
Monthly Income (in Rs.)   Monthly Expense (in Rs.)
5,000                     8,000
6,000                     7,000
10,000                    4,500
10,000                    2,000
12,500                    12,000
14,000                    8,000
15,000                    16,000
18,000                    20,000
19,000                    9,000
20,000                    9,000
20,000                    18,000
22,000                    25,000
23,400                    5,000
24,000                    10,500
24,000                    10,000
We have to find the relationship between Income and Expenses
of a household
Fitted line: y = 0.3008x + 6319.1, R² = 0.4215
[Figure: Income Vs. Expense scatter plot with the fitted line; x-axis: Monthly Income, y-axis: Monthly Expense]
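The relationship can be estimated in R by fitting a line to the 15 sample rows listed on this slide. This is only a sketch: the column names `income` and `expense` are assumptions, and because the plotted axes run up to 60,000 the slide's equation was presumably fitted on more data than the sample rows, so the coefficients printed here need not match it.

```r
# The 15 sample rows from the slide (amounts in Rs.)
household <- data.frame(
  income  = c(5000, 6000, 10000, 10000, 12500, 14000, 15000, 18000,
              19000, 20000, 20000, 22000, 23400, 24000, 24000),
  expense = c(8000, 7000, 4500, 2000, 12000, 8000, 16000, 20000,
              9000, 9000, 18000, 25000, 5000, 10500, 10000)
)

# Fit expense as a linear function of income
fit <- lm(expense ~ income, data = household)

print(coef(fit))               # intercept and slope of the fitted line
print(summary(fit)$r.squared)  # share of expense variance explained by income
```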
26. Line of Best Fit
[Figure: Income Vs. Expense scatter plot with candidate lines; x-axis: Monthly Income, y-axis: Monthly Expense]
Which of these lines best
describe the relationship
between Household Income
and Expenses ?
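27. Line of Best Fit
[Figure: Income Vs. Expense scatter plot with the fitted line and the error terms marked, e.g. Error $e_m = y_m - \hat{y}_m$ and Error $e_n$; x-axis: Monthly Income, y-axis: Monthly Expense]
The Line of Best Fit will be the one where the Sum of Squares of Error (SSE) term is minimum (OLS Technique).
The sample regression equation is
$$\hat{Y}_i = b_0 + b_1 X_i$$
$$SSE = \sum_{i=1}^{n} \hat{e}_i^2 \tag{1}$$
$$= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \tag{2}$$
$$= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \tag{3}$$
Using calculus we get
$$b_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}$$
$$b_0 = \frac{\sum Y_i - b_1 \sum X_i}{n}$$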
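The closed-form OLS expressions for b0 and b1 can be checked numerically. The sketch below applies them to R's built-in `cars` data set (the one used in the R slides later in this deck) and compares the result with what `lm()` reports.

```r
x <- cars$speed   # explanatory variable
y <- cars$dist    # response variable
n <- length(x)

# Closed-form least-squares estimates
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) - b1 * sum(x)) / n

print(c(b0 = b0, b1 = b1))

# The same coefficients as estimated by lm()
print(coef(lm(dist ~ speed, data = cars)))
```

Both approaches should agree up to floating-point rounding.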
28. Least Squares
• ‘Best Fit’ means the difference between actual Y values and predicted Y values is a minimum. But positive differences offset negative ones, so we square the errors!
• LS minimizes the Sum of the Squared Differences (errors) (SSE)
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{e}_i^2$$
29. Simple Linear Regression in R
### CODE SNIPPET ###
?cars
# Investigating the basics of the data set
str(cars)
attributes(cars)
30. Examining the data
### CODE SNIPPET ###
# Summaries of the speed and distance values. Any NAs?
summary(cars)
# Is there a correlation between speed and distance to stop?
cor(cars$speed, cars$dist)
31. Plotting the data
### CODE SNIPPET ###
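# Scatter plot of speed vs. stopping distance, then the same with a smoothed trend
plot(cars, main = "Relationship between Speed and Distance to Stop")
scatter.smooth(cars, lpars = list(col = "red", lwd = 3, lty = 3))
# Box plot to spot outliers, and the density of the speed values
boxplot(cars$dist, main = "Outliers for Distance")
plot(density(cars$speed), main = "Density Distribution of Speed",
     type = "h", col = "blue")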
32. Basic Linear Model
### CODE SNIPPET ###
# Fit stopping distance as a linear function of speed
linear_model <- lm(dist ~ speed, data = cars)
summary(linear_model)
33. Coefficient Analysis
• Coefficient - Estimate
• The Y intercept is -17.5791
• For every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324 feet.
• Coefficient - Standard Error
• The coefficient Standard Error estimates how much the coefficient estimate would vary from sample to sample. We'd ideally want it to be small relative to the coefficient itself.
• Coefficient - t value
• The coefficient t-value measures how many standard errors our coefficient estimate is away from 0. We want it to be far from zero, as this indicates we can reject the null hypothesis, that is, we can declare that a relationship between speed and distance exists. t-values are also used to compute p-values.
• Coefficient - Pr(>|t|)
• A small p-value for the slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between speed and distance. (A small p-value for the intercept only says that the intercept differs from zero.)
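To make the link between these columns concrete, the sketch below rebuilds the t-values and p-values of the `cars` model from the estimates and standard errors. The names `t_manual` and `p_manual` are introduced only for this illustration.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(linear_model))

# t value = estimate divided by its standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(linear_model),
                   lower.tail = FALSE)

print(cbind(coefs, t_manual, p_manual))
```

The two added columns should reproduce the `t value` and `Pr(>|t|)` columns that `summary()` prints.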
36. Residual Standard Error
• Residual Standard Error is a measure of the quality of a linear regression fit.
• The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line.
• In our example, the actual distance required to stop can deviate from the true regression line by approximately 15.3795867 feet, on average (roughly 4 times the slope of 3.93).
• The Residual Standard Error was calculated with 48 degrees of freedom. Simplistically, degrees of freedom are the number of data points that went into the estimation, minus the number of parameters estimated (here 50 observations - 2 coefficients = 48).
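The Residual Standard Error can be recomputed by hand from the residuals and the degrees of freedom. A minimal sketch on the `cars` model follows; the variable names `dof` and `rse` are assumptions made for this example.

```r
linear_model <- lm(dist ~ speed, data = cars)

res <- residuals(linear_model)
dof <- nrow(cars) - length(coef(linear_model))  # 50 observations - 2 coefficients

# Residual standard error: square root of SSE divided by degrees of freedom
rse <- sqrt(sum(res^2) / dof)

print(rse)
print(summary(linear_model)$sigma)  # the value reported by summary()
```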
37. Coefficient of Determination
• In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s)
• The R² we get is 0.6511. Roughly 65% of the variance found in the response variable (distance) can be explained by the predictor variable (speed)
• What counts as a good R² value depends on the domain. Adjusted R² is used for multiple linear regression, since it penalizes additional predictors
https://en.wikipedia.org/wiki/Coefficient_of_determination
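R² can also be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for the `cars` model; `sse`, `sst` and `r2` are names chosen for this example.

```r
linear_model <- lm(dist ~ speed, data = cars)

sse <- sum(residuals(linear_model)^2)          # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)    # total variation in dist

r2 <- 1 - sse / sst

print(r2)
print(summary(linear_model)$r.squared)  # the value reported by summary()
```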
38. F Statistic & P Value
• The F statistic is an indicator of whether there is a relationship between our predictor and the response variable
• An F statistic much greater than 1 suggests we can reject the null hypothesis: no relation between speed and distance exists
• We can consider a linear model to be statistically significant only when both p-values (the model's and the predictor's) are less than the pre-determined statistical significance level, commonly 0.05
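The F statistic and its p-value can be pulled out of the model summary. This sketch, again on the `cars` model, recomputes the overall p-value from the F distribution; the 0.05 threshold is the conventional level mentioned above, not a fixed rule.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Named vector with the F value and its two degrees of freedom
f <- summary(linear_model)$fstatistic
print(f)

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_value))

# Compare against the chosen significance level
print(unname(p_value) < 0.05)
```

In simple linear regression with one predictor, the F statistic equals the square of the slope's t-value, so the two p-values coincide.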
40. What All Did We Do?
• Examined the data
• Plotted the data
• Created a Simple Linear Regression Model
• Coefficient Analysis
• Residual Analysis
• R² Analysis
• F Statistic
Is the current state of the model good enough to be deployed / used live?
41. Evaluation of Model : Split Train / Test
### CODE SNIPPET ###
## Use 80% of the rows for training
sample_size <- floor(0.80 * nrow(cars))
## Set the seed to make the partition reproducible
set.seed(123)
train_index <- sample(seq_len(nrow(cars)), size = sample_size)
train <- cars[ train_index, ]
test <- cars[-train_index, ]
## Fit on the training rows, then predict the held-out test rows
linear_model_subset <- lm(dist ~ speed, data = train)
distPred <- predict(linear_model_subset, test)
summary(linear_model_subset)
## Predicted vs. actual stopping distance on the test set
plot(distPred, test$dist)
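42. RMSE : To compare between models
### CODE SNIPPET ###
# Root mean squared error of a vector of prediction errors
rmse <- function(error)
{
  sqrt(mean(error^2))
}
# distPred holds the test-set predictions from the previous slide
print(rmse(test$dist - distPred))
• RMSE : Root Mean Squared Error
• Typical distance between the observed values and the model predictions (large errors weigh more because they are squared)
OR
• How far the residuals are from zero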
43. Food for thought !!!
Is the test / train split model the best
generalization we have ??
.. Covered in Upcoming Sessions