Data Mining with R
Regression models
Hamideh Iraj
Hamideh.iraj@ut.ac.ir
Slides Reference

This is a curation from:
Data Analysis Course
Weeks 4-5-6
https://www.coursera.org/course/dataanalysis
Galton Data – Introduction
library(UsingR)
data(galton)
head(galton)
tail(galton)
dim(galton)
str(galton)
summary(galton)
summary(galton$child)
Galton Data - Plotting
par(mfrow=c(1,2))
hist(galton$child,col="blue",breaks=100)

hist(galton$parent,col="blue",breaks=100)
Galton Data – Plotting – cont.
pairs(galton)
What is Regression Analysis?
Regression analysis is a statistical process for estimating the
relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or
more independent variables.

http://en.wikipedia.org/wiki/Regression_analysis
Fitting a line

plot(galton$parent, galton$child, pch=19, col="blue")   # predictor on x, response on y
lm1 <- lm(child ~ parent, data=galton)
lines(galton$parent, lm1$fitted, col="red", lwd=3)      # lwd sets the line width
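The fitted line can be checked against the closed-form least-squares formulas. A minimal sketch, assuming the built-in cars data set as a stand-in for galton (so it runs without the UsingR package); the variable names are illustrative and not from the original slides.

```r
# Stand-in data: built-in 'cars' (speed, dist); an assumption, not the slides' data
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for y = b0 + b1 * x
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

fit <- lm(dist ~ speed, data = cars)

# Both approaches should agree up to floating-point error
print(c(intercept = b0, slope = b1))
print(coef(fit))
```

The same two formulas apply to galton with x = parent and y = child.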
Plot Residuals
plot(galton$parent,lm1$residuals,col="blue",pch=19)
abline(c(0,0),col="red",lwd=3)   # horizontal reference line at zero
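Residuals are simply observed minus fitted values. A short sketch, again assuming the built-in cars data as a stand-in, showing two properties that explain why the residual plot is centered on the zero line.

```r
# Stand-in model on built-in data (an assumption, not the slides' galton fit)
fit <- lm(dist ~ speed, data = cars)

# Residual = observed response - fitted value
res <- cars$dist - fitted(fit)

# With an intercept in the model, least-squares residuals sum to zero
# and are uncorrelated with the predictor (up to floating-point error)
sum_res <- sum(res)
cor_res <- cor(res, cars$speed)
print(c(sum = sum_res, cor_with_x = cor_res))
```

A visible trend or funnel shape in the residual plot would suggest the straight-line model is missing structure.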
Linear Model Coefficients
summary(lm1)
coef(lm1)
Why care about model Accuracy?

http://en.wikipedia.org/wiki/Linear_regression
Model Accuracy Measures
P-value
Confidence Interval
R²
Adjusted R²
P-value
Most Common Measure of Statistical Significance
Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?
Some typical values (single test):
- P < 0.05 (significant)
- P < 0.01 (strongly significant)
- P < 0.001 (very significant)
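For a regression coefficient, "nothing is going on" means a true slope of zero. A minimal sketch of how the reported p-value arises from the t statistic, assuming the built-in cars data as a stand-in for the slides' model.

```r
# Stand-in model on built-in data (an assumption, not the slides' galton fit)
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# t statistic: estimate divided by its standard error
t_stat <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from the t distribution with residual degrees of freedom
p_manual <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(p_manual)
print(coefs["speed", "Pr(>|t|)"])   # value reported by summary()
```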
Confidence intervals
A confidence interval is a type of interval estimate of a population
parameter and is used to indicate the reliability of an estimate.
confint(lm1,level=0.95)

http://en.wikipedia.org/wiki/Confidence_interval
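What confint() returns can be reproduced by hand as estimate ± t quantile × standard error. A sketch under the assumption that the built-in cars data stands in for galton.

```r
# Stand-in model on built-in data (an assumption, not the slides' galton fit)
fit <- lm(dist ~ speed, data = cars)

est   <- coef(fit)
se    <- sqrt(diag(vcov(fit)))               # standard errors of the coefficients
tcrit <- qt(0.975, df = df.residual(fit))    # two-sided 95% critical value

ci_manual  <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci_builtin <- confint(fit, level = 0.95)

print(ci_manual)
print(ci_builtin)
```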
R²
R²: the proportion of response variation "explained" by the
regressors in the model.
R² = 1: the fitted model explains all variability in the response y.
R² = 0: no linear relationship between the response variable and the
regressors (for straight-line regression, this means the fitted line
is constant: slope = 0, intercept = \bar{y}).

http://en.wikipedia.org/wiki/Coefficient_of_determination
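The definition translates directly into sums of squares: R² = 1 - SS_res / SS_tot. A minimal sketch, assuming the built-in cars data as a stand-in.

```r
# Stand-in model on built-in data (an assumption, not the slides' galton fit)
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist

ss_res <- sum(residuals(fit)^2)     # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)      # total variation around the mean

r2 <- 1 - ss_res / ss_tot
print(r2)
print(summary(fit)$r.squared)       # value reported by summary()
```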
Adjusted R²
The adjusted R² (often written as \bar{R}^2 and pronounced
"R bar squared") attempts to account for R² automatically and
spuriously increasing when extra explanatory variables are added
to the model.

http://en.wikipedia.org/wiki/Coefficient_of_determination
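A sketch of the effect, assuming the built-in mtcars data and a deliberately meaningless predictor (the noise column and the adj_r2 helper are hypothetical, introduced only for illustration). Plain R² cannot decrease when a regressor is added; the adjusted version penalizes the extra parameter.

```r
set.seed(1)
d <- mtcars
d$noise <- rnorm(nrow(d))            # predictor unrelated to the response

fit1 <- lm(mpg ~ wt, data = d)
fit2 <- lm(mpg ~ wt + noise, data = d)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of regressors excluding the intercept
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

print(c(r2_fit1 = summary(fit1)$r.squared, r2_fit2 = summary(fit2)$r.squared))
print(c(adj_fit1 = adj_r2(fit1), adj_fit2 = adj_r2(fit2)))
```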
Predicting with Linear Regression
coef(lm1)[1] + coef(lm1)[2]*80

newdata <- data.frame(parent=80)
predict(lm1,newdata)
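predict() can also attach an interval to the point prediction. A minimal sketch, assuming the built-in cars data as a stand-in; the value 20 is an arbitrary illustrative input.

```r
# Stand-in model on built-in data (an assumption, not the slides' galton fit)
fit <- lm(dist ~ speed, data = cars)
newdata <- data.frame(speed = 20)

# Point prediction by hand: intercept + slope * x
manual <- coef(fit)[1] + coef(fit)[2] * 20

# Point prediction plus a 95% prediction interval for a new observation
pred <- predict(fit, newdata, interval = "prediction", level = 0.95)

print(manual)
print(pred)    # columns: fit, lwr, upr
```

Using interval = "confidence" instead gives the narrower interval for the mean response at that input.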
Multivariate Linear Regression
WHO childhood hunger data
Dataset:
http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*
hunger <- read.csv("./hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes", ]
Multivariate Linear Regression – cont.

# Same slopes: one common Year slope, separate intercepts by Sex
lmBoth <- lm(Numeric ~ Year + Sex, data=hunger)
# Different slopes: the Sex*Year interaction lets the Year slope differ by Sex
lmBoth2 <- lm(Numeric ~ Year + Sex + Sex*Year, data=hunger)
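The difference between the two models shows up in their coefficients. A sketch assuming the built-in ToothGrowth data as a stand-in for the hunger file (dose plays the role of Year, supp the role of Sex).

```r
# Stand-in data: built-in 'ToothGrowth' (len, supp, dose); an assumption
same_slope <- lm(len ~ dose + supp, data = ToothGrowth)
diff_slope <- lm(len ~ dose * supp, data = ToothGrowth)  # dose + supp + dose:supp

# Same slopes: one dose coefficient, plus an intercept shift for supp
print(coef(same_slope))

# Different slopes: the dose:supp term is the change in slope between groups
print(coef(diff_slope))

# F test of whether the interaction term improves the fit
print(anova(same_slope, diff_slope))
```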
Model Selection
step(lmBoth2)
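step() searches for a model with lower AIC by dropping (and, with a scope, adding) terms. A minimal sketch, assuming the built-in mtcars data and an arbitrary starting set of predictors chosen only for illustration.

```r
# Stand-in starting model on built-in data (an assumption, not the hunger model)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward stepwise selection by AIC; trace = 0 suppresses the step-by-step log
selected <- step(full, trace = 0)

print(formula(selected))
print(c(AIC_full = AIC(full), AIC_selected = AIC(selected)))
```

Stepwise selection is a heuristic; the selected model is worth checking against subject knowledge rather than taken as final.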
Regression with Factor Variables
- Outcome is still quantitative
- Covariate(s) are factor variables
- Fitting lines = fitting means
- Want to evaluate contribution of all factor levels at once
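"Fitting lines = fitting means" can be verified directly: with a single factor covariate, the coefficients reproduce the group means. A sketch assuming the built-in PlantGrowth data as a stand-in for the movies file.

```r
# Stand-in data: built-in 'PlantGrowth' (weight, group); an assumption
fit <- lm(weight ~ group, data = PlantGrowth)

# Group means computed directly
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Intercept = mean of the reference level; other coefficients = differences from it
b <- coef(fit)
means_from_fit <- c(ctrl = b[["(Intercept)"]],
                    trt1 = b[["(Intercept)"]] + b[["grouptrt1"]],
                    trt2 = b[["(Intercept)"]] + b[["grouptrt2"]])

print(group_means)
print(means_from_fit)

# One F test for the contribution of all factor levels at once
print(anova(fit))
```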
Regression with Factor Variables – cont.
Dataset: http://www.rossmanchance.com/iscam2/data/movies03RT.txt
# sep="\t": the file is tab-delimited; quote="" disables quote handling
movies <- read.table("./movies.txt",sep="\t",header=TRUE,quote="")
head(movies)
Regression with Factor Variables – cont.
lm2 <- lm(score ~ as.factor(rating), data=movies)
summary(lm2)