This document provides an introduction to using logistic regression in R to analyze case-control studies. It explains how to download and install R, perform basic operations and calculations, handle data, load libraries, and conduct both conditional and unconditional logistic regression. Conditional logistic regression is recommended for matched case-control studies as it provides unbiased results. The document demonstrates how to perform logistic regression on a lung cancer dataset to analyze the association between disease status and genetic and environmental factors.
2. What is R?
The R statistical programming language is a free,
open-source package.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages extend the functionality to
cutting-edge research methods.
3. Getting Started
Go to www.r-project.org
Downloads: CRAN (Comprehensive R Archive
Network)
Set your mirror: choose a location close to you.
Select your platform: Windows 95 or later, MacOS or UNIX
5. Basic operators and calculations
Comparison operators
equal: ==
not equal: !=
greater/less than: > <
greater/less than or equal: >= <=
Example: 1 == 1 # Returns TRUE
6. Basic operators and calculations
Logical operators
AND: &
x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.
OR: |
x == y | x != y # Returns TRUE where at least one comparison is
TRUE.
NOT: !
!x > y # The '!' sign returns the negation (opposite) of a logical
vector.
7. Basic operators and calculations
Calculations
Four basic arithmetic functions: addition, subtraction,
multiplication and division
1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns results of basic arithmetic
calculations.
Calculations on vectors
x <- 1:10; sum(x); mean(x); sd(x); sqrt(x) # Calculates for
the vector x its sum, mean, standard deviation and element-wise square roots.
x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element
in the vectors x and y.
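The comparison and logical operators above all work element by element on vectors. A minimal sketch, reusing the sample vectors 'x' and 'y' from the slides (the names 'both', 'either' and 'negated' are introduced here only for illustration):

```r
# Sample vectors, as defined in the slides above
x <- 1:10
y <- 10:1

both    <- x > y & x > 5      # element-wise AND
either  <- x == y | x != y    # element-wise OR; always TRUE for these vectors
negated <- !(x > y)           # same result as !x > y, with explicit parentheses

which(both)   # positions where both comparisons are TRUE
sum(both)     # TRUE counts as 1, so this counts the matching elements
all(either)   # TRUE only if every element is TRUE
```

Writing the negation with parentheses avoids relying on operator precedence when reading the code.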
8. R-Graphics
R provides comprehensive graphics utilities for
visualizing and exploring scientific data. It includes:
Scatter plots
Line plots
Bar plots
Pie charts
Heatmaps
Venn diagrams
Density plots
Box plots
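A few of these plot types can be sketched with base R graphics alone. The data below are simulated purely for illustration (they are not from the lung cancer study), and the variable names 'age' and 'group' are assumptions:

```r
# Simulated data, for illustration only
set.seed(1)
age   <- rnorm(100, mean = 60, sd = 8)
group <- factor(sample(c("case", "control"), 100, replace = TRUE))

op <- par(mfrow = c(2, 2))                       # 2 x 2 grid of panels
plot(age, main = "Scatter plot", xlab = "Index", ylab = "Age")
barplot(table(group), main = "Bar plot")         # counts per group
boxplot(age ~ group, main = "Box plot", ylab = "Age")
plot(density(age), main = "Density plot")
par(op)                                          # restore previous settings
```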
9. Data handling in R
Load data: mydata = read.csv("/path/mydata.csv")
See data on screen: print(mydata)
See top part of data: head(mydata)
Specific rows and columns of data:
mydata[1:10,1:3]
To get the class of the data: class(mydata)
Changing class of data: newdata = as.matrix(mydata)
Summary of data: summary(mydata)
Selecting (KEEPING) variables (columns)
newdata = mydata[c(1,3:5)]
10. Data handling in R
Selecting observations
newdata = subset(mydata, age >= 20 | age < 10,
select = c(ID, weight))
newdata = subset(mydata, sex == "Male" & age > 25,
select = weight:income)
Excluding (DROPPING) variables (columns)
newdata = mydata[c(-3,-5)]
mydata$v3 = NULL
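The selection commands above can be tried on a small data frame. This is a minimal sketch; the values in 'mydata' are made up for illustration, and only the column names (ID, sex, age, weight, income) follow the slides:

```r
# Hypothetical data frame with the column names used in the slides
mydata <- data.frame(
  ID     = 1:6,
  sex    = c("Male", "Female", "Male", "Male", "Female", "Male"),
  age    = c(8, 22, 30, 15, 41, 27),
  weight = c(25, 60, 80, 55, 65, 90),
  income = c(0, 1200, 2500, 0, 3100, 2800)
)

# Keep rows with age >= 20 or age < 10, and only the ID and weight columns
sel1 <- subset(mydata, age >= 20 | age < 10, select = c(ID, weight))

# Keep males older than 25, and the columns from weight through income
sel2 <- subset(mydata, sex == "Male" & age > 25, select = weight:income)

# Drop the 3rd and 5th columns (age and income)
dropped <- mydata[c(-3, -5)]
```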
11. R-Library
R offers many tools, distributed as "packages", for different
kinds of analysis, including analysis of genetics and genomics
data.
Depending on where a package is hosted, it can be
installed from one of two sources
Using CRAN (Comprehensive R Archive Network) as:
install.packages("package_name")
Using Bioconductor (BiocManager replaces the retired biocLite script) as:
install.packages("BiocManager")
BiocManager::install("package_name")
12. R-Library
To list and load packages:
library() #Lists all libraries/packages that are available on a system.
library(genetics) #Loads the package for genetics data analysis
library(help=genetics) #Lists all functions/objects of the "genetics"
package
?function_name #Opens documentation of a function
13. What is Logistic Regression?
Logistic regression describes the relationship between
a dichotomous response variable and a set of
explanatory variables.
Logistic regression is often used because the
relationship between the dependent variable (DV, a discrete
variable) and a predictor is non-linear.
14. A General Model:
Logistic Regression
$$\operatorname{logit}(p_{disease}) = \log\left(\frac{p_{disease}}{1 - p_{disease}}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_J X_J$$
Where:
$p_{disease}$ is the probability that an individual has a particular
disease.
$\beta_0$ is the intercept
$\beta_1, \beta_2, \dots, \beta_J$ are the coefficients (effects) of genetic factors
$X_1, X_2, \dots, X_J$ are the variables of genetic factors
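The logit and its inverse can be checked numerically. A minimal sketch with one predictor; the coefficient values and the helper names 'logit' and 'inv_logit' are hypothetical, chosen only to illustrate the formula:

```r
# logit: probability -> log-odds; inv_logit: log-odds -> probability
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical intercept and coefficient, for illustration only
beta0 <- -1.0
beta1 <- 0.7
x1    <- c(0, 1, 2)            # e.g. number of risk alleles

eta <- beta0 + beta1 * x1      # linear predictor (the logit)
p   <- inv_logit(eta)          # probability of disease for each x1

all.equal(logit(p), eta)       # the two functions are inverses
all.equal(p, plogis(eta))      # base R equivalents: plogis() and qlogis()
```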
15. Assumptions
Logistic regression does not make any assumptions
of normality, linearity, and homogeneity of variance
for the independent variables.
Because it does not impose these requirements, it is
preferred to discriminant analysis when the data does
not satisfy these assumptions.
16. Questions ??
What is the relative importance of each predictor variable?
How does each predictor variable affect the outcome?
Does a predictor variable make the solution better or
worse or have no effect?
Are there interactions among predictors?
Does adding interactions among predictors
(continuous or categorical) improve the model?
What is the strength of association between the outcome
variable and a set of predictors?
In model comparison, a non-significant difference between
models is often the desired result, so the strength of
association is reported even for non-significant effects.
17. Types of Logistic Regression
Unconditional logistic regression
Conditional logistic regression
** Rule of thumb
Use conditional logistic regression if matching has been done,
and unconditional if there has been no matching.
When in doubt, use conditional because it always gives
unbiased results. The unconditional method is said to
overestimate the odds ratio if it is not appropriate.
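The two approaches can be sketched side by side. This example assumes the 'survival' package (shipped with standard R installations) and its clogit() function; the data are simulated only to show the syntax, and the names 'sim', 'status', 'matset' and 'exposure' are hypothetical:

```r
library(survival)   # provides clogit() and strata()

# Simulated 1:1 matched sets, for illustration of syntax only
set.seed(42)
n_sets   <- 200
matset   <- rep(seq_len(n_sets), each = 2)      # matched-set identifier
status   <- rep(c(1, 0), times = n_sets)        # 1 = case, 0 = control
exposure <- rbinom(2 * n_sets, 1, ifelse(status == 1, 0.6, 0.4))
sim      <- data.frame(status, matset, exposure)

# Unconditional logistic regression: ignores the matching
fit_uncond <- glm(status ~ exposure, family = binomial, data = sim)

# Conditional logistic regression: conditions on each matched set
fit_cond <- clogit(status ~ exposure + strata(matset), data = sim)

exp(coef(fit_uncond)["exposure"])   # odds ratio, unconditional
exp(coef(fit_cond))                 # odds ratio, conditional
```

The strata() term is what tells clogit() which observations belong to the same matched set.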
18. Data Format
Status  Matset  Se_Quartiles  GPX1  GPX4  SEP15  TXN2
1       1       <60           CT    TT    AG     AG
0       1       >60 – 70      CC    CC    GG     GG
1       2       <60           TT    CC    AG     AA
0       2       >70 – 80      CC    CT    GG     GG
1       3       >80           CC    CC    AA     AA
0       3       >60 – 70      CT    TT    GG     GG
1       4       <60           CC    CC    AA     AG
0       4       >70 – 80      TT    TT    GG     GG
1       5       >80           CC    CC    AG     AA
0       5       <60           CC    CC    GG     GG
1       6       >70 – 80      CT    TT    AA     AA
0       6       >80           CC    CC    GG     AG
1       7       >60 – 70      TT    CC    AA     AG
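As a sketch of how this layout looks inside R, the first four rows and four of the columns shown above can be entered by hand. The object name 'lung_demo' is hypothetical; genotypes are stored as factors, so the first level in alphabetical order becomes the default reference category:

```r
# First four rows of the table above (Status, Matset, GPX1, GPX4 only)
lung_demo <- data.frame(
  Status = c(1, 0, 1, 0),            # 1 = case, 0 = control
  Matset = c(1, 1, 2, 2),            # matched-set identifier
  GPX1   = c("CT", "CC", "TT", "CC"),
  GPX4   = c("TT", "CC", "CC", "CT"),
  stringsAsFactors = TRUE            # store genotypes as factors
)

str(lung_demo)                            # structure: types and levels
levels(lung_demo$GPX1)                    # first level is the reference
table(lung_demo$Matset, lung_demo$Status) # one case and one control per set
```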
19. Data and Library loading
Load and use data in R (Using Lung cancer data from
PLoS One 2013, 8(3):e59051).
lung = read.csv("/path/lung.csv", sep = "\t", header = TRUE)
Load the library and use data for analysis
library(epicalc)
use(lung)
26. Something More
Changing the default reference
GPX1 = relevel(GPX1, ref = "TT")
pack()
Saving the result
result = clogistic.display(clogit_lung)
write.csv(result$table, file = "path/result.csv")
write.table(result$table, file = "path/result.xls", sep = "\t")
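The saving step can be tried without any fitted model. In this sketch the table 'result_table' is a placeholder with made-up numbers (it is not output from the lung cancer analysis), and temporary files stand in for real paths:

```r
# Placeholder results table; the values are made up for illustration
result_table <- data.frame(
  term       = c("GPX1: CT vs CC", "GPX1: TT vs CC"),
  odds_ratio = c(1.5, 2.0)
)

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".txt")

# Comma-separated file; write.csv() fixes the separator itself
write.csv(result_table, file = csv_path, row.names = FALSE)

# Tab-separated file; here the separator is set explicitly
write.table(result_table, file = tsv_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

back <- read.delim(tsv_path)   # read the tab-separated file back in
```

write.csv() does not accept a 'sep' argument, so a tab-delimited file needs write.table().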
27. Summary: regression models
Regression models can be used to describe the
average effect of predictors on outcomes in your data
set.
They can tell how likely it is that an effect is due
to chance alone.
They can look at each predictor “adjusting for” the
others (estimating what would happen if all others
were held constant.)
28. Thanks to,
Prof. Virasakdi Chongsuvivatwong
Epidemiology Unit,
Faculty of Medicine,
Prince of Songkla University, Thailand
Editor's notes
Coefficients are calculated by maximum likelihood estimation (MLE).
In order to test hypotheses in logistic regression, we have used the likelihood ratio test and the Wald test.
If a confidence interval for a difference in means includes 0, we can say that there is no significant difference between the means of the two populations at the given level of confidence. The width of the confidence interval gives us some idea of how uncertain we are about the difference in the means. A very wide interval may indicate that more data should be collected before anything definite can be said. For a ratio measure such as the odds ratio, a confidence interval that includes 1.0 means that the association between the exposure and the outcome could have been found by chance alone and that the association is not statistically significant.
Specifying the binomial family sets the choice of variance and link functions: the variance is binomial and the link is the logit function.
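These notes can be illustrated with base R. A minimal sketch on simulated data (the variables 'exposure' and 'status' and the true coefficients are assumptions made for the example): the model is fitted by maximum likelihood with the binomial family and logit link, odds ratios with Wald confidence intervals are obtained by exponentiating, and a likelihood ratio test compares the model against an intercept-only model.

```r
# Simulated data, for illustration only
set.seed(7)
exposure <- rbinom(300, 1, 0.5)
status   <- rbinom(300, 1, plogis(-0.5 + 0.8 * exposure))

# Binomial variance with logit link; coefficients estimated by MLE
fit <- glm(status ~ exposure, family = binomial(link = "logit"))

or <- exp(coef(fit))              # odds ratios
ci <- exp(confint.default(fit))   # Wald 95% confidence intervals
cbind(OR = or, ci)                # an interval containing 1 is not significant

summary(fit)$coefficients         # Wald z tests for each coefficient

# Likelihood ratio test against the intercept-only model
null_fit <- glm(status ~ 1, family = binomial)
anova(null_fit, fit, test = "Chisq")
```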