An introduction to logistic regression for physicians, public health students and other health workers. Logistic regression is a way to look at the effect of a numeric independent variable on a binary (yes-no) dependent variable. For example, you can analyze or model the effect of birth weight on survival.
2. Logistic Regression
A way to look at the effect of
– a “numeric” (interval or ratio) independent variable
on
– a binary (yes-no) dependent variable
3. Review: Linear Regression
Dependent variable is continuous: interval or ratio (numeric)
Independent variables are also interval or ratio
Examples
– Effect of weight on blood pressure
– Effect of drug dose on reticulocyte count
6. Logistic Regression
Dependent variable is a binary (yes/no) outcome.
Independent variables are continuous (interval or ratio).
Examples:
– Relation of weight and BP to 10 year risk of death
– Relation of CD4 count to 1 year risk of AIDS diagnosis
7. Why do we need it?
Could use categorical analysis such as a frequency table
                    AIDS    No AIDS
CD4 > 350           80      20
150 < CD4 < 350     50      50
CD4 < 150           20      80
Problems
a) Some information is lost when we collapse the numeric data into categories. This leads to loss of power.
b) No estimate of the magnitude of the relation
8. Odds
Probability:
– p = probability of the event
– 1 - p = probability of not having the event (also called q)
– p varies from 0 to 1
Odds
– Ratio of the probability of the event to the probability of not having the event: Odds = p/(1 - p)
– When p = 0.5, odds = 1 (or “1:1 odds”)
– When p = 0.1, odds = 0.1/0.9 = 0.11
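The conversion between probability and odds can be sketched in a few lines of Python. This is only an illustration added to the slide material; the function names `odds` and `probability_from_odds` are made up for this example.

```python
def odds(p: float) -> float:
    """Odds of an event that has probability p (requires 0 <= p < 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be at least 0 and less than 1")
    return p / (1.0 - p)


def probability_from_odds(odds_value: float) -> float:
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds_value / (1.0 + odds_value)


# The two cases from the slide, plus one more for comparison.
for p in (0.1, 0.5, 0.8):
    print(f"p = {p:.1f}  odds = {odds(p):.2f}")
```

Note that odds can take any value from 0 upward, while a probability is confined to the range 0 to 1.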
9. Log Odds (Logit)
The log odds (also called “logit”) is simply the natural logarithm of the odds:
logit = ln(odds)
      = ln(p/(1-p))
      = ln(p) - ln(1-p)
ln(1) = 0, so the logit is 0 when the odds are 1:1, or probability = 50%
The logit for an event of probability p is the negative of the logit for the probability of not having the event: logit(1-p) = -logit(p).
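A small Python sketch of the logit and its symmetry around p = 0.5 follows. It is an added illustration; the function name `logit` is chosen here and is not taken from any particular statistics package.

```python
import math


def logit(p: float) -> float:
    """Natural log of the odds for a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


print(logit(0.5))                 # odds of 1:1 give a logit of 0
print(logit(0.8), logit(0.2))     # same size, opposite sign
```

The second line shows the symmetry: an event with probability 0.8 and its complement with probability 0.2 have logits of equal magnitude and opposite sign.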
10. Relation between probability p and logit
[Figure: probability p (vertical axis, 0 to 1) plotted against logit = ln[p/(1-p)] (horizontal axis, -8 to 8)]
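The relation in the figure can be tabulated by inverting the logit, p = 1/(1 + e^(-logit)). The sketch below is an added illustration over the same logit range as the plot; the name `inverse_logit` is an assumption of this example.

```python
import math


def inverse_logit(z: float) -> float:
    """Probability that corresponds to a given logit (log odds) z."""
    return 1.0 / (1.0 + math.exp(-z))


# Same horizontal range as the figure: logit from -8 to 8 in steps of 2.
for z in range(-8, 9, 2):
    print(f"logit = {z:+d}  p = {inverse_logit(z):.3f}")
```

The printed values trace an S-shaped curve: p stays close to 0 for strongly negative logits, passes through 0.5 at logit 0, and approaches 1 for large positive logits.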
11. Logistic regression model
The linear regression model with one variable is
y = a + bx + e
The logistic regression model with one variable is
logit = a + bx + e
where
logit = ln(p/(1-p))
12. Logistic regression model
The logistic regression model with one variable is
logit = a + bx, where logit = ln(p/(1-p))
In other words, the model says the log odds of the event happening are
– a constant (a)
– plus some other constant (b)
– times a numeric risk factor (x) (for example, SBP)
13. Logistic regression model
Given values of the independent variables, the regression equation predicts the log odds (logit) of the event.
14. Logistic regression model
The statistics program calculates the coefficient b
The coefficient b shows how much the log odds change with a one-unit change in the independent variable
Positive b: higher risk with higher values of x
Negative b: lower risk with higher values of x
15. Logistic regression model
Hypothetical example given above, examining the relation of BP to risk of stroke/death. The model predicts:
ln(odds) = c + b ∙ SBP
e^(ln(odds)) = e^(c + b ∙ SBP)
odds = e^(c + b ∙ SBP)
     = e^c ∙ e^(b ∙ SBP)
16. Logistic regression model
The coefficient b shows how much the odds change with a change in the independent variable
odds = e^c ∙ e^(bx)
In other words,
odds = something ∙ (e^b)^x
17. Logistic regression model
odds = constant ∙ (e^b)^x
So e^b is the factor indicating the effect of x on the event.
Each one-unit change in x will multiply the odds by a factor of e^b. This factor is the odds ratio for a one-unit change in x.
18. Logistic regression model
odds = constant ∙ (e^b)^x
– Suppose b = 0.693, so e^b = 2
– A one-unit change in x will double the odds.
– Suppose b = -0.693, so e^b = 0.5
– A one-unit change in x will halve the odds.
– If b = 0, e^b = 1, and x has no effect on the odds (OR = 1)
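The three cases on this slide can be checked numerically. The following is a minimal added sketch; `odds_multiplier` is a hypothetical helper name, and the optional `delta_x` argument extends the idea to changes of more than one unit.

```python
import math


def odds_multiplier(b: float, delta_x: float = 1.0) -> float:
    """Factor by which the odds are multiplied when x increases by delta_x."""
    return math.exp(b * delta_x)


for b in (0.693, -0.693, 0.0):
    print(f"b = {b:+.3f}  e^b = {odds_multiplier(b):.2f}")

# A two-unit change applies the factor twice: (e^b)^2.
print(f"b = 0.693, two units: {odds_multiplier(0.693, 2):.2f}")
```

Because the effect is multiplicative, a two-unit change with b = 0.693 roughly quadruples the odds rather than adding a fixed amount.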
19. Logistic regression model
For the hypothetical example above, the report is given by Epi Info as
Term      Odds Ratio   95% CI            Coeff     S. E.    Z        P
BP        1.0597       1.022   1.098     0.0579    0.0185   3.131    0.0017
Const     *            *       *         -7.201    2.2994   3.131    0.0017
20. Logistic regression model
Term       Odds Ratio   95% CI            Coefficient   S. E.    Z        P-value
BP         1.0597       1.022   1.098     0.0579        0.018    3.131    0.0017
Constant   *            *       *         -7.2014       2.299    3.131    0.0017
Coefficient, or beta, or b, is the slope or magnitude of the effect.
21. Logistic regression model
Term       Odds Ratio   95% CI              Coefficient   S. E.    Z         P-value
BP         1.0597       1.0220   1.0987     0.0579        0.0185   3.1319    0.0017
Constant   *            *        *          -7.2014       2.2994   3.1319    0.0017
Odds ratio for a one-unit change in the independent variable (e.g. BP). This is the calculated e^b.
A one-unit change in BP multiplies the odds by 1.0597.
22. Logistic regression model
Term       Odds Ratio   95% CI            Coeff     S. E.    Z         P-value
BP         1.0597       1.022   1.098     0.0579    0.0185   3.1319    0.0017
Constant   *            *       *         -7.2014   2.2994   3.1319    0.0017
95% confidence interval for that odds ratio.
The confidence interval does not include 1, so the effect is statistically significant.
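The columns of the report are tied together by simple arithmetic, which can be reproduced from the coefficient and its standard error alone. The sketch below is an added illustration and assumes a Wald-type interval, e^(b ± 1.96 ∙ SE), and a two-sided P value from the standard normal distribution; the slides do not state exactly how Epi Info computes these, and small differences from the report come from rounding of the printed coefficient.

```python
import math

coef = 0.0579   # b for BP, as printed in the report
se = 0.0185     # standard error of b, as printed in the report

odds_ratio = math.exp(coef)
ci_low = math.exp(coef - 1.96 * se)    # assumed Wald 95% limits
ci_high = math.exp(coef + 1.96 * se)
z = coef / se
# Two-sided P value under a standard normal: 2 * (1 - Phi(|z|)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"OR = {odds_ratio:.4f}")
print(f"95% CI = {ci_low:.4f} to {ci_high:.4f}")
print(f"Z = {z:.3f}, P = {p_value:.4f}")
```

The results agree with the reported odds ratio, interval, Z and P to about three decimal places, which supports reading the odds ratio column as e^b.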
23. Using more than one independent variable
Single variable:
logit = c + bx
odds = c’ ∙ (e^b)^x
Multiple variables:
logit = c + b_1 x_1 + b_2 x_2 + … + b_n x_n
odds = c’ ∙ (e^(b_1))^(x_1) ∙ (e^(b_2))^(x_2) ∙ … ∙ (e^(b_n))^(x_n)
where c’ = e^c
Note that the terms multiply their effect on the odds.
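That the additive model on the logit scale is the same as a multiplicative model on the odds scale can be verified numerically. In the added sketch below, the constant, coefficients and values are hypothetical numbers chosen only for illustration; they do not come from any study.

```python
import math

c = -4.0            # hypothetical constant
b = [0.02, 0.7]     # hypothetical coefficients b_1, b_2
x = [140.0, 1.0]    # hypothetical values, e.g. SBP = 140 and smoker = 1

# Additive form on the logit scale.
logit = c + sum(b_i * x_i for b_i, x_i in zip(b, x))
odds_from_logit = math.exp(logit)

# Multiplicative form on the odds scale: c' * (e^b_1)^x_1 * (e^b_2)^x_2.
odds_from_product = math.exp(c)
for b_i, x_i in zip(b, x):
    odds_from_product *= math.exp(b_i) ** x_i

print(f"odds from logit:   {odds_from_logit:.4f}")
print(f"odds from product: {odds_from_product:.4f}")
```

Both routes give the same odds, which is why each variable's e^b can be read as a separate multiplier.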
24. Using more than one independent variable
Analysis reports a b coefficient for each independent variable.
That coefficient is the effect of the given independent variable, separated from the effects of all the other independent variables.
25. Real Life Example
Prospective cohort study of causes of cardiac disease: Evans County Study 1965
Independent variables = age, gender, race, social index, SBP, diabetes, smoking, cholesterol, and an obesity index
Dependent variable = risk of dying during 10 year period
26.
Variable        Range        b coeff   SE       p
Constant                     -6.376    1.634    <0.001
Age             40-69 y      0.086     0.115    <0.001
Gender          0=m, 1=f     1.500     0.967    0.121
Age x gender                 -0.043    0.017    0.011
Social index    20-84        -0.056    0.040    0.160
(Soc ind)^2     400-7056     0.0006    0.0003   0.082
SBP             88-310       0.019     0.002    <0.001
Diabetes        0=n, 1=y     1.123     0.261    <0.001
Smoking         0=n, 1=y     0.317     0.157    0.043
Cholesterol     94-546       0.0031    0.0015   0.041
Quetelet        2.11-8.76    -1.064    0.432    0.014
(Quetelet)^2    4.44-76.8    0.112     0.049    0.022
Cited in Kelsey et al., Methods in Observational Epidemiology, 1986
28. Statistical Significance
The p value indicates statistical significance
Age is positively associated with risk of death
Gender has a positive b coefficient, but the p value is 0.12, indicating that we cannot say that there is a significant relationship.
Variable   Range       b coeff   SE      p
Age        40-69 y     0.086     0.115   <0.001
Gender     0=m, 1=f    1.500     0.967   0.121
29. Dichotomous (yes-no) variables
Gender is coded as 0 for male, 1 for female
e^b [e^1.5 = 4.48] is the factor by which the odds change for a 1 unit change in gender, i.e. the OR for females relative to males
e^b for any dummy variable (coded 0-1) is the adjusted OR for that risk factor, since “1 unit of change” = presence vs. absence of the risk factor
Variable   Range       b coeff   SE      p
Constant               -6.376    1.634   <0.001
Age        40-69 y     0.086     0.115   <0.001
Gender     0=m, 1=f    1.500     0.967   0.121
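The same conversion applies to every 0-1 coded variable in the Evans County table. The added sketch below exponentiates the b coefficients for gender, diabetes and smoking as listed in that table; the dictionary layout and variable names are choices made for this example.

```python
import math

# b coefficients for the 0-1 coded variables, copied from the table.
dummy_coefficients = {
    "Gender (female vs. male)": 1.500,
    "Diabetes (yes vs. no)": 1.123,
    "Smoking (yes vs. no)": 0.317,
}

# e^b is the adjusted odds ratio for presence vs. absence of the factor.
adjusted_odds_ratios = {
    name: math.exp(b) for name, b in dummy_coefficients.items()
}

for name, odds_ratio in adjusted_odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}")
```

Each value is adjusted for the other variables in the model, and the p value in the table still decides whether the odds ratio can be distinguished from 1.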
30. Squared terms
Social index squared is included as well as social index itself.
Squared terms allow for curvilinear relationships, just as in ordinary regression
Variable        Range       b coeff   SE       p
Age x gender                -0.043    0.017    0.011
Social index    20-84       -0.056    0.040    0.160
(Soc ind)^2     400-7056    0.0006    0.0003   0.082
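To see what a squared term does, the combined contribution of social index and its square to the logit can be evaluated across the observed range. This is an added sketch using the two coefficients from the table at face value, even though neither is individually significant there; the function name is hypothetical.

```python
b_linear = -0.056    # coefficient for social index, from the table
b_squared = 0.0006   # coefficient for (social index)^2, from the table


def social_index_contribution(s: float) -> float:
    """Contribution of social index s (and its square) to the logit."""
    return b_linear * s + b_squared * s ** 2


# A quadratic b1*s + b2*s^2 turns where its slope b1 + 2*b2*s is zero.
turning_point = -b_linear / (2 * b_squared)

for s in (20, 40, 60, 84):
    print(f"social index = {s}  contribution = {social_index_contribution(s):+.3f}")
print(f"turning point near s = {turning_point:.1f}")
```

With these coefficients the contribution to the logit falls until a social index of roughly 47 and then rises again, which is the kind of curved relationship a single linear term cannot represent.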
31. Interaction terms
Age and gender are entered into the model as separate terms
Age x gender is included to see whether age has a different effect in males than in females.
Variable       Range                b coeff   SE      p
Age            40-69 y              0.086     0.115   <0.001
Gender         0=m, 1=f             1.500     0.967   0.121
Age x gender   M: 0-0; F: 40-69     -0.043    0.017   0.011
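With the interaction term in the model, the effect of one year of age is b(age) + b(age x gender) ∙ gender, so it differs between the two gender codes. The added sketch below works this out from the coefficients in the table; `age_slope` is a hypothetical helper name.

```python
import math

b_age = 0.086          # from the table
b_age_gender = -0.043  # interaction coefficient, from the table


def age_slope(gender: int) -> float:
    """Change in log odds per year of age; gender coded 0 = male, 1 = female."""
    return b_age + b_age_gender * gender


for label, gender in (("male", 0), ("female", 1)):
    slope = age_slope(gender)
    print(f"{label}: slope = {slope:.3f}, OR per year = {math.exp(slope):.3f}")
```

For males the interaction term is zero (age x 0), so the slope is 0.086 per year; for females it is 0.086 - 0.043 = 0.043 per year, meaning the model estimates a weaker age effect in females.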
32. Interpretation
With binary (dummy) variables, e^b is the odds ratio.
You can compare the strength (slope) of the effect by comparing b.
With numeric variables, b is not a direct measure of strength of effect.
– Example: b is quite small for the effect of BP on mortality, because it is the effect of only a one mmHg change in BP. BP is still an important factor in mortality because there is a wide range in BP.
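The point about units can be made concrete by computing the odds ratio for larger changes in SBP, using b = 0.019 per mmHg from the Evans County table. This is an added sketch; the step sizes are chosen only for illustration.

```python
import math

b_sbp = 0.019   # b coefficient for SBP per 1 mmHg, from the table


def odds_ratio_for_change(b: float, delta: float) -> float:
    """Odds ratio for a change of delta units in a numeric variable."""
    return math.exp(b * delta)


for delta in (1, 10, 20, 40):
    print(f"{delta:>2} mmHg higher SBP: OR = {odds_ratio_for_change(b_sbp, delta):.2f}")
```

A 1 mmHg difference barely moves the odds, but a 40 mmHg difference, well within the observed range of 88 to 310, roughly doubles them.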
33. Interpretation
In a prospective cohort study we can use the logistic regression model to predict the probability of the event given the independent variables. We can also derive relative risk.
In a cross sectional study we only have the odds ratio.
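Predicting a probability means plugging values into the fitted equation and converting the logit back with p = 1/(1 + e^(-logit)). The added sketch below uses the constant and BP coefficient from the hypothetical Epi Info report shown earlier; the SBP values of 120 and 160 are arbitrary inputs chosen for this example, and the relative risk is simply the ratio of the two predicted probabilities.

```python
import math

constant = -7.2014   # from the hypothetical Epi Info report
b_bp = 0.0579        # coefficient for BP from the same report


def predicted_probability(sbp: float) -> float:
    """Predicted probability of the event for a given SBP."""
    logit = constant + b_bp * sbp
    return 1.0 / (1.0 + math.exp(-logit))


p_120 = predicted_probability(120)
p_160 = predicted_probability(160)
relative_risk = p_160 / p_120

print(f"p at SBP 120 = {p_120:.3f}")
print(f"p at SBP 160 = {p_160:.3f}")
print(f"relative risk (160 vs. 120) = {relative_risk:.2f}")
```

The relative risk depends on the two starting probabilities, whereas the odds ratio for the same 40 mmHg difference is the fixed quantity e^(40b).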
34. Selection of variables
Same principle as with ordinary regression
Forward selection: add one variable at a time until there are no more that make a significant difference
Backward selection: start with all, remove one at a time to see if they made a significant contribution
Epi Info has suggestions on how to do this