Logistic Regression

lec10b

Author

Robin Donatello

Published

November 10, 2025

Binary outcome data

Consider an outcome variable \(Y\) with two levels: \(Y = 1\) if the event occurs and \(Y = 0\) if it does not.

Let \(p_{i} = P(y_{i}=1)\).

Two goals:

Assess the impact selected covariates have on the probability of an outcome occurring.
Predict the probability of an event occurring given a certain covariate pattern.

Binary data can be modeled using a logistic regression model.

What are the Odds?

The odds are defined as the probability an event occurs divided by the probability it does not occur: \(\frac{p_{i}}{1-p_{i}}\).

The function \(\ln\left(\frac{p_{i}}{1-p_{i}}\right)\), the natural log of the odds, is known as the log odds, or more commonly the logit. This is the link function for the logistic regression model.
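As a quick numeric check of these definitions, the sketch below converts a single probability into odds and then into log odds. The value `p = 0.2` is a hypothetical number chosen for illustration; `qlogis()` is base R's built-in logit function.

```r
# Hypothetical probability, chosen only for illustration
p <- 0.2

# Odds: probability of the event divided by probability of no event
odds <- p / (1 - p)

# Log odds (logit): natural log of the odds
log_odds <- log(odds)

# Base R's qlogis() computes the same logit directly
all.equal(log_odds, qlogis(p))
```

A probability of 0.2 corresponds to odds of 0.25, that is, 1 to 4.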

Link function

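The logit function maps a probability, which only has a range from 0 to 1, onto the whole real line. Its inverse maps any value of the linear predictor back to a probability between 0 and 1, which is how the model connects a binary outcome (only 0 or 1) to a continuous probability.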

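library(ggplot2)
# Stay strictly inside (0, 1): logit(0) and logit(1) are -Inf and Inf
p <- seq(0.01, 0.99, by = 0.01)
logit.p <- log(p / (1 - p))
ggplot(data.frame(logit.p, p), aes(x = logit.p, y = p)) +
  geom_line() +
  labs(x = "logit(p)", title = "The logit transformation") +
  theme_bw()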
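Going in the other direction, the inverse of the logit turns any log-odds value back into a probability. The sketch below applies the inverse by hand and compares it with base R's `plogis()`. The values in `eta` are hypothetical log-odds chosen for illustration.

```r
# Hypothetical log-odds values spanning negative to positive
eta <- c(-5, -1, 0, 1, 5)

# Inverse logit written out by hand
p_hat <- exp(eta) / (1 + exp(eta))

# plogis() is base R's inverse logit; qlogis() is the logit
all.equal(p_hat, plogis(eta))
all.equal(qlogis(p_hat), eta)

# Every result is a valid probability, and a log odds of 0 maps to 0.5
range(p_hat)
```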

Logistic Regression

The logistic model then relates the probability of an event to a linear combination of X’s.

\[ \log\left( \frac{p_{i}}{1-p_{i}} \right) = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \ldots + \beta_{p}x_{pi} \]

This means the relationship between \(X\) and the probability of success is nonlinear, but the relationship between \(X\) and the log-odds is linear.
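A small numeric sketch can make this concrete. The coefficients `b0` and `b1` below are hypothetical values chosen for illustration, not estimates from any data in these notes. Equal steps in \(X\) move the log-odds by the same amount each time, but move the probability by different amounts.

```r
# Hypothetical intercept and slope, for illustration only
b0 <- -2
b1 <- 1

x <- 0:4
log_odds <- b0 + b1 * x      # linear in x
p <- plogis(log_odds)        # inverse logit: nonlinear in x

# Step-to-step changes on each scale
diff(log_odds)   # constant, equal to b1 at every step
diff(p)          # varies from step to step
```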

Deriving the Odds Ratio (OR)

Logistic regression model: \(logit(p) = \beta_0 + \beta_1 X\)
The odds at \(X = x\) are \(e^{\beta_0 + \beta_1 x}\)
The odds at \(X = x+1\) are \(e^{\beta_0 + \beta_1 (x+1)} = e^{\beta_0 + \beta_1 x} \cdot e^{\beta_1}\)
The odds ratio (OR) for a 1 unit change in \(X\) is then \(e^{\beta_1}\)

The OR measures how the odds of success change for a one-unit increase in \(X\), holding other variables constant.
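To make the last step of the derivation explicit, the odds ratio is the odds at \(X = x+1\) divided by the odds at \(X = x\):

\[ OR = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = \frac{e^{\beta_0 + \beta_1 x} \cdot e^{\beta_1}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1} \]

The common factor cancels, so the result does not depend on the starting value \(x\).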

Interpreting the OR

Consider a binary outcome with values YES, coded as 1, and NO, coded as 0.

OR = 1: the odds of a YES response are the same at every value of the explanatory variable.
OR > 1: as the explanatory variable value increases, a YES response becomes more likely.
OR < 1: as the explanatory variable value increases, a YES response becomes less likely.

Confidence Intervals

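The OR is a nonlinear (exponential) transformation of \(\beta\); it is \(\beta\) that enters the model linearly, and its estimate is approximately normally distributed.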
This means that a CI for the OR is created by calculating a CI for \(\beta\), and then exponentiating the endpoints.
A 95% CI for the OR is calculated as:

\[e^{\hat{\beta} \pm 1.96 SE_{\beta}} \]

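The same approach works for any \(k\) unit change in \(x\): multiply \(\hat{\beta}\) and its CI endpoints by \(k\), then exponentiate. The confidence interval is linear (symmetric around the estimate) only on the untransformed scale of the \(\beta\)’s, not on the odds ratio scale.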
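As a sketch of the \(k\) unit case for a model with a single predictor, the odds at \(X = x+k\) divided by the odds at \(X = x\) is

\[ \frac{e^{\beta_0 + \beta_1 (x+k)}}{e^{\beta_0 + \beta_1 x}} = e^{k\beta_1} \]

so a 95% CI for the OR for a \(k\) unit change is

\[ e^{k\left(\hat{\beta} \pm 1.96 SE_{\beta}\right)} \]

Setting \(k = 1\) recovers the interval given above.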

Example: Depression

Let’s fit a model to examine the effect that identifying as female (sex) has on a depression diagnosis (cases).

dep_sex_model <- glm(cases ~ sex, data=depress, family="binomial")
summary(dep_sex_model)


Call:
glm(formula = cases ~ sex, family = "binomial", data = depress)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.3511     0.6867  -4.880 1.06e-06 ***
sex           1.0386     0.3767   2.757  0.00583 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 268.12  on 293  degrees of freedom
Residual deviance: 259.40  on 292  degrees of freedom
AIC: 263.4

Number of Fisher Scoring iterations: 5

Calculate the Odds Ratio

We exponentiate the coefficients to back-transform the \(\beta\) estimates into odds ratios.

library(gtsummary)  # provides tbl_regression()
tbl_regression(dep_sex_model, exponentiate = TRUE)

Characteristic	OR	95% CI	p-value
sex	2.83	1.40, 6.21	0.006
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Females have 2.8 (95% CI 1.4, 6.2) times the odds of showing signs of depression compared to males (p = 0.006).

Important

Note the multiplicative effect language: “times the odds”, not just “higher odds”.
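The odds ratio can also be reproduced by hand from the `summary()` output, which gives a chance to apply the 95% CI formula directly. The sketch below copies the estimate and standard error for `sex` from that output and computes a Wald interval. The variable names are made up for this illustration.

```r
# Values copied from the sex row of the summary() output above
beta_hat <- 1.0386   # Estimate
se_hat   <- 0.3767   # Std. Error

# Odds ratio: exponentiate the coefficient
or_hat <- exp(beta_hat)

# Wald 95% CI: build the interval for beta, then exponentiate the endpoints
wald_ci <- exp(beta_hat + c(-1, 1) * 1.96 * se_hat)

round(c(OR = or_hat, lower = wald_ci[1], upper = wald_ci[2]), 2)
```

The odds ratio matches the table at 2.83, while the Wald interval comes out at roughly 1.35 to 5.91, close to but not identical to the table's 1.40 to 6.21. One plausible reason, stated here as an assumption rather than something these notes confirm, is that the table reports a profile-likelihood interval instead of the Wald formula.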

Multiple Logistic Regression

Let’s continue with the depression model, but now also include age and income as potential predictors of symptoms of depression.

mvmodel <- glm(cases ~ age + income + sex, data=depress, family="binomial")
tbl_regression(mvmodel, exponentiate = TRUE)

Characteristic	OR	95% CI	p-value
age	0.98	0.96, 1.00	0.020
income	0.96	0.94, 0.99	0.009
sex	2.53	1.23, 5.66	0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

The odds of a female being diagnosed with depression are 2.53 (95% CI 1.23, 5.66) times the odds for males, after adjusting for the effects of age and income (p = 0.016).

Model Fit

Pseudo \(R^{2}\): Not appropriate for logistic regression.
Hosmer and Lemeshow (1980) “Goodness of Fit”: Takes a “measure the residuals” approach to estimate how well the model fits the data.
- Implemented in the R package: MKmisc, function HLgof.test
Prediction Accuracy: Using the model to calculate predicted probabilities of \(Y=1\), how often does the model prediction match the data?
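The prediction accuracy idea can be sketched with simulated data so that the example runs on its own. Everything below is an assumption made for illustration: the simulated data frame `sim`, the true coefficients used to generate it, and the 0.5 cutoff for turning a predicted probability into a predicted class. None of it comes from the depression data.

```r
set.seed(1)

# Simulate a binary outcome whose log odds are linear in x (hypothetical coefficients)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
sim <- data.frame(x = x, y = y)

# Fit the logistic regression model
fit <- glm(y ~ x, data = sim, family = "binomial")

# Predicted probabilities of Y = 1 for each observation
p_hat <- predict(fit, type = "response")

# Classify as 1 when the predicted probability exceeds the assumed 0.5 cutoff
y_pred <- as.integer(p_hat > 0.5)

# Cross-tabulate predictions against the observed outcome
conf_mat <- table(predicted = y_pred, observed = sim$y)
conf_mat

# Proportion of observations where the prediction matches the data
accuracy <- mean(y_pred == sim$y)
accuracy
```

The 0.5 cutoff is only one possible choice; a different cutoff changes the confusion matrix and the accuracy.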