2024-11-04
There appears to be a tendency for taller men to have higher FEV1, but FEV1 also decreases with age.
We need to have a robust model that can incorporate information from multiple variables at the same time.
Now it’s no longer a 2D regression line, but a p dimensional regression plane.
A regression plane in 3 dimensions: FEV1∼Height+Age
The mathematical model for multiple linear regression equates the value of the continuous outcome yi to a linear combination of multiple predictors x1…xp each with their own slope coefficient β1…βp.
yi=β0+β1x1i+…+βpxpi+ϵi
where i indexes the observations i=1…n, and j indexes the number of parameters j=1…p.
The assumptions on the residuals ϵi still hold:
Find the values of bj that minimize the difference between the value of the dependent variable predicted by the model ˆyi and the true value of the dependent variable yi.
^yi−yi where ˆyi=p∑i=1Xijbj
AKA: Minimize the sum of the squared residuals:
n∑i=1(yi−p∑i=1Xijbj)2
Fitting a regression model in R with more than 1 predictor is done by adding each variable to the right hand side of the model notation connected with a +
.
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.76 1.14 -2.43 1.65e- 2
2 FAGE -0.0266 0.00637 -4.18 4.93e- 5
3 FHEIGHT 0.114 0.0158 7.25 2.25e-11
2.5 % 97.5 %
(Intercept) -5.00919751 -0.51229620
FAGE -0.03922545 -0.01405323
FHEIGHT 0.08319434 0.14559974
Characteristic | Beta (95% CI)1 | p-value |
---|---|---|
(Intercept) | -2.8 (-5.0 to -0.51) | 0.016 |
FAGE | -0.03 (-0.04 to -0.01) | <0.001 |
FHEIGHT | 0.11 (0.08 to 0.15) | <0.001 |
Adjusted R² | 0.325 | |
No. Obs. | 150 | |
1 CI = Confidence Interval |
The corresponding regression equation now is
ˆyi=−2.76−0.027(age)+0.114(height)
Intercept
The intercept is interpreted as the predicted outcome when all covariates are set to 0.
ˆyi=−2.76−0.027(age)+0.114(height)
A father who is 0 years old, and is 0 inches tall has an expected FEV of -2.76L
This number that does not make any sense whatsoever. This is often the case, and why regression output tables tend to not show the intercept.
Slope
bj is the estimated change in Y for a 1 unit increase in xj while holding the value of all other variables constant. Can also be phrased as “after controlling for other predictors..”
ˆyi=−2.76−0.027(age)+0.114(height)
“Slope” as a difference in groups
bj is the effect of being in group (xj=1) compared to being in the reference (xj=0) group.
Let’s look at how biological sex may impact or change the relationship between FEV and either height or age. The regression model now is:
yi=β0+β1x1i+β2x2i+β3x3i+ϵi
where
❌ Do not manually change the variable to binary 0/1 numeric!
Base R
Call:
lm(formula = FEV1 ~ AGE + HEIGHT + BIOL.SEX, data = fev_long)
Residuals:
Min 1Q Median 3Q Max
-1.36151 -0.29656 -0.00458 0.30785 1.42583
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.240511 0.751707 -2.981 0.00312 **
AGE -0.023539 0.004068 -5.786 1.83e-08 ***
HEIGHT 0.105089 0.010527 9.983 < 2e-16 ***
BIOL.SEXFemale -0.637746 0.078324 -8.142 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4776 on 296 degrees of freedom
Multiple R-squared: 0.6494, Adjusted R-squared: 0.6459
F-statistic: 182.8 on 3 and 296 DF, p-value: < 2.2e-16
tbl_regression
Characteristic | Beta (95% CI)1 | p-value |
---|---|---|
(Intercept) | -2.2 (-3.7 to -0.76) | 0.003 |
AGE | -0.02 (-0.03 to -0.02) | <0.001 |
HEIGHT | 0.11 (0.08 to 0.13) | <0.001 |
BIOL.SEX | ||
Male | — | |
Female | -0.64 (-0.79 to -0.48) | <0.001 |
Adjusted R² | 0.646 | |
No. Obs. | 300 | |
1 CI = Confidence Interval |
In this model BIOL.SEX
is a categorical variable with levels Male
and Female
, where Male
is the first ordered level. The estimate shown is for females compared to males.
The fitted regression equation for the model with gender is
ˆy=−2.24−0.02∗AGE+0.11∗HEIGHT−0.64∗I(BIOL.SEX==‘Female‘)
Still can use template language
The interpretation of categorical variables still falls under the template language of “for every one unit increase in Xp, Y changes by bp”. So a 1 “unit” change is females (X3=1) compared to males (X3=0).
Let’s consider the effect of city environment on FEV. For those unfamiliar with the region, these cities represent very different environmental profiles.
Characteristic | N = 300 |
---|---|
AREA, n (%) | |
Burbank | 48 (16) |
Lancaster | 98 (33) |
Long Beach | 38 (13) |
Glendora | 116 (39) |
I do not do anything to the variable AREA
itself aside from add it into the model.
Base R
Call:
lm(formula = FEV1 ~ AGE + HEIGHT + BIOL.SEX + AREA, data = fev_long)
Residuals:
Min 1Q Median 3Q Max
-1.32809 -0.29573 0.00578 0.31588 1.37041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.250564 0.752414 -2.991 0.00302 **
AGE -0.022801 0.004151 -5.493 8.59e-08 ***
HEIGHT 0.103866 0.010555 9.841 < 2e-16 ***
BIOL.SEXFemale -0.642168 0.078400 -8.191 8.10e-15 ***
AREALancaster 0.031549 0.084980 0.371 0.71072
AREALong Beach 0.061963 0.104057 0.595 0.55199
AREAGlendora 0.121589 0.082097 1.481 0.13967
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4777 on 293 degrees of freedom
Multiple R-squared: 0.6529, Adjusted R-squared: 0.6458
F-statistic: 91.86 on 6 and 293 DF, p-value: < 2.2e-16
tbl_regression
Characteristic | Beta (95% CI)1 | p-value |
---|---|---|
(Intercept) | -2.3 (-3.7 to -0.77) | 0.003 |
AGE | -0.02 (-0.03 to -0.01) | <0.001 |
HEIGHT | 0.10 (0.08 to 0.12) | <0.001 |
BIOL.SEX | ||
Male | — | |
Female | -0.64 (-0.80 to -0.49) | <0.001 |
AREA | ||
Burbank | — | |
Lancaster | 0.03 (-0.14 to 0.20) | 0.71 |
Long Beach | 0.06 (-0.14 to 0.27) | 0.55 |
Glendora | 0.12 (-0.04 to 0.28) | 0.14 |
Adjusted R² | 0.646 | |
No. Obs. | 300 | |
1 CI = Confidence Interval |
AREA AREALancaster AREALong Beach AREAGlendora
1 "Burbank" "0" "0" "0"
185 "Glendora" "0" "0" "1"
49 "Lancaster" "1" "0" "0"
150 "Long Beach" "0" "1" "0"
AREALancaster
variable and 0 otherwise.Do not have to manually do this
This is automatically done by the software for you in most programs
Y=β0+β1∗xage+β2xht+β3xsex+β4x4+β5x5+β6x6
where
AREA='Lancaster'
, and 0 otherwiseAREA='Long Beach'
, and 0 otherwiseAREA='Glendora'
, and 0 otherwiseThe coefficients for the other levels of the categorical variable are interpreted as the effect of that variable on the outcome in compared to the reference level.
Characteristic | Beta (95% CI)1 | p-value |
---|---|---|
(Intercept) | -2.3 (-3.7 to -0.77) | 0.003 |
AGE | -0.02 (-0.03 to -0.01) | <0.001 |
HEIGHT | 0.10 (0.08 to 0.12) | <0.001 |
BIOL.SEX | ||
Male | — | |
Female | -0.64 (-0.80 to -0.49) | <0.001 |
AREA | ||
Burbank | — | |
Lancaster | 0.03 (-0.14 to 0.20) | 0.71 |
Long Beach | 0.06 (-0.14 to 0.27) | 0.55 |
Glendora | 0.12 (-0.04 to 0.28) | 0.14 |
Adjusted R² | 0.646 | |
No. Obs. | 300 | |
1 CI = Confidence Interval |
None of these differences are significant, so we can conclude that after controlling for age, height and biological sex, residential area does not have an effect on the persons lung function. This can also be noticed by the lack of increase in the R2 value compared to the simpler model
🔗 https://math615.netlify.app / Multiple Linear Regression