2024-11-04
There appears to be a tendency for taller men to have higher FEV1, but FEV1 also decreases with age.
We need a model that can incorporate information from multiple variables at the same time.
It is no longer a 2D regression line, but a \(p\)-dimensional regression plane.
The mathematical model for multiple linear regression equates the value of the continuous outcome \(y_{i}\) to a linear combination of multiple predictors \(x_{1} \ldots x_{p}\) each with their own slope coefficient \(\beta_{1} \ldots \beta_{p}\).
\[ y_{i} = \beta_{0} + \beta_{1}x_{1i} + \ldots + \beta_{p}x_{pi} + \epsilon_{i}\]
where \(i\) indexes the observations, \(i = 1 \ldots n\), and \(j\) indexes the predictors, \(j = 1 \ldots p\).
The assumptions on the residuals \(\epsilon_{i}\) still hold: they are independent and normally distributed with mean 0 and constant variance \(\sigma^{2}\).
Find the values of \(b_j\) that minimize the difference between the value of the dependent variable predicted by the model \(\hat{y}_{i}\) and the true value of the dependent variable \(y_{i}\).
\[ \hat{y}_{i} - y_{i} \quad \mbox{ where } \quad \hat{y}_{i} = \sum_{j=1}^{p}X_{ij}b_{j}\]
AKA: Minimize the sum of the squared residuals:
\[ \sum_{i=1}^{n} \left(y_{i} - \sum_{j=1}^{p}X_{ij}b_{j}\right)^{2}\]
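The least squares criterion above has the closed-form solution \(b = (X^{T}X)^{-1}X^{T}y\). A minimal sketch in R on simulated data (all variable names here are made up for illustration), checking the closed-form answer against `lm()`:

```r
set.seed(42)
n <- 50
age    <- runif(n, 25, 65)    # simulated predictor 1
height <- rnorm(n, 68, 3)     # simulated predictor 2
fev    <- -2 - 0.03 * age + 0.11 * height + rnorm(n, 0, 0.5)

# Design matrix with a leading column of 1's for the intercept
X <- cbind(1, age, height)

# Closed-form least squares solution: b = (X'X)^{-1} X'y
b <- solve(t(X) %*% X, t(X) %*% fev)

# lm() minimizes the same sum of squared residuals, so coefficients match
coef(lm(fev ~ age + height))
```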
Fitting a regression model in R with more than one predictor is done by adding each variable to the right-hand side of the model formula, connected with a `+`.
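For example, a runnable sketch of the two-predictor fit (the data frame here is a simulated stand-in; the printed output in these notes comes from the actual course data):

```r
# Simulated stand-in for the fathers' data so this sketch runs on its own
set.seed(1)
fev <- data.frame(FAGE    = sample(25:55, 150, replace = TRUE),
                  FHEIGHT = rnorm(150, 69, 3))
fev$FEV1 <- -2.8 - 0.027 * fev$FAGE + 0.114 * fev$FHEIGHT + rnorm(150, 0, 0.5)

# Each additional predictor is connected with a `+` on the right-hand side
mlr_model <- lm(FEV1 ~ FAGE + FHEIGHT, data = fev)

summary(mlr_model)$coefficients   # estimate, SE, t statistic, p-value per term
confint(mlr_model)                # 95% confidence interval for each coefficient
```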
```
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -2.76     1.14        -2.43 1.65e- 2
2 FAGE         -0.0266   0.00637     -4.18 4.93e- 5
3 FHEIGHT       0.114    0.0158       7.25 2.25e-11
```

```
                  2.5 %      97.5 %
(Intercept) -5.00919751 -0.51229620
FAGE        -0.03922545 -0.01405323
FHEIGHT      0.08319434  0.14559974
```
| Characteristic | Beta (95% CI)¹ | p-value |
|---|---|---|
| (Intercept) | -2.8 (-5.0 to -0.51) | 0.016 |
| FAGE | -0.03 (-0.04 to -0.01) | <0.001 |
| FHEIGHT | 0.11 (0.08 to 0.15) | <0.001 |
| Adjusted R² | 0.325 | |
| No. Obs. | 150 | |

¹ CI = Confidence Interval
The corresponding regression equation now is
\[ \hat{y}_{i} = -2.76 - 0.027(age) + 0.114(height) \]
Intercept
The intercept is interpreted as the predicted outcome when all covariates are set to 0.
\[ \hat{y}_{i} = -2.76 - 0.027(age) + 0.114(height) \]
A father who is 0 years old and 0 inches tall has an expected FEV1 of -2.76 L.
This number does not make any sense whatsoever. That is often the case, and is why regression output tables tend not to show the intercept.
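When a meaningful intercept is wanted, one common remedy (a sketch, not part of the original notes) is to center each predictor at its mean, so the intercept becomes the predicted FEV1 for a father of average age and average height. The data below are simulated stand-ins so the code runs on its own:

```r
# Simulated stand-in for the fathers' data
set.seed(615)
fev <- data.frame(FAGE    = sample(25:55, 150, replace = TRUE),
                  FHEIGHT = rnorm(150, 69, 3))
fev$FEV1 <- -2.8 - 0.027 * fev$FAGE + 0.114 * fev$FHEIGHT + rnorm(150, 0, 0.5)

# Center each predictor; the slopes are unchanged, only the intercept moves
centered <- lm(FEV1 ~ I(FAGE - mean(FAGE)) + I(FHEIGHT - mean(FHEIGHT)),
               data = fev)

coef(centered)[1]   # predicted FEV1 at the mean age and mean height
```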
Slope
\(b_{j}\) is the estimated change in \(Y\) for a 1-unit increase in \(x_{j}\), holding the value of all other variables constant. This can also be phrased as "after controlling for the other predictors...".
\[ \hat{y}_{i} = -2.76 - 0.027(age) + 0.114(height) \]
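As a worked example with the fitted equation above, compare two fathers of the same height whose ages differ by one year; every term except the age term cancels:

\[ \hat{y}_{\mbox{age}+1} - \hat{y}_{\mbox{age}} = -0.027(\mbox{age}+1) + 0.027(\mbox{age}) = -0.027 \]

So, holding height constant, each additional year of age is associated with a 0.027 L lower predicted FEV1.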
“Slope” as a difference in groups
\(b_{j}\) is the effect of being in group (\(x_{j}=1\)) compared to being in the reference (\(x_{j}=0\)) group.
Let's look at how biological sex may change the relationship between FEV1 and either height or age. The regression model now is:
\[ y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \epsilon_{i}\]
where

- \(x_{1}\) = age
- \(x_{2}\) = height
- \(x_{3} = 1\) when `BIOL.SEX = 'Female'`, and 0 when `BIOL.SEX = 'Male'`
❌ Do not manually change the variable to binary 0/1 numeric!
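Instead, keep the variable as a factor (or character) and let R construct the 0/1 indicator for you. A minimal sketch on made-up data:

```r
# Made-up example data; `sex` stays a factor, not a hand-coded 0/1 number
dat <- data.frame(
  y   = c(3.1, 2.4, 3.5, 2.2, 3.0, 2.6),
  sex = factor(c("Male", "Female", "Male", "Female", "Male", "Female"),
               levels = c("Male", "Female"))   # Male = reference level
)

# R builds the indicator behind the scenes; the coefficient is labeled
# sexFemale: the effect of Female relative to Male
coef(lm(y ~ sex, data = dat))   # sexFemale = mean(Female) - mean(Male) = -0.8
```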
Base R
```
Call:
lm(formula = FEV1 ~ AGE + HEIGHT + BIOL.SEX, data = fev_long)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.36151 -0.29656 -0.00458  0.30785  1.42583 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -2.240511   0.751707  -2.981  0.00312 ** 
AGE            -0.023539   0.004068  -5.786 1.83e-08 ***
HEIGHT          0.105089   0.010527   9.983  < 2e-16 ***
BIOL.SEXFemale -0.637746   0.078324  -8.142 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4776 on 296 degrees of freedom
Multiple R-squared:  0.6494,    Adjusted R-squared:  0.6459 
F-statistic: 182.8 on 3 and 296 DF,  p-value: < 2.2e-16
```
tbl_regression
| Characteristic | Beta (95% CI)¹ | p-value |
|---|---|---|
| (Intercept) | -2.2 (-3.7 to -0.76) | 0.003 |
| AGE | -0.02 (-0.03 to -0.02) | <0.001 |
| HEIGHT | 0.11 (0.08 to 0.13) | <0.001 |
| BIOL.SEX | | |
| Male | — | |
| Female | -0.64 (-0.79 to -0.48) | <0.001 |
| Adjusted R² | 0.646 | |
| No. Obs. | 300 | |

¹ CI = Confidence Interval
In this model `BIOL.SEX` is a categorical variable with levels `Male` and `Female`, where `Male` is the first ordered level. The estimate shown is for females compared to males.
The fitted regression equation for the model with gender is
\[ \hat{y} = -2.24 - 0.02(AGE) + 0.11(HEIGHT) - 0.64 \cdot I(BIOL.SEX = \mbox{Female}) \]
Still can use template language
The interpretation of categorical variables still falls under the template language of "for every one unit increase in \(X_{p}\), \(Y\) changes by \(b_{p}\)". So a 1-"unit" change is being female (\(X_{3}=1\)) compared to male (\(X_{3}=0\)).
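Concretely, for two people of the same age and height, the fitted equation gives

\[ \hat{y}_{\mbox{female}} - \hat{y}_{\mbox{male}} = -0.64(1) - (-0.64)(0) = -0.64 \]

i.e. females are predicted to have an FEV1 that is 0.64 L lower than males of the same age and height.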
Let’s consider the effect of city environment on FEV. For those unfamiliar with the region, these cities represent very different environmental profiles.
| Characteristic | N = 300 |
|---|---|
| AREA, n (%) | |
| Burbank | 48 (16) |
| Lancaster | 98 (33) |
| Long Beach | 38 (13) |
| Glendora | 116 (39) |
I do not do anything to the variable `AREA` itself aside from adding it into the model.
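Adding `AREA` is just one more `+` term in the model formula. A runnable sketch (with simulated stand-in data, since the course data are not bundled here, so the numbers will not match the printed output):

```r
# Simulated stand-in for fev_long
set.seed(2)
fev_long <- data.frame(
  AGE      = sample(20:60, 300, replace = TRUE),
  HEIGHT   = rnorm(300, 66, 4),
  BIOL.SEX = factor(sample(c("Male", "Female"), 300, replace = TRUE),
                    levels = c("Male", "Female")),
  AREA     = factor(sample(c("Burbank", "Lancaster", "Long Beach", "Glendora"),
                           300, replace = TRUE),
                    levels = c("Burbank", "Lancaster", "Long Beach", "Glendora"))
)
fev_long$FEV1 <- 3 - 0.02 * fev_long$AGE + 0.1 * (fev_long$HEIGHT - 60) -
  0.6 * (fev_long$BIOL.SEX == "Female") + rnorm(300, 0, 0.5)

# AREA goes into the formula untouched; R expands it to k-1 indicators
area_model <- lm(FEV1 ~ AGE + HEIGHT + BIOL.SEX + AREA, data = fev_long)
summary(area_model)
```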
Base R
```
Call:
lm(formula = FEV1 ~ AGE + HEIGHT + BIOL.SEX + AREA, data = fev_long)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.32809 -0.29573  0.00578  0.31588  1.37041 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -2.250564   0.752414  -2.991  0.00302 ** 
AGE            -0.022801   0.004151  -5.493 8.59e-08 ***
HEIGHT          0.103866   0.010555   9.841  < 2e-16 ***
BIOL.SEXFemale -0.642168   0.078400  -8.191 8.10e-15 ***
AREALancaster   0.031549   0.084980   0.371  0.71072    
AREALong Beach  0.061963   0.104057   0.595  0.55199    
AREAGlendora    0.121589   0.082097   1.481  0.13967    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4777 on 293 degrees of freedom
Multiple R-squared:  0.6529,    Adjusted R-squared:  0.6458 
F-statistic: 91.86 on 6 and 293 DF,  p-value: < 2.2e-16
```
tbl_regression
| Characteristic | Beta (95% CI)¹ | p-value |
|---|---|---|
| (Intercept) | -2.3 (-3.7 to -0.77) | 0.003 |
| AGE | -0.02 (-0.03 to -0.01) | <0.001 |
| HEIGHT | 0.10 (0.08 to 0.12) | <0.001 |
| BIOL.SEX | | |
| Male | — | |
| Female | -0.64 (-0.80 to -0.49) | <0.001 |
| AREA | | |
| Burbank | — | |
| Lancaster | 0.03 (-0.14 to 0.20) | 0.71 |
| Long Beach | 0.06 (-0.14 to 0.27) | 0.55 |
| Glendora | 0.12 (-0.04 to 0.28) | 0.14 |
| Adjusted R² | 0.646 | |
| No. Obs. | 300 | |

¹ CI = Confidence Interval
```
    AREA         AREALancaster AREALong Beach AREAGlendora
1   "Burbank"    "0"           "0"            "0"         
185 "Glendora"   "0"           "0"            "1"         
49  "Lancaster"  "1"           "0"            "0"         
150 "Long Beach" "0"           "1"            "0"         
```
Each record gets a 1 in the `AREALancaster` column when `AREA` is Lancaster, and 0 otherwise (and similarly for the other indicator columns).

You do not have to create these indicator variables manually: this is done automatically by the software for you in most programs.
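The indicator columns can be inspected directly with `model.matrix()`, which is what `lm()` constructs internally; a sketch with a small made-up factor:

```r
# Small made-up factor to show the automatic dummy coding
dat <- data.frame(
  AREA = factor(c("Burbank", "Glendora", "Lancaster", "Long Beach"),
                levels = c("Burbank", "Lancaster", "Long Beach", "Glendora"))
)

# One 0/1 column per non-reference level; Burbank (the first level) is
# absorbed into the intercept
model.matrix(~ AREA, data = dat)
```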
\[ Y = \beta_{0} + \beta_{1}x_{age} + \beta_{2}x_{ht} + \beta_{3}x_{sex} + \beta_{4}x_{4} + \beta_{5}x_{5} + \beta_{6}x_{6} + \epsilon \]
where

- \(x_{4} = 1\) when `AREA='Lancaster'`, and 0 otherwise
- \(x_{5} = 1\) when `AREA='Long Beach'`, and 0 otherwise
- \(x_{6} = 1\) when `AREA='Glendora'`, and 0 otherwise

The coefficients for the other levels of the categorical variable are interpreted as the effect of that level on the outcome compared to the reference level.
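The reference level is simply the first level of the factor. If a different comparison group makes more sense scientifically, `relevel()` changes the reference without recoding any data; a sketch with made-up data:

```r
# Made-up factor; the default reference is the alphabetically first level
dat <- data.frame(AREA = factor(rep(c("Burbank", "Lancaster", "Long Beach",
                                      "Glendora"), times = 3)))

# Make Glendora the reference; the other areas are now compared to it
dat$AREA <- relevel(dat$AREA, ref = "Glendora")
levels(dat$AREA)   # "Glendora" now comes first
```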
None of these differences are significant, so we can conclude that after controlling for age, height, and biological sex, residential area does not have an effect on a person's lung function. This can also be seen in the lack of increase in the adjusted \(R^{2}\) value compared to the simpler model.
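Because `AREA` enters as three separate indicators, the individual p-values each test one city against Burbank only. A single joint test of all three coefficients at once compares the models with and without `AREA` via `anova()`; a sketch on simulated stand-in data (the F statistic will not match the course output):

```r
# Simulated stand-in for fev_long; AREA is generated with no real effect
set.seed(3)
fev_long <- data.frame(
  AGE      = sample(20:60, 300, replace = TRUE),
  HEIGHT   = rnorm(300, 66, 4),
  BIOL.SEX = factor(sample(c("Male", "Female"), 300, replace = TRUE)),
  AREA     = factor(sample(c("Burbank", "Lancaster", "Long Beach", "Glendora"),
                           300, replace = TRUE))
)
fev_long$FEV1 <- 3 - 0.02 * fev_long$AGE + 0.1 * (fev_long$HEIGHT - 60) -
  0.6 * (fev_long$BIOL.SEX == "Female") + rnorm(300, 0, 0.5)

reduced <- lm(FEV1 ~ AGE + HEIGHT + BIOL.SEX,        data = fev_long)
full    <- lm(FEV1 ~ AGE + HEIGHT + BIOL.SEX + AREA, data = fev_long)

# F test of H0: all three AREA coefficients are simultaneously 0
anova(reduced, full)
```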
🔗 https://math615.netlify.app / Multiple Linear Regression