Data Analysis for Graduate Research – Simple Linear Regression Modeling

Purpose of Regression Modeling

Learn more about the relationship between several independent or predictor variables and a quantitative dependent (response) variable.
Regression is widely used in research as it allows us to ask the general question “what is the best predictor of…”, and does “additional variable A” or “additional variable B confound the relationship between my explanatory and response variable?”

Both Regression and Correlation can be used to

Descriptive: Draw inferences regarding the relationship
Predictive: Predict the value of for given values of one or more ’s.

Examples in practice

Educational researchers might want to learn about the best predictors of success in high-school.
Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt to their new country of residence.
Biologists may want to find out which factors (i.e. temperature, barometric pressure, humidity, etc.) best predict caterpillar reproduction.

Example - Lung function

Lung function data were obtained from an epidemiological study of households living in four areas with different amounts and types of air pollution. The data set used in PMA6 is a subset of the total data. In this example we use only the data taken on the fathers, all of whom are nonsmokers.

One of the major early indicators of reduced respiratory function is FEV1 or forced expiratory volume in the first second (amount of air exhaled in 1 second). Since it is known that taller males tend to have higher FEV1, we wish to determine the relationship between height and FEV1. We can use regression analysis for both a descriptive and predictive purpose.

Descriptive: Describing the relationship between FEV1 and height
Predictive: Determine the expected or normal FEV1 for a given height

Visualize the relationship

Show the code

ggplot(fev, aes(y=FFEV1, x=FHEIGHT)) + 
  geom_point() + geom_smooth(se=FALSE, col="blue") + 
  geom_smooth(se=FALSE, method = "lm", col="red") + 
      xlab("Height") + ylab("FEV1") + 
      ggtitle("Scatterplot and Regression line of FEV1 \n Versus Height for Males.") + theme_bw()

There does appear to be a tendency for taller men to have higher FEV1. Since this relationship is reasonably linear (the blue loess line is similar to the red linear line) we can write the model the population average FEV as a linear function of height :

The intercept parameter, , represents where the line crosses the y-axis when . The slope parameter, , represents the change in per 1 unit .

Unifying model framework

We know that there is always random noise in real data (DATA = MODEL FIT + RESIDUAL) so we introduce a random error term, and assume the model:

This model states that the random variable to be made up of a predictable part (a linear function of ) and an unpredictable part (the random error, ). The error (residual) term includes the effects of all other factors, known or unknown.

Least Squares Estimation

Most common method of fitting a straight line to two variables.
Also known as “Ordinary Least Squares (OLS)”
Calculates sample statistics and to estimate the population parameter values and
The estimated mean function is
The fitted value, , is the estimated value for point , calculated by plugging in a value for and calculating the result.
The residual is the difference between the observed and the fitted value:

Least Squares Estimation

The estimates and are found such that they minimize the sum of the squared residuals (the unexplained residual error)

For simple linear regression the regression coefficient estimates that minimize the sum of squared errors can be calculated as:

where is the correlation coefficient between and .

term	estimate	std.error	statistic	p.value
(Intercept)	-4.087	1.152	-3.548	0.001
FHEIGHT	0.118	0.017	7.106	0.000

Characteristic	Beta	95% CI¹	p-value
FHEIGHT	0.12	0.09, 0.15	<0.001
No. Obs.	150
R²	0.254
¹ CI = Confidence Interval

Simple Linear Regression Modeling

Purpose & Estimation

Purpose of Regression Modeling

Examples in practice

Example - Lung function

Visualize the relationship

Unifying model framework

Least Squares Estimation

Least Squares Estimation

Partitioning the Variance

Revisiting the Sum of Squares

Sum of Squares - Regression

Fitting the model

Least Squares Estimation - in R

Using this model

Other facts about LS regression

Assumptions

Mathematical Model

Assumption - Independence

Assumption - Linearity

Assumption - Normality

Assumption - Homoscedasticity

Out of range predictions

Model-check

Inference

Distribution of parameter estimates

Calculating Confidence Intervals

Hypotheis Testing

Write a conclusion