If we are only concerned with testing the hypothesis that the proportions of successes in two groups are equal, \(p_{1}-p_{2}=0\), we can leverage the Normal distribution and conduct a Z-test for two proportions.
However, in this class we will use the more generalizable model via the Chi-squared test of association/equal proportions.
A 30-year study was conducted with nearly 90,000 female participants (Miller AB, 2014). During a 5-year screening period, each woman was randomized to one of two groups: in the first group, women received regular mammograms to screen for breast cancer, and in the second group, women received regular non-mammogram breast cancer exams. No intervention was made during the following 25 years of the study, and we’ll consider death resulting from breast cancer over the full 30-year period.
| | Alive | Dead | Sum |
|---|---|---|---|
| Control | 44405 | 505 | 44910 |
| Mammogram | 44425 | 500 | 44925 |
| Sum | 88830 | 1005 | 89835 |
The explanatory variable is treatment (additional mammograms), and the response variable is death from breast cancer.
Are these measures associated?
Tables can be described by \(i\) rows and \(j\) columns. So the cell in the top left is \(i=1\) and \(j=1\).
\(O_{ij}\) | Alive | Dead | Total |
---|---|---|---|
Mammo | \(n_{11}\) | \(n_{12}\) | \(n_{1.}\) |
Control | \(n_{21}\) | \(n_{22}\) | \(n_{2.}\) |
Total | \(n_{.1}\) | \(n_{.2}\) | \(N\) |
In our DATA = MODEL + RESIDUAL framework, the DATA (\(n_{11}, n_{12}\), etc.) are the observed counts \(O_{ij}\), and the MODEL is the expected counts \(E_{ij}\).
Since we assume the variables are independent (unless the data show otherwise) the expected count for each cell is calculated as the row total times the column total for that cell, divided by the overall total.
\[E_{ij} = \frac{n_{i.}n_{.j}}{N}\]
In a probability framework, this is stated as: the events \(A\) (being in row \(i\)) and \(B\) (being in column \(j\)) are independent if \(P(A \cap B) = P(A)P(B) = \frac{n_{i.}}{N} \cdot \frac{n_{.j}}{N}\).
The residuals are calculated as \((O_{ij} - E_{ij})\), observed minus expected.
If mammograms were not associated with survival, the observed counts are very close to the expected counts: in the control group there were slightly fewer people alive than expected and slightly more people dead (scaled residuals \((O_{ij}-E_{ij})/\sqrt{E_{ij}}\) of about \(-0.01\) and \(+0.11\), respectively).
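A minimal R sketch of these calculations, using the counts from the table above (object names here are illustrative, not from the course code):

```r
# Observed counts from the mammogram study (table above)
obs <- matrix(c(44405, 505,
                44425, 500),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Control", "Mammogram"), c("Alive", "Dead")))

# Expected counts under independence: E_ij = (row total * column total) / N
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

obs - expected                     # raw residuals (O - E), about +/- 2.6
(obs - expected) / sqrt(expected)  # scaled residuals, about +/- 0.01 (Alive) and +/- 0.11 (Dead)
```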
The \(\chi^2\) test statistic is defined as the sum of the squared residuals, divided by the expected counts, and follows a \(\chi^2\) distribution with degrees of freedom (#rows -1)(#cols -1).
\[ \chi^{2} = \sum_{ij}\frac{(O_{ij}-E_{ij})^{2}}{E_{ij}} \]
Assumptions for this distribution to hold: the observations are independent, and the expected count \(E_{ij}\) in every cell is at least 5 (i.e., the sample is large enough).
In this example, the test statistic was 0.017 with p-value 0.895.
So there is not enough reason to believe that mammograms in addition to regular exams are associated with a reduced risk of death due to breast cancer.
This example demonstrated how we examine the residuals to see how closely our DATA fits a hypothesized MODEL.
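For reference, the full test can be reproduced with R's `chisq.test()`, which by default applies a continuity correction to 2×2 tables (a sketch using the same counts):

```r
obs <- matrix(c(44405, 505,
                44425, 500),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Control", "Mammogram"), c("Alive", "Dead")))

chisq.test(obs)   # X-squared = 0.017, df = 1, p-value = 0.895
```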
Using the Addhealth data set, what can we say about the relationship between smoking status and a person’s perceived level of general health?
Is there an association between lifetime smoking status and perceived general health?
The two variables of interest are lifetime smoking status (`eversmoke_c`) and perceived general health (`genhealth`), which has levels “Excellent”, “Very Good”, “Good”, “Fair”, and “Poor”. The percentage of smokers seems to increase as general health status decreases: almost three-quarters (73%, n=40) of those reporting poor health have smoked an entire cigarette at least once in their life, compared to 59% (n=573) of those reporting excellent health.
Plot the percent who smoked at least once within each health category. This method only works when there is a binary outcome.
```r
library(dplyr)    # data wrangling
library(ggplot2)  # plotting

addhealth %>%
  group_by(genhealth) %>%
  # percent in each health category who have ever smoked, and group size
  summarize(p = mean(eversmoke_c == "Smoked at least once", na.rm = TRUE) * 100,
            n = n()) %>%
  na.omit() %>%
  ggplot(aes(x = genhealth, y = p, color = genhealth)) +
  geom_point(aes(size = n)) +   # point size reflects group size
  scale_y_continuous(limits = c(0, 100)) +
  geom_segment(aes(x = genhealth, xend = genhealth, y = 0, yend = p)) +
  scale_color_discrete(guide = "none") + theme_bw() +
  ylab("Proportion of smokers") +
  xlab("Perceived General Health")
```
A stacked barchart works for a categorical outcome with any number of levels, but the more categories/segments there are, the harder it is to read the percentages.
```r
addhealth %>%
  count(genhealth, eversmoke_c) %>%
  na.omit() %>%
  group_by(genhealth) %>%
  mutate(p = n / sum(n) * 100) %>%   # percent within each health category
  ggplot(aes(x = genhealth, y = p, fill = eversmoke_c,
             label = paste0(round(p, 1), "%"))) +
  geom_col() + theme_bw() +
  scale_fill_discrete(name = "") +
  geom_text(position = position_stack(0.5)) +   # label each segment at its midpoint
  ylab("Proportion of smokers") +
  xlab("Perceived General Health")
```
Let \(p_{1}, p_{2}, \ldots, p_{5}\) be the true proportion of smokers in each of the 5 health categories: (1) Excellent to (5) Poor.
\(H_{0}: p_{1} = p_{2} = \ldots = p_{5}\)
\(H_{A}:\) At least one \(p_{j}\) is different.
We then compute the test statistic and make a decision about the plausibility of the alternative hypothesis.
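A sketch of the call behind the output shown below, assuming the `addhealth` data frame contains the two columns used throughout this example:

```r
# Chi-squared test of association between perceived health and ever-smoking
chisq.test(addhealth$genhealth, addhealth$eversmoke_c)
```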
```
Pearson's Chi-squared test

data:  addhealth$genhealth and addhealth$eversmoke_c
X-squared = 30.795, df = 4, p-value = 3.371e-06
```
We have strong evidence in favor of the alternative hypothesis (p < .0001).
We can conclude that there is an association between ever smoking a cigarette in their life and perceived general health (\(\chi^{2} = 30.8, df=4, p<.0001\)).
Just like with ANOVA, if we find that the Chi-squared test indicates that at least one proportion is different from the others, it’s our job to figure out which ones might be different! We will analyze the residuals to accomplish this.
The residuals are automatically stored in the model output; in R, `chisq.test()` stores the Pearson (scaled) residuals, \((O_{ij} - E_{ij})/\sqrt{E_{ij}}\), in the `$residuals` element. You can either print them out and look at the values directly or create a plot. You’re looking for cells whose observed counts are much higher or lower than expected.
The proportion of non-smokers in the excellent health category is much higher than expected if there were no relationship between these two variables.
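A sketch of one way to pull out and plot these residuals, assuming the same two columns (the mosaic plot is just one of several plotting options):

```r
smoke_health_test <- chisq.test(addhealth$genhealth, addhealth$eversmoke_c)

# Pearson residuals (O - E)/sqrt(E): large positive values mean more
# observations than expected under independence, large negative mean fewer
round(smoke_health_test$residuals, 2)

# Quick visual: cells shaded by the sign and size of the standardized residuals
mosaicplot(table(addhealth$genhealth, addhealth$eversmoke_c),
           shade = TRUE, main = "")
```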
Comparison | p.Chisq | p.adj.Chisq |
---|---|---|
Excellent : Very good | 0.002 | 0.007 |
Excellent : Good | 0.000 | 0.000 |
Excellent : Fair | 0.001 | 0.004 |
Excellent : Poor | 0.039 | 0.079 |
Very good : Good | 0.008 | 0.020 |
Very good : Fair | 0.151 | 0.252 |
Very good : Poor | 0.212 | 0.303 |
Good : Fair | 0.840 | 0.840 |
Good : Poor | 0.529 | 0.588 |
Fair : Poor | 0.498 | 0.588 |
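A table like the one above can be produced with the `pairwiseNominalIndependence()` function from the rcompanion package; a sketch, assuming the health categories form the rows of the table being compared:

```r
library(rcompanion)

health_smoke <- table(addhealth$genhealth, addhealth$eversmoke_c)

# Pairwise chi-squared tests between health categories,
# with FDR-adjusted p-values (the default adjustment method)
pairwiseNominalIndependence(health_smoke, compare = "row",
                            fisher = FALSE, gtest = FALSE, chisq = TRUE,
                            method = "fdr")
```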
Using the adjusted (for multiple comparisons) p-value column, the proportion of non-smokers in the Excellent group (41.3%) is significantly different from nearly all other groups (35.4% for Very Good, 31.3% for Good, 31.8% for Fair).
But why is it not significantly different from the 27.3% in the Poor group? That is, why is 41.3% significantly different from 35.4%, but NOT from 27.3%?
The standard error always depends on the sample size. Here, the number of individuals in the Poor health category is much smaller than in the other groups, so the margin of error is larger, making it less likely that this group will be statistically significantly different from the others.
When one or more cells have small counts (typically n < 10), the expected count \(E_{ij}\) for that cell will likely be below, or close to, 5. This means the distribution of the test statistic can no longer be modeled well by the \(\chi^{2}\) distribution, so we again turn to a non-parametric test that uses a randomization-based method: Fisher’s exact test.
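A sketch of the call that produces the output below, using the same two columns:

```r
# Fisher's exact test; simulate.p.value uses Monte Carlo simulation
# when the exact computation would be too large (R will say when this is required)
fisher.test(addhealth$genhealth, addhealth$eversmoke_c, simulate.p.value = TRUE)
```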
```
Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)

data:  addhealth$genhealth and addhealth$eversmoke_c
p-value = 0.0004998
alternative hypothesis: two.sided
```
Then you can use the `pairwiseNominalIndependence` function for all pairwise comparisons again, with the `fisher` argument set to `TRUE`.
Note: the `simulate.p.value = TRUE` argument here is only needed if the randomization space is larger than your workspace; R will let you know when this is needed.
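A sketch of that call, mirroring the earlier pairwise-comparison sketch but with the `fisher` argument turned on:

```r
library(rcompanion)

health_smoke <- table(addhealth$genhealth, addhealth$eversmoke_c)

# Pairwise Fisher's exact tests between health categories (FDR-adjusted p-values)
pairwiseNominalIndependence(health_smoke, compare = "row",
                            fisher = TRUE, gtest = FALSE, chisq = FALSE,
                            method = "fdr")
```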
We can still use the \(\chi^{2}\) test of association to compare categorical variables with more than 2 levels. In this case we generalize the question to: is the distribution of one variable the same across the levels of another variable? In this sense, it is very much like an ANOVA.
Mathematically, the \(\chi^{2}\) test of association is exactly the same as a test of equal proportions.
Let’s analyze the relationship between a person’s income and perceived level of general health.
The categorical explanatory variable is income, binned into 4 ranges (`income_cat`).
The income distribution is nearly flat for those who rate themselves in excellent or very good condition. However, the proportion of individuals in the lower income categories increases as perceived general health decreases. Over 80% of those who rate themselves as in Poor condition have an annual income less than $35,500.
Let \(p_{1e}, p_{2e}, p_{3e}, p_{4e}\) be the true proportion of income brackets within those that say they are in excellent health.
Let \(p_{1v}, p_{2v}, p_{3v}, p_{4v}\) be the true proportion of income brackets within those that say they are in very good health.
… and so forth for each health category.
\(H_{0}:\) The income distribution (\(p_{1j}, p_{2j}, p_{3j}, p_{4j}\)) is the same in each of the \(j\) general health categories (no association)
\(H_{A}:\) The income distribution differs for at least one of the general health categories (association)
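A sketch of the calls behind the test output and the expected counts shown below, assuming the same `addhealth` columns:

```r
income_health_test <- chisq.test(addhealth$genhealth, addhealth$income_cat)

income_health_test            # test statistic, df, and p-value
income_health_test$expected   # expected count E_ij for each cell
```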
```
Pearson's Chi-squared test

data:  addhealth$genhealth and addhealth$income_cat
X-squared = 100.87, df = 12, p-value = 3.756e-16
```
```
                    addhealth$income_cat
addhealth$genhealth  [10225, 25500) [25500, 35500) [35500, 51000) [51000,250000]
  Excellent              227.563693     187.881768     187.881768     170.672770
  Very good              452.481297     373.578865     373.578865     339.360973
  Good                   359.280146     296.629872     296.629872     269.460110
  Fair                    77.030604      63.598221      63.598221      57.772953
  Poor                     7.644258       6.311274       6.311274       5.733194
```
There is sufficient evidence to conclude that the distribution of income is not the same across general health status categories. The Chi-squared test of association is valid here because the sample size is large and all expected counts are over 5.
There is a slew of methods to analyze categorical data. To learn more, start with *Categorical Data Analysis* by Alan Agresti, who has written extensively on the subject.