Describing Relationships between variables

Robin Donatello

2023-09-18

Bivariate: Association between 2 variables

Naming conventions

Response             Explanatory
y                    x
outcome              predictor
dependent variable   independent variable
                     covariate
                     feature


Model notation: \(y \sim x\)
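In R, this notation maps directly onto a formula object. A minimal sketch, using two column names from the penguins data purely as an illustration:

```r
# A formula puts the response on the left of ~ and the explanatory variable on the right
f <- body_mass_g ~ flipper_length_mm

class(f)      # "formula"
all.vars(f)   # variable names in order: response first, then explanatory
```

Creating the formula does not touch any data; the names are only looked up when the formula is passed to a function along with a data frame.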

Types of combinations

  • Categorical response and categorical explanatory variable. (C ~ C)
  • Quantitative response and categorical explanatory variable. (Q ~ C)
  • Quantitative response and quantitative explanatory variable. (Q ~ Q)

library(tidyverse); library(ggpubr)
library(palmerpenguins);  library(sjPlot)
pen <- penguins

C ~ C

Categorical Response vs Categorical Explanatory

table(pen$species, pen$island)
           
            Biscoe Dream Torgersen
  Adelie        44    56        52
  Chinstrap      0    68         0
  Gentoo       124     0         0
table(pen$species, pen$island) |> 
  prop.table(margin=2) |> round(2)
           
            Biscoe Dream Torgersen
  Adelie      0.26  0.45      1.00
  Chinstrap   0.00  0.55      0.00
  Gentoo      0.74  0.00      0.00
  • All of the 52 penguins on Torgersen island are the Adelie species.
  • 74% of penguins on Biscoe island are Gentoo.
plot_xtab(pen$island, grp=pen$species, 
          show.total = FALSE)

Watch your margins

Always double check your work

A common mistake when building a table or plot of two categorical variables is not paying close attention to the choice of denominator. Confirm that the interpretation matches the table, and that the table matches the plot.

Row Percents

table(pen$species, pen$island) |> 
  prop.table(margin=1) |> round(2)
           
            Biscoe Dream Torgersen
  Adelie      0.29  0.37      0.34
  Chinstrap   0.00  1.00      0.00
  Gentoo      1.00  0.00      0.00

29% of Adelie penguins are on Biscoe Island.

Column Percents

table(pen$species, pen$island) |> 
  prop.table(margin=2) |> round(2)
           
            Biscoe Dream Torgersen
  Adelie      0.26  0.45      1.00
  Chinstrap   0.00  0.55      0.00
  Gentoo      0.74  0.00      0.00

74% of penguins on Biscoe island are Gentoo.
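One way to double check the choice of denominator is to confirm which margin sums to 1. The sketch below re-enters the counts from the species-by-island table as a plain matrix (the object names `counts`, `row_pct`, and `col_pct` are illustrative) so it runs without any packages:

```r
# Counts copied from the species-by-island table shown earlier
counts <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

row_pct <- prop.table(counts, margin = 1)  # denominator: species totals
col_pct <- prop.table(counts, margin = 2)  # denominator: island totals

# Whichever margin was used as the denominator should sum to 1
rowSums(row_pct)
colSums(col_pct)

# The same cell answers a different question depending on the denominator
round(row_pct["Adelie", "Biscoe"], 2)  # share of Adelie penguins that live on Biscoe
round(col_pct["Adelie", "Biscoe"], 2)  # share of Biscoe penguins that are Adelie
```

If the sums of 1 run along the rows, the interpretation should start with the row variable ("of Adelie penguins, ..."); if they run down the columns, it should start with the column variable ("of penguins on Biscoe, ...").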

Distribution of islands for each species

plot_xtab(pen$species, grp=pen$island, 
          margin = "row", show.total = FALSE)

29% of Adelie penguins are on Biscoe Island.

Distribution of species on each island

plot_xtab(pen$island, grp=pen$species, 
          margin = "row", show.total = FALSE)

74% of penguins on Biscoe island are Gentoo.

Q ~ C

Quantitative Response vs Categorical Explanatory

Mean, median, sd, IQR of the quantitative variable for each level of the categorical variable.

pen %>% group_by(species) %>% 
  summarize(mean   = mean(bill_depth_mm, na.rm = TRUE), 
            median = median(bill_depth_mm, na.rm = TRUE), 
            sd     = sd(bill_depth_mm, na.rm = TRUE), 
            IQR    = IQR(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 5
  species    mean median    sd   IQR
  <fct>     <dbl>  <dbl> <dbl> <dbl>
1 Adelie     18.3   18.4 1.22   1.5 
2 Chinstrap  18.4   18.4 1.14   1.90
3 Gentoo     15.0   15   0.981  1.5 

Gentoo penguins have a lower average bill depth than Adelie or Chinstrap penguins (15.0 mm vs. 18.3 mm and 18.4 mm, respectively). Chinstrap penguins, however, have a larger IQR (1.9 mm) than the other two species (1.5 mm).
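Each summary function above is called with `na.rm = TRUE`. A minimal sketch of why, using a toy vector (the values are made up for illustration and are not taken from the penguins data):

```r
# Toy values for illustration only, not taken from the penguins data
depth <- c(18.7, 17.4, NA, 19.3)

mean(depth)                # NA: a single missing value propagates to the result
mean(depth, na.rm = TRUE)  # mean of the three observed values
sum(is.na(depth))          # number of values dropped by na.rm = TRUE
```

Reporting how many values were dropped alongside the summary keeps the effective sample size visible.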

Overlaid histograms with density curves

gghistogram(pen, 
    x = "bill_depth_mm", fill = "species", 
    add_density = TRUE, add="mean")

Side by side violin plots with boxplots

ggviolin(pen, 
  x="species", y = "bill_depth_mm", 
  color = "species", add = c("mean", "boxplot"))

The distributions of bill depth are fairly normal for each species, with some higher-end values causing a slight right skew for Adelie and Gentoo.

Q ~ Q

Quantitative Response vs Quantitative Explanatory

cor(pen$flipper_length_mm, pen$body_mass_g, 
    use="pairwise.complete.obs")
[1] 0.8712018

Penguin flipper length (mm) has a strong positive correlation with body mass (g), r = 0.87.
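To make the correlation coefficient less of a black box, here is a sketch that computes Pearson's r by hand and compares it with `cor()`. The vectors are toy values invented for illustration (not the penguins data), and the names `xc`, `yc`, `r_manual`, and `r_builtin` are arbitrary:

```r
# Toy values for illustration only, not taken from the penguins data
x <- c(181, 186, NA, 195, 210, 220)
y <- c(3750, 3800, 3250, NA, 4400, 5700)

# Keep only the pairs where both values are observed
ok <- complete.cases(x, y)
xc <- x[ok]
yc <- y[ok]

# Pearson's r: co-variation of x and y, scaled by the variation in each
r_manual <- sum((xc - mean(xc)) * (yc - mean(yc))) /
  sqrt(sum((xc - mean(xc))^2) * sum((yc - mean(yc))^2))

r_builtin <- cor(x, y, use = "pairwise.complete.obs")

all.equal(r_manual, r_builtin)
```

With only two variables, `use = "pairwise.complete.obs"` amounts to dropping any pair in which either value is missing, which is what `complete.cases()` does here.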

ggscatter(pen, 
  x="flipper_length_mm", y = "body_mass_g")

ggscatter(pen, 
  x="flipper_length_mm", y = "body_mass_g", 
  add = "loess", conf.int = TRUE)

The relationship between flipper length and body mass in penguins is fairly linear, but there may be clustering on a third variable. There appear to be two groups, below and above a flipper length of about 205 mm.
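One way to explore the possible clustering is to color the points by a candidate third variable. The sketch below uses species, which is an assumption made here for illustration; the slides do not say which variable is responsible.

```r
library(ggpubr)
library(palmerpenguins)

# Assumption: species is one plausible third variable; the slides do not name it
p <- ggscatter(penguins,
  x = "flipper_length_mm", y = "body_mass_g",
  color = "species")
p
```

If the two apparent groups line up with the colors, the third variable is worth carrying into the next stage of the analysis.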