Data Analysis for Graduate Research – Describing Distributions of Data

Meet the Palmer Penguins

library(palmerpenguins)
library(ggplot2) # provides xlab() and coord_flip(), used in the plots below
pen <- penguins # short alias so we don't have to type out `penguins` every time

Single Categorical

Frequencies (N)

table(pen$species)


   Adelie Chinstrap    Gentoo 
      152        68       124

Percents (%)

table(pen$species) |> proportions() |> round(digits=2)


   Adelie Chinstrap    Gentoo 
     0.44      0.20      0.36

Adelie penguins make up 44% of the sample (n = 152).

sjPlot::plot_frq(pen$species) + xlab("Species")

Include both the count (N) and the percent (%).
You don’t need to describe every bar, just the one or two that stand out, such as the largest and smallest, or the categories you care about.
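
One way to get both numbers in a single object is sketched below. It assumes the palmerpenguins package is installed; the names `counts`, `pct`, and `species_summary` are illustrative and not part of the original notes.

```r
library(palmerpenguins)

# Counts for one categorical variable
counts <- table(penguins$species)

# Convert the counts to percents of the total, rounded to one decimal
pct <- round(100 * proportions(counts), 1)

# Put N and % side by side so both can be reported together
species_summary <- data.frame(
  species = names(counts),
  n       = as.vector(counts),
  percent = as.vector(pct)
)
species_summary
```

The Adelie row should agree with the count and percent reported in the tables above.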

Single Numeric

summary(pen$bill_depth_mm)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  13.10   15.60   17.30   17.15   18.70   21.50       2

The average bill depth is 17.15mm, with a median of 17.3mm.

ggpubr::gghistogram(pen$bill_depth_mm, add_density = TRUE)

ggpubr::ggviolin(pen$bill_depth_mm, add = c("jitter", "boxplot")) + coord_flip()

The distribution of bill depth appears to be bimodal with peaks around 15 and 18mm.

sd(pen$bill_depth_mm, na.rm=TRUE)

[1] 1.974793

IQR(pen$bill_depth_mm, na.rm=TRUE)

[1] 3.1

Bill depth ranges from 13.1 to 21.5mm, with an IQR of 3.1mm and a standard deviation of 1.97mm.

Describe the center, shape, and spread.
Include the numbers.
Always interpret them in the context of the problem.

The average penguin bill depth is 17.15mm, with a standard deviation of 1.97mm. Bill depths range from 13.1 to 21.5mm, and the distribution is bimodal with peaks around 15 and 18mm; otherwise, no skew or outliers are apparent.
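
The numbers used in a description like this can be collected in one step. The following is a minimal sketch in base R, assuming palmerpenguins is installed; `depth` and `depth_summary` are illustrative names, not part of the original notes.

```r
library(palmerpenguins)

depth <- penguins$bill_depth_mm

# Center and spread in one named vector.
# na.rm = TRUE drops the missing values before computing each statistic.
depth_summary <- c(
  mean   = mean(depth, na.rm = TRUE),
  median = median(depth, na.rm = TRUE),
  sd     = sd(depth, na.rm = TRUE),
  iqr    = IQR(depth, na.rm = TRUE),
  min    = min(depth, na.rm = TRUE),
  max    = max(depth, na.rm = TRUE)
)
round(depth_summary, 2)
```

Shape still has to be judged from a plot; these statistics alone would not reveal the bimodal pattern.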

How to create graphs

As in the data management section: first identify what you want to do, then look up how to do it.
Don’t expect to remember the exact code yet; just know where to find an example and copy from there.
copy/paste/pray
Keep graphs simple until you get more comfortable.
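
As an illustration of a simple starting point, here is a minimal ggplot2 histogram of bill depth. It is a sketch that assumes ggplot2 and palmerpenguins are installed; the choice of 30 bins and the axis labels are arbitrary and can be changed once the basic plot works.

```r
library(palmerpenguins)
library(ggplot2)

# One numeric variable, one geom, readable axis labels: nothing else yet
p <- ggplot(penguins, aes(x = bill_depth_mm)) +
  geom_histogram(bins = 30, na.rm = TRUE) + # na.rm = TRUE drops missing values without a warning
  labs(x = "Bill depth (mm)", y = "Count")

p
```

From here, extra layers such as titles or themes can be copied in one at a time from the references listed below.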

Additional Materials

PMA6 Chapter 4
Applied Stats course Notes Chapter 2
sjPlot vignette: https://strengejacke.github.io/sjPlot/index.html
ggpubr vignette: https://rpkgs.datanovia.com/ggpubr/
ggplot2 vignette: https://ggplot2.tidyverse.org/index.html
gtsummary vignette: https://www.danieldsjoberg.com/gtsummary/index.html
R graphics cookbook: https://r-graphics.org/

Inspiration

https://r-graph-gallery.com/

Bonus

gtsummary produces a nice summary table of multiple variables, which makes it a great option for your Table 1.

library(gtsummary)

pen %>% select(island, bill_depth_mm) %>%
  tbl_summary()

Characteristic     N = 344¹
island
  Biscoe           168 (49%)
  Dream            124 (36%)
  Torgersen        52 (15%)
bill_depth_mm      17.30 (15.60, 18.70)
  Unknown          2
¹ n (%); Median (IQR)

Default summary statistics display

pen %>% select(island, bill_depth_mm) %>%
  tbl_summary(statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ))

Characteristic     N = 344¹
island
  Biscoe           168 / 344 (49%)
  Dream            124 / 344 (36%)
  Torgersen        52 / 344 (15%)
bill_depth_mm      17.15 (1.97)
  Unknown          2
¹ n / N (%); Mean (SD)

Custom (preferred) summary statistics display
