Describing Distributions of Data

Robin Donatello

2023-09-12

Motivation

Visualizing your data is hands down the most important thing you can learn to do. A quick plot lets you screen for:

  • Data entry errors
  • Out-of-range values
  • Mistakes in coding
  • Violations of model assumptions

Level of care depends on the audience

There are three main audiences in mind when creating data visualizations:

  1. For your eyes only (FYEO). These are quick and dirty plots without annotation, meant to be looked at once or twice.
    • You’ll create a TON of these. Don’t spend a ton of time on them.
  2. To share with others internally. These mostly need to stand on their own: axis labels, titles, colors as needed, possibly captions.
    • You’ll create a lot of these, and with practice you’ll get faster at adding the necessary annotation.
  3. Professional. These contain all the bells and whistles needed to make them publication quality.
    • You’ll create very few of these, but they demand a lot of time, detail, and thought.
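As a rough illustration of the first two levels, the sketch below draws the same bar chart twice in base R: once with no annotation, and once with a title, axis labels, and color. The species counts are copied from the frequency table reported later in this document; the title, label text, and color are illustrative choices, not a required style.

```r
# Species counts copied from the frequency table reported for the penguins sample.
species <- factor(rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(152, 68, 124)))
counts <- table(species)

# Level 1 (FYEO): one line, no annotation.
barplot(counts)

# Level 2 (internal sharing): the same plot, annotated so it can stand on its own.
# The title, labels, and color below are illustrative choices.
barplot(counts,
        main = "Penguins observed, by species",
        xlab = "Species",
        ylab = "Number of penguins",
        col  = "steelblue")
```

A publication-quality version would typically go further (consistent fonts, a caption, careful color choices), which is why those plots take much longer.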

Graphing with intent

Along with having the audience in mind, it is important to give thought to the purpose of the chart.

The effectiveness of any visualization can be measured according to how well it fulfills the tasks it was designed for (A. Cairo, 2018).

Choosing Appropriate Visualization

Roughly 75% of your choice of plot is determined by the data type

https://r-graph-gallery.com/
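To make the link between data type and plot choice concrete, here is a minimal sketch of a lookup helper. The function `suggest_plot()` is hypothetical (it is not part of any package), and its suggestions only cover the single-variable cases discussed in these notes.

```r
# Hypothetical helper: suggest a starting plot for ONE variable based on its type.
suggest_plot <- function(x) {
  if (is.factor(x) || is.character(x) || is.logical(x)) {
    "bar chart of frequencies"
  } else if (is.numeric(x)) {
    "histogram, density, boxplot, or violin plot"
  } else {
    "no default; inspect the variable first"
  }
}

# Illustrative inputs, typed in by hand for this example.
suggest_plot(factor(c("Adelie", "Gentoo", "Adelie")))  # a categorical variable
suggest_plot(c(18.7, 17.4, 18.0))                      # a numeric variable
```

The remaining part of the choice depends on the audience and the purpose of the chart.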

Meet the Palmer Penguins

library(palmerpenguins)
pen <- penguins # shorter alias, so we don't have to type out `penguins` every time

Single Categorical

Frequencies (N)

table(pen$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 

Percents (%)

table(pen$species) |> proportions() |> round(digits=2)

   Adelie Chinstrap    Gentoo 
     0.44      0.20      0.36 

Adelie penguins make up 44% of the sample (n = 152).
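Counts and percents are usually reported together, so it can help to put them in one table. The sketch below uses base R only and rebuilds the species variable from the counts printed above so that it runs without the penguins package; with the real data you would start from `table(pen$species)` instead. The name `freq_table` is just an illustrative choice.

```r
# Rebuild the species variable from the counts shown above (152, 68, 124).
species <- factor(rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(152, 68, 124)))

n <- table(species)

# One row per category, with the count (N) and the percent of the sample (%).
freq_table <- data.frame(
  species = names(n),
  n       = as.vector(n),
  percent = as.vector(round(100 * proportions(n), 1))
)
print(freq_table)
```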

sjPlot::plot_frq(pen$species) + ggplot2::xlab("Species") # xlab() comes from ggplot2, which is not attached here

Single Numeric

summary(pen$bill_depth_mm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  13.10   15.60   17.30   17.15   18.70   21.50       2 
sd(pen$bill_depth_mm, na.rm=TRUE)
[1] 1.974793


The average bill depth is 17.15 mm, with a standard deviation of 1.97 mm.

IQR(pen$bill_depth_mm, na.rm=TRUE)
[1] 3.1


Bill depth ranges from 13.1 to 21.5 mm, with a median of 17.3 mm and an IQR of 3.1 mm.
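The two styles of summary above (mean with standard deviation, median with IQR and range) can be collected in one small helper. This is a minimal sketch: `describe_numeric()` is a hypothetical name, and the input vector is made up for illustration. With the penguins data loaded, you could call `describe_numeric(pen$bill_depth_mm)` instead.

```r
# Hypothetical helper: common summaries for one numeric variable, ignoring NAs.
describe_numeric <- function(x) {
  c(n_missing = sum(is.na(x)),
    mean      = mean(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE),
    median    = median(x, na.rm = TRUE),
    iqr       = IQR(x, na.rm = TRUE),
    min       = min(x, na.rm = TRUE),
    max       = max(x, na.rm = TRUE))
}

# Made-up values, including one missing value, just to show the call.
x <- c(1, 2, 3, 4, NA)
round(describe_numeric(x), 2)
```

Reporting the number of missing values alongside the summaries makes it clear how many observations each statistic is based on.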

ggpubr::gghistogram(pen$bill_depth_mm, add_density = TRUE)

ggpubr::ggviolin(pen$bill_depth_mm, add = c("jitter", "boxplot")) + ggplot2::coord_flip() # coord_flip() comes from ggplot2

The distribution of bill depth appears to be bimodal with peaks around 15 and 18mm.
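Reading peaks off a plot by eye is usually enough, but the same idea can be sketched in code: estimate the density and look for its local maxima. The helper `density_peaks()` below is hypothetical, uses the default bandwidth of base R's `density()`, and is demonstrated on simulated data rather than the penguins measurements; a different bandwidth can give a different number of peaks.

```r
# Hypothetical helper: locations of the tallest local maxima of a kernel density estimate.
density_peaks <- function(x, n_peaks = 2) {
  d <- density(x[!is.na(x)])  # default bandwidth; missing values dropped
  # A local maximum is where the slope changes from positive to negative.
  is_peak <- which(diff(sign(diff(d$y))) == -2) + 1
  # Keep the n_peaks tallest maxima and return their x locations in increasing order.
  tallest <- head(is_peak[order(d$y[is_peak], decreasing = TRUE)], n_peaks)
  sort(d$x[tallest])
}

# Simulated two-group data (NOT the penguins data), used only to exercise the helper.
set.seed(1)
simulated <- c(rnorm(200, mean = 15, sd = 0.5), rnorm(200, mean = 18, sd = 0.5))
density_peaks(simulated)

# With the penguins data loaded: density_peaks(pen$bill_depth_mm)
```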

How to create graphs

  • As in the data management section: once you have identified what you want to do, look up how to do it.
  • Don’t expect to remember the exact code yet. Just know where to find an example, and copy from there.
    • Prior-semester SPSS students created a HackMD collaborative notes file for sharing code. [LINK]
    • Need these for other languages!
  • copy/paste/pray
  • Keep graphs simple until you get more comfortable.

References

  • PMA6 Chapter 4 for appropriate plot choices
  • Applied Stats course notes, Chapter 2.3, for examples and code
  • HackMD notes for code-specific notes
  • R-specific help on the Math 130 class page
  • sjPlot documentation: https://strengejacke.github.io/sjPlot/index.html
  • ggpubr documentation: https://rpkgs.datanovia.com/ggpubr/
  • ggplot2 documentation: https://ggplot2.tidyverse.org/index.html
  • R Graphics Cookbook: https://r-graphics.org/