Preparing your data for analysis

Robin Donatello

2023-08-30

A line diagram of Figure 1.1 from the book R for Data Science

Data science process

Requirements

These notes assume the following…

  • you have R and RStudio installed
  • you have at least the tidyverse and here packages installed
  • you have converted your Math 615 folder to an R project
  • you have successfully rendered a Quarto document to PDF

See the Applied Stats Course Notes (ASCN), Ch 19.1-19.8, for details if you still need help.

Import

  1. Open your MATH 615 R Project
  2. Create a new Quarto file named dm_dataset.qmd where dataset is YOUR dataset name. E.g. dm_addhealth.qmd.
  3. Save this file in your Math615/scripts folder.
  4. Copy the following code into a new code chunk.
library(tidyverse)
raw <- read_csv(here::here("data", "data.csv"))
  5. Replace data.csv with YOUR data set name exactly as it appears in your Files pane (bottom right).
  6. Run this code chunk only (do not render the document).

Confirm import was successful

Okay, did it work?

  • Look in the top right Environment pane. Do you see a data set named raw? Does it have the expected number of rows and columns?
  • Click on the table icon to open the data set in a spreadsheet-like view. Are the variable names correct? Does the data look correct?
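
You can ask R the same questions with code instead of clicking. This is a minimal sketch, assuming a tiny made-up CSV written to a temporary file stands in for your own data. The name raw_demo and the file contents are hypothetical and are not part of the course data.

```r
library(readr)

# Hypothetical stand-in for your own file: a tiny CSV in a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Island,Body Mass (g)",
    "Torgersen,3750",
    "Biscoe,NA",
    "Dream,3250"),
  tmp
)

raw_demo <- read_csv(tmp, show_col_types = FALSE)

dim(raw_demo)     # number of rows, then number of columns
names(raw_demo)   # variable names exactly as they were read from the file
head(raw_demo)    # the first few rows
```

With your own data you would run dim(raw), names(raw) and head(raw) and compare the results against the codebook.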

Initial Data Screening

Use functions like str() or glimpse() to see what data type R has assigned to each variable in the whole data set.

glimpse(raw)    # glimpse() comes from the dplyr package (part of the tidyverse)
Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PA…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 4…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (Pygoscelis adeliae)", "Adelie Peng…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen",…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg …
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A1", "N4A2", "N5A1", "N5A2", "N6A1", "N6A2", "N7A1", "N7A2", "N8A1", "N8A2", "N9A1", "N9A2", "N10A1", "N10A2", "N11A1…
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y…
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 2007-11-16, 2007-11-16, 2007-11-15, 2007-11-15, 2007-11-09, 2007-11-09, 2007-11-09, 2007-11-09, 2007-11-15, 2007-11-15,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.7, 42.5, 34.4, 46.0, 37.8, 37.7, 35.9, 38.2, 38.8, 35.3, 40.6, 40.5, 37…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.0, 20.7, 18.4, 21.5, 18.3, 18.7, 19.2, 18.1, 17.2, 18.9, 18.6, 17.9, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194, 174, 180, 189, 185, 180, 187, 183, 187, 172, 180, 178, 178, 188, 184,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950, 3800, 3800, 3550, 3200, 31…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "FEMALE", "MALE", NA, NA, NA, NA, "FEMALE", "MALE", "MALE", "FEMALE", "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE", "M…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718, 9.46060, NA, 9.13362, 8.63243, NA, NA, NA, 8.55583, NA, 9.18528, 8.67538, 8.47827, 9.11616, 8.73762, 8.66271, 9.222…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, -25.21799, -24.89958, NA, -25.09368, -25.21315, NA, NA, NA, -25.22588, NA, -25.06691, -25.13993, -25.23319, -24.7722…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult not sampled.", NA, NA, "Nest never observed with full clutch.", "Nest never observed with full clutch.", "No blood samp…

str(raw)
tibble [344 × 17] (S3: tbl_df/tbl/data.frame)
 $ studyName          : chr [1:344] "PAL0708" "PAL0708" "PAL0708" "PAL0708" ...
 $ Sample Number      : num [1:344] 1 2 3 4 5 6 7 8 9 10 ...
 $ Species            : chr [1:344] "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" ...
 $ Region             : chr [1:344] "Anvers" "Anvers" "Anvers" "Anvers" ...
 $ Island             : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ Stage              : chr [1:344] "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" ...
 $ Individual ID      : chr [1:344] "N1A1" "N1A2" "N2A1" "N2A2" ...
 $ Clutch Completion  : chr [1:344] "Yes" "Yes" "Yes" "Yes" ...
 $ Date Egg           : Date[1:344], format: "2007-11-11" "2007-11-11" "2007-11-16" "2007-11-16" ...
 $ Culmen Length (mm) : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ Culmen Depth (mm)  : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ Flipper Length (mm): num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ Body Mass (g)      : num [1:344] 3750 3800 3250 NA 3450 ...
 $ Sex                : chr [1:344] "MALE" "FEMALE" "FEMALE" NA ...
 $ Delta 15 N (o/oo)  : num [1:344] NA 8.95 8.37 NA 8.77 ...
 $ Delta 13 C (o/oo)  : num [1:344] NA -24.7 -25.3 NA -25.3 ...
 $ Comments           : chr [1:344] "Not enough blood for isotopes." NA NA "Adult not sampled." ...
 - attr(*, "spec")=
  .. cols(
  ..   studyName = col_character(),
  ..   `Sample Number` = col_double(),
  ..   Species = col_character(),
  ..   Region = col_character(),
  ..   Island = col_character(),
  ..   Stage = col_character(),
  ..   `Individual ID` = col_character(),
  ..   `Clutch Completion` = col_character(),
  ..   `Date Egg` = col_date(format = ""),
  ..   `Culmen Length (mm)` = col_double(),
  ..   `Culmen Depth (mm)` = col_double(),
  ..   `Flipper Length (mm)` = col_double(),
  ..   `Body Mass (g)` = col_double(),
  ..   Sex = col_character(),
  ..   `Delta 15 N (o/oo)` = col_double(),
  ..   `Delta 13 C (o/oo)` = col_double(),
  ..   Comments = col_character()
  .. )

Both views show you the variable names, data types, and what the data in the first few rows looks like.
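
If the full printout is more than you need, a compact summary of types and missing values can help. This is a sketch on a small made-up data frame (demo is a hypothetical stand-in for raw); it uses only base R functions.

```r
# Hypothetical three-row stand-in for raw
demo <- data.frame(
  Island          = c("Torgersen", "Biscoe", "Dream"),
  `Body Mass (g)` = c(3750, NA, 3250),
  `Date Egg`      = as.Date(c("2007-11-11", "2007-11-16", "2007-11-09")),
  check.names = FALSE,        # keep the spaces and parentheses in the names
  stringsAsFactors = FALSE
)

col_types   <- sapply(demo, class)    # one class per column
n_missing   <- colSums(is.na(demo))   # number of missing values per column

col_types
n_missing
```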

Initial Data Screening - single variable

You can also look at the data type for a single variable at a time.

typeof(raw$Island) 
[1] "character"
class(raw$Island) 
[1] "character"
str(raw$Island) 
 chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
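
typeof() and class() agree for a character variable like Island, but they do not always match. The sketch below uses made-up values (egg_date and island_fct are hypothetical) to show two common cases where they differ: typeof() reports how R stores the values, class() reports how R treats them.

```r
# A date is stored as a number but treated as a Date
egg_date <- as.Date("2007-11-11")
typeof(egg_date)    # "double"
class(egg_date)     # "Date"

# A factor is stored as integer codes but treated as a factor
island_fct <- factor(c("Dream", "Biscoe", "Dream"))
typeof(island_fct)  # "integer"
class(island_fct)   # "factor"
```

For data screening, class() is usually the one that tells you how a plot or summary function will handle the variable.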

Check plausibility of data values

Confirm that the values in your data match what the codebook says to expect.


Use table() for categorical variables

table(raw$Island)

   Biscoe     Dream Torgersen 
      168       124        52 

Use summary() on numeric variables to see the range of values present.

summary(raw$`Body Mass (g)`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2700    3550    4050    4202    4750    6300       2 
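
The comparison against the codebook can also be done in code. This is a sketch under stated assumptions: the vectors, the list of expected islands and the plausible range for body mass are all made up for illustration, so substitute the values from your own codebook.

```r
# Made-up data with one misspelled category and one implausible number
island    <- c("Biscoe", "Dream", "Torgersen", "dream", NA)
body_mass <- c(3750, 3800, NA, 37500, 4050)

# Hypothetical codebook entries
expected_islands <- c("Biscoe", "Dream", "Torgersen")
plausible_mass   <- c(2000, 7000)   # assumed lower and upper limits, in grams

# Categories present in the data but not in the codebook (ignoring NA)
unexpected <- setdiff(unique(island), c(expected_islands, NA))
unexpected

# Positions of numeric values outside the plausible range (NA is skipped)
out_of_range <- which(body_mass < plausible_mass[1] | body_mass > plausible_mass[2])
out_of_range
```

Here unexpected flags the lower-case "dream" and out_of_range points at the fourth value, which looks like a typo with an extra zero.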

Data Prep questions

Questions to ask yourself (and the data) while reviewing the codebook to choose variables to be used in an analysis.

  1. Are there codes that indicate missing? E.g. MISSING or -99?
  2. Do you need to make response codes more logical?
    • Some systems will record 1=YES and 2=NO. This should be changed to 1=YES and 0=NO.
  3. Do you need to recode numerical variables to categorical?
    • Sometimes categorical data will be recorded as 1, 2, 3 etc when those numbers represent named categories.
  4. Do you need to create secondary variables, such as an average across measures to create a score?
  5. Are the variable names user friendly? Mixtures of CAPS and lower case, names with spaces or special characters should all be changed.
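
Each of the five questions above maps to a short piece of code. The sketch below is one possible way to do it with dplyr, not the only one. The data frame survey, its variable names and all of its codes (-99 for missing, 1/2 for yes/no, 1 to 3 for region) are made up for illustration.

```r
library(dplyr)

# Hypothetical survey-style data; every name and code here is invented
survey <- data.frame(
  `Income ($)` = c(52000, -99, 61000, 48000),
  SMOKE        = c(1, 2, 2, 1),    # 1 = YES, 2 = NO
  region       = c(1, 3, 2, 1),    # 1 = North, 2 = Central, 3 = South
  Q1           = c(4, 5, NA, 2),
  Q2           = c(3, 5, 4, 1),
  check.names = FALSE
)

clean <- survey %>%
  # Question 5: lower-case names without spaces or special characters
  rename(income = `Income ($)`, smoke = SMOKE, q1 = Q1, q2 = Q2) %>%
  mutate(
    # Question 1: turn the missing-value code -99 into a real NA
    income = na_if(income, -99),
    # Question 2: recode 2 = NO to 0 = NO (1 = YES stays as is)
    smoke = if_else(smoke == 2, 0, smoke),
    # Question 3: attach category labels to the numeric codes
    region = factor(region, levels = c(1, 2, 3),
                    labels = c("North", "Central", "South")),
    # Question 4: score = average of the two items, skipping missing items
    score = rowMeans(cbind(q1, q2), na.rm = TRUE)
  )

clean
```

Running table() or summary() on each new variable afterwards is a quick way to confirm the recode did what you intended.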

How do I know what needs to be done?

Some of these answers will come only after you look at your data. This can be looking at the raw data itself but also looking at tables and charts generated from the data.

Often when you try to create a plot or table you will encounter an error or something odd looking that will be the notification that something has to be adjusted.
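
A typical example of such a signal: a numeric variable that was read in as text because the file used a word as its missing code. This sketch uses a made-up vector (mass_chr and the "MISSING" code are hypothetical).

```r
# Hypothetical body mass values that arrived as text
mass_chr <- c("3750", "3800", "MISSING", "3250")

# The odd result: R returns NA and warns that the argument is not numeric
first_try <- mean(mass_chr)
first_try

# The adjustment: set the text code to NA, then convert to numbers
mass_num <- as.numeric(replace(mass_chr, mass_chr == "MISSING", NA))
mean(mass_num, na.rm = TRUE)
```

The warning from the first mean() call is the notification; checking class(mass_chr) then shows why it happened.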

How do I actually make the necessary changes?

  • After you identify what you need to do, you have to find an example of how to code the task you want.
  • The Applied Stats Course Notes have specific examples of how to handle common changes
  • Take it one variable at a time.

Looking for help

  • The internets can be a wonderful thing when it comes to learning new things.
  • The R landscape moves fast. Restrict your search to the past ~4 years.
  • Use keywords like "R project" in your search, or search for package vignettes
  • Stack Overflow and the Posit Community can have gems!

You will start to build up your repertoire of bookmarked sites you like to use.

Using ChatGPT to write code

Example prompt

Using the palmer penguins dataset in R write me R code to change the factor order of the islands variable to put Dream first

  • I told it to use a built in data set
  • I told it to use R
  • I used R terminology such as “factor order” and “variable”

What did it give?

Sure, here’s an example of how you can change the factor order of the island variable in the Palmer Penguins dataset to put “Dream” first using R:

# Load necessary packages
library(palmerpenguins)

# Load the dataset
data("penguins")

# Check the current levels of the 'island' factor
levels(penguins$island)

# Reorder the levels of the 'island' factor to put "Dream" first
penguins$island <- factor(penguins$island, levels = c("Dream", "Torgersen", "Biscoe"))

# Check the new order of levels
levels(penguins$island)

In this code, we’re first loading the palmerpenguins package, which provides access to the dataset. Then we load the dataset itself using data("penguins"). We check the current levels of the island factor using levels(penguins$island).

Next, we use the factor() function to reorder the levels of the island variable. We specify the desired order of levels using the levels argument. In this case, we put “Dream” first, followed by the other islands. Finally, we check the new order of levels to make sure the change was successful.

Make sure you have the palmerpenguins package installed. You can install it using install.packages("palmerpenguins") if you haven’t done so already.

WOW!

Not only did it write the code, it wrote literate code. It has code comments (lines 1, 4, 7, 10, and 13, prefixed with a #), and a full text explanation.

This is GREAT for learning how to write code!

Plagiarism warning

Read the syllabus to see what is expected of you if you use this tool to aid in your writing: https://math615.netlify.app/syllabus#use-of-ai

⚠️ Trust but verify! AI is not always correct! Also, this does not replace the necessity of you learning.
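
One way to verify a factor reorder like the one above is to check three things: the first level, the counts per category, and the number of missing values. This is a sketch on a small made-up factor (island here is a hypothetical stand-in for penguins$island), using only base R.

```r
# Made-up stand-in for the island variable
island <- factor(c("Torgersen", "Biscoe", "Dream", "Biscoe", "Dream"))
counts_before <- table(island)

# The same reordering approach the AI suggested
reordered <- factor(island, levels = c("Dream", "Torgersen", "Biscoe"))
counts_after <- table(reordered)

levels(reordered)[1]                                      # is "Dream" first?
all(counts_before[names(counts_after)] == counts_after)   # same counts per island?
sum(is.na(reordered))                                     # any new missing values?

# Why the NA check matters: a misspelled level silently becomes NA
typo <- factor(island, levels = c("Dream", "Torgersen", "Bisco"))
sum(is.na(typo))
```

If the counts change or new NA values appear after a recode, something in the suggested code does not match your data.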

Closing thoughts

  • Do not underestimate the importance of this step
  • It will take you far, far longer than you anticipate to ‘clean’ your data
  • Writing code (in any language) will be challenging, but will pay off in the long run