2025-09-10
These notes assume the following…
the tidyverse and here packages installed. See ASCN Ch 19.1-19.8 for details if you still need help.
data folder.
Okay, did it work?
raw? Does it have an expected number of rows and columns?
Use functions like str() or glimpse() to see what data type R thinks your variables are for the whole data set.
Rows: 344
Columns: 3
$ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin (P…
$ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torg…
$ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950, 3800, 3800, 3550, 3200, 3150, 39…
tibble [344 × 3] (S3: tbl_df/tbl/data.frame)
$ Species : chr [1:344] "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" ...
$ Island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ Body Mass (g): num [1:344] 3750 3800 3250 NA 3450 ...
- attr(*, "spec")=
.. cols(
.. studyName = col_character(),
.. `Sample Number` = col_double(),
.. Species = col_character(),
.. Region = col_character(),
.. Island = col_character(),
.. Stage = col_character(),
.. `Individual ID` = col_character(),
.. `Clutch Completion` = col_character(),
.. `Date Egg` = col_date(format = ""),
.. `Culmen Length (mm)` = col_double(),
.. `Culmen Depth (mm)` = col_double(),
.. `Flipper Length (mm)` = col_double(),
.. `Body Mass (g)` = col_double(),
.. Sex = col_character(),
.. `Delta 15 N (o/oo)` = col_double(),
.. `Delta 13 C (o/oo)` = col_double(),
.. Comments = col_character()
.. )
Both views show you the variable names, data types, and what the data in the first few rows looks like.
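As a minimal sketch of these first checks, the snippet below builds a small made-up data frame (toy is hypothetical, standing in for whatever you just imported), confirms its dimensions, and then inspects the types with str(). glimpse() from dplyr shows the same information in a different layout.

```r
# Hypothetical stand-in for a freshly imported data set
toy <- data.frame(
  Species = c("Adelie", "Adelie", "Gentoo"),
  Island  = c("Torgersen", "Torgersen", "Biscoe"),
  mass_g  = c(3750, NA, 5000),
  stringsAsFactors = FALSE
)

# Does it have the expected number of rows and columns?
dim(toy)   # rows first, then columns
nrow(toy)
ncol(toy)

# Variable names, data types, and the first few values
str(toy)
```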
You can also look at the data type for a single variable at a time.
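For example, class() reports how R is storing one column. The data frame here is a made-up stand-in, not the course data.

```r
# Hypothetical two-column data frame
toy <- data.frame(
  Species = c("Adelie", "Adelie", "Gentoo"),
  mass_g  = c(3750, NA, 5000),
  stringsAsFactors = FALSE
)

class(toy$Species)      # "character"
class(toy$mass_g)       # "numeric"
is.numeric(toy$mass_g)  # TRUE
```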
Confirm that these data types match what the codebook says to expect.
Use table() for categorical variables to see the frequency of each unique level of that variable. Note any levels with fewer than 10 observations, and any variables with more than 10 levels.
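A small sketch of that check, using a made-up island vector (the counts are invented for illustration):

```r
# Hypothetical categorical variable
island <- c(rep("Biscoe", 12), rep("Dream", 15), rep("Torgersen", 4))

# Frequency of each level; useNA = "ifany" also counts missing values
counts <- table(island, useNA = "ifany")
counts

# Levels with fewer than 10 observations
names(counts)[as.vector(counts) < 10]

# Number of distinct levels (worth a second look if more than 10)
length(counts)
```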
You can use summary() on numeric variables to see the range of values present. Note the number of missing values, the spread of the data, and any out-of-range values (e.g. 99 or 7 on a binary yes/no variable that should only be 0/1).
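The sketch below applies this to an invented yes/no variable (smoker is hypothetical) whose codebook would allow only 0 and 1:

```r
# Hypothetical binary variable with two stray codes and one missing value
smoker <- c(0, 1, 1, 0, 99, NA, 7, 1)

# Range, quartiles, and the count of NAs
summary(smoker)

# Number of missing values
sum(is.na(smoker))

# Values that are present but not allowed by the codebook
bad <- smoker[!is.na(smoker) & !(smoker %in% c(0, 1))]
bad
```

Codes like 99 and 7 usually stand for "missing" or "refused" in the codebook and would need to be recoded to NA before analysis.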
Questions to ask yourself (and the data) while reviewing the codebook to choose variables to be used in an analysis.
janitor::clean_names() function to resolve this.
Some of these answers will come only after you look at your data. This can be looking at the raw data itself but also looking at tables and charts generated from the data.
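As a sketch of what the janitor::clean_names() function mentioned above does, the snippet below applies it to a made-up data frame whose column names contain spaces and parentheses, like those in the raw penguin file. It assumes the janitor package is installed; raw_like is a hypothetical name.

```r
library(janitor)  # assumed to be installed

# Hypothetical data frame; check.names = FALSE keeps the awkward names as typed
raw_like <- data.frame(
  `Body Mass (g)` = c(3750, 3800),
  Species = c("Adelie", "Adelie"),
  check.names = FALSE
)
names(raw_like)

# clean_names() converts the names to lowercase snake_case
cleaned <- clean_names(raw_like)
names(cleaned)   # "body_mass_g" "species"
```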
Often, when you try to create a plot or table, you will hit an error or see something odd-looking. That is your signal that something needs to be adjusted.
R project or searching for package vignettes.
You will start to build up your repertoire of bookmarked sites you like to use.
Example prompt
Using the palmer penguins dataset in R write me R code to change the factor order of the islands variable to put Dream first
What did it give?
Sure, here’s an example of how you can change the factor order of the island variable in the Palmer Penguins dataset to put “Dream” first using R:
# Load necessary packages
library(palmerpenguins)

# Load the dataset
data("penguins")

# Check the current levels of the 'island' factor
levels(penguins$island)

# Reorder the levels of the 'island' factor to put "Dream" first
penguins$island <- factor(penguins$island, levels = c("Dream", "Torgersen", "Biscoe"))

# Check the new order of levels
levels(penguins$island)
In this code, we’re first loading the palmerpenguins package, which provides access to the dataset. Then we load the dataset itself using data("penguins"). We check the current levels of the island factor using levels(penguins$island).
Next, we use the factor() function to reorder the levels of the island variable. We specify the desired order of levels using the levels argument. In this case, we put “Dream” first, followed by the other islands. Finally, we check the new order of levels to make sure the change was successful.
Not only did it write the code, it wrote literate code. It has code comments (lines 1, 4, 7, prefixed with a #), and a full text explanation.
This is GREAT for learning how to write code!
Plagiarism warning
Read the syllabus for what is expected of you if you use this tool to aid in your writing: https://math615.netlify.app/syllabus#use-of-ai
⚠️ Trust but verify! AI is not always correct! Also, this does not replace the necessity of you learning.
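One way to verify generated code like the factor reordering above is to confirm that only the level order changed and the values did not. The sketch below uses a small made-up factor in place of penguins$island so it runs without the package. This matters because factor() silently turns any value not listed in levels into NA, so a misspelled island name would quietly delete data.

```r
# Toy stand-in for penguins$island, including one missing value
island <- factor(c("Biscoe", "Dream", "Torgersen", "Dream", NA, "Biscoe"))

# The reordering step suggested by the AI tool
reordered <- factor(island, levels = c("Dream", "Torgersen", "Biscoe"))

# 1. The level order is what was asked for
levels(reordered)

# 2. The values themselves are unchanged
identical(as.character(island), as.character(reordered))

# 3. No new NAs were introduced by a misspelled level
sum(is.na(reordered)) == sum(is.na(island))

# 4. A cross-tabulation of old vs new should be non-zero only where labels match
table(island, reordered, useNA = "ifany")
```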
.Rdata file to maintain your factor orderings
select() to drop variables that you don’t want anymore
dm file, it will overwrite the cleaned data with the same name - thus ensuring that your analysis data set is always programmatically connected to your raw data set, and all choices documented.
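These closing steps might look like the sketch below. The data frame, the scratch column, and the temporary file path are all hypothetical. In a real project the path would point into your own data folder. dplyr is part of the tidyverse, which these notes assume is installed.

```r
library(dplyr)

# Hypothetical cleaned data with a deliberately ordered factor
clean <- data.frame(
  island  = factor(c("Dream", "Biscoe"),
                   levels = c("Dream", "Torgersen", "Biscoe")),
  mass_g  = c(3750, 3800),
  scratch = c("x", "y")
)

# Drop a variable that is no longer needed
clean <- select(clean, -scratch)

# Save as .Rdata; a temporary file stands in for a path in your project
out_file <- tempfile(fileext = ".Rdata")
save(clean, file = out_file)

# Reload and confirm the factor ordering survived the round trip
rm(clean)
load(out_file)
levels(clean$island)
```

A CSV file stores only text, so the level order would be lost on re-import, which is why the .Rdata format is used here.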