2025-09-08
Once the data are available from a study, there are still a number of steps that must be undertaken to get them into shape for analysis.
One of the most misunderstood parts of the analysis process is the data preparation stage. To say that 70% of any analysis is spent on the data management stage is not an overstatement.
Fig ref: Updated from Grolemund & Wickham’s classic R4DS schematic, envisioned by Dr. Julia Lowndes for her 2019 useR! keynote talk and illustrated by Allison Horst.
Reproducibility is the ability of any researcher to take the same data set, run the same set of software program instructions as another researcher, and achieve the same results.
This is not the same as replicability, where you re-run an experiment and achieve the same outcomes.
The goal is to create an exact record of what was done to a data set to produce a specific result.
Figure Credits: Roger Peng
(.R, .Rmd, .sas, .sps, .do, .ipynb)
In this model of the data science process, you start with data import and tidying. Next, you understand your data with an iterative cycle of transforming, visualizing, and modeling. You finish the process by communicating your results to other humans. Ref: R for Data Science, 2nd ed.
Regardless of the programming language you choose to use, using scripts will make this process reproducible and more powerful, with fewer pain points.
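
As a rough illustration of what a scripted workflow might look like, here is a minimal Python sketch of the import, tidy, and summarize steps. Everything specific in it is an assumption made up for the example: the data values, the column names (`id`, `age`, `weight_kg`), the missing-value codes (blank and `-99`), and the function names. The data are embedded in the script only so that it runs on its own; in practice they would be read from a file in your project folder. The same idea carries over to R or any other scripting language.

```python
import csv
import io
import statistics

# Hypothetical raw data, embedded so the sketch is self-contained.
RAW_CSV = """id,age,weight_kg
1,34,70.5
2,-99,82.0
3,51,
4,29,64.2
"""

# Assumed codes that mark a value as missing in the raw file.
MISSING_CODES = {"", "-99"}


def import_data(text):
    """Import: read the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows):
    """Tidy: recode missing-value codes to None and convert text to numbers."""
    clean = []
    for row in rows:
        age = row["age"]
        weight = row["weight_kg"]
        clean.append({
            "id": int(row["id"]),
            "age": None if age in MISSING_CODES else int(age),
            "weight_kg": None if weight in MISSING_CODES else float(weight),
        })
    return clean


def summarize(rows):
    """Summarize: mean of each variable over its non-missing values."""
    ages = [r["age"] for r in rows if r["age"] is not None]
    weights = [r["weight_kg"] for r in rows if r["weight_kg"] is not None]
    return {
        "mean_age": statistics.mean(ages),
        "mean_weight_kg": statistics.mean(weights),
    }


def run_pipeline(text):
    """Run every step in order, so the script itself is the record of what was done."""
    return summarize(tidy(import_data(text)))


result = run_pipeline(RAW_CSV)
print(result)
```

Because each data management decision (which codes count as missing, which rows enter each mean) is written down in the script, anyone who runs it on the same raw data should get the same summary, which is the sense of reproducibility described above.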

Required
Using R Projects is a required part of this class. Spend a few minutes turning your Math 615 folder into an R project now.