Research compendia as R packages

2023-07-20

Description

A research compendium is a collection of all the digital parts of your research project (data, code, documents), which often has to be published alongside a scientific paper. The structure and tools of R packages provide a standard workflow for building such reproducible research compendia. In this lecture, you will learn how to build, test and publish your research compendia in the form of an R package. This lecture is mainly aimed at people who have some experience with R, RStudio and Git. But even if you are new to these topics, you will still get a valuable overview of key concepts for reproducible research compendia and tools to build them.


How to build your own research compendium - Step by step

If you find any issues or problems in this how-to, feel free to let me know via email or create an issue on GitHub. Thanks 🙏

Preparation

Before you start with the how-to, please check that you have the following things set up and running.

R packages needed

First, you need to install some packages. Some of them are essential for the core workflow, others are optional for some more advanced features that are described later in the how-to.

You can run the following code to install the packages that you still need:

packages_essential <- c("devtools", "usethis", "roxygen2")
# install the essential packages
install.packages(setdiff(packages_essential, rownames(installed.packages())))

packages_optional <- c("goodpractice", "testthat", "piggyback")
# install the optional packages
install.packages(setdiff(packages_optional, rownames(installed.packages())))
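
To see at a glance which of these packages are still missing before you install anything, a small check like the following can help. This is only a sketch; the variable names are illustrative and not part of any package.

```r
# Packages used in this how-to (essential and optional)
packages_all <- c("devtools", "usethis", "roxygen2",
                  "goodpractice", "testthat", "piggyback")

# TRUE for every package that is already installed, FALSE otherwise
is_installed <- packages_all %in% rownames(installed.packages())
names(is_installed) <- packages_all

# Names of the packages that still need to be installed
packages_missing <- packages_all[!is_installed]

print(is_installed)
```

Anything listed in packages_missing will be picked up by the install.packages() calls above.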

Set up Git and GitHub

You only need this if you want to use version control with your research compendium, publish it on GitHub and use GitHub Actions for automated testing of your package. Otherwise, you can skip this step and all the steps in the following guide that involve Git and GitHub.

To set everything up:

Download and install Git on your machine
Create a GitHub Account

Personal Access token to use with R

Now you can create a personal access token (PAT) and store it where R can find it. The token authenticates you with your GitHub account whenever you communicate with GitHub from R.

To set up your PAT, call:

usethis::create_github_token()

This will open GitHub in the browser (you might need to enter your password).

At the top, enter a note on what you want to use the token for (e.g. RStudio on my work computer). Then select an appropriate expiration date (90 days is fine in most cases - GitHub will send you an email when the token is about to expire so you can renew it).

Scroll over the scopes, select ‘repo’, ‘user’, ‘workflow’ (usually this is already the default setting) and then click on ‘Generate token’ at the bottom.

Leave the browser page open, copy the generated token to your clipboard and go back to RStudio. Store your PAT in the Git credential store (don’t do this if you share your computer with other people who could then access your GitHub account) by calling:

gitcreds::gitcreds_set()

Follow the prompt in the console and enter your PAT. Now you should be able to work with GitHub.
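
If you want to verify that the token can actually be found, the following sketch wraps two reporting functions from usethis in a small helper. The helper name check_github_setup is hypothetical and not part of usethis; both calls only report on your setup and do not change anything.

```r
# Hypothetical helper (not part of usethis): reports on the Git/GitHub setup
# without modifying it.
check_github_setup <- function() {
  # Reports whether a PAT can be found and gives advice if it cannot
  usethis::gh_token_help()
  # Prints an overview of the Git and GitHub configuration
  usethis::git_sitrep()
  invisible(NULL)
}
```

Call check_github_setup() in the console after storing your token and read through the output for anything flagged as missing.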

Install Rtools (only on Windows)

If you are on a Windows machine and you want to build R packages, you need to install Rtools, which contains the toolchains for building packages.

You can check if you already have Rtools installed with

# Returns TRUE if Rtools is correctly installed
devtools::find_rtools()

If you need to install it, you can download Rtools from the Rtools page on CRAN. Choose the appropriate version for your version of R and follow the installation process.

Step 1: Create an empty R package

Think of a name for your R package that respects the following rules:

only ASCII letters, numbers and ‘.’
at least two characters
starts with a letter
does not end with ‘.’

In this example, I chose temperatureBerlin because my example package contains some temperature analysis.
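
The naming rules above can be expressed as a single regular expression. The following is a minimal sketch of such a check; the function name is_valid_package_name is made up for illustration.

```r
# Hypothetical helper: checks a candidate name against the rules listed above.
# - first character: an ASCII letter
# - middle characters: ASCII letters, numbers or '.'
# - last character: a letter or number (so the name cannot end with '.')
# Requiring a first and a last character enforces the two-character minimum.
is_valid_package_name <- function(name) {
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", name)
}

candidates <- c("temperatureBerlin", "temp.berlin", "2temp",
                "temp_berlin", "t", "temp.")
name_check <- is_valid_package_name(candidates)
names(name_check) <- candidates
print(name_check)
```

Only the first two candidates pass: the others start with a number, contain an underscore, are too short, or end with a dot.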

Create your package by running

usethis::create_package("path/to/package/temperatureBerlin")

This will create a new folder, initialize an R package project in there and open the project in a new RStudio session.

Have a look at your files pane (usually bottom right in RStudio) to see which files were already added.

During package development

Check out the options that you have in the Build tab in RStudio (you might have to restart RStudio to see it). You can do everything described below either via the buttons in RStudio or by calling functions. During the entire package building process the devtools package is your friend and you can (and should)

frequently check your package by clicking the Check button or by calling devtools::check() (Keyboard shortcut Ctrl/Cmd + Shift + E)
test your functions after running devtools::load_all()
- this simulates installing and loading the package
- much better than testing your functions in your current environment
build the documentation with devtools::document() (Ctrl/Cmd + Shift + D)
- you have to do this once in the beginning, afterwards devtools::check() will also build the documentation for you
install and load the package and restart R with Ctrl/Cmd + Shift + B or with the Install button

You can already try this out now, because your project is already an (empty) R package.

Step 2: Initialize a Git repository

To initialize a Git repository for your package, call

usethis::use_git()

Follow the prompts in the console, commit your first files with the commit message ‘Initial commit’ and restart RStudio.

Step 3: Create a research compendium

Fill out the DESCRIPTION file

The DESCRIPTION file contains important metadata about your package and describes which other packages it needs to run correctly. If an R project contains a DESCRIPTION file, RStudio and devtools will consider it an R package.

There are a few conventions on how to fill out the fields of the DESCRIPTION file. Have a look at the R packages book to find out more about this.

For now, just look at how usethis already created a template for you. Fill out the dummy fields of the description file, e.g.

Package: temperatureBerlin
Type: Package
Title: Analyse Temperature Data From Berlin
Authors@R: 
    person(given = "Selina",
           family = "Baldauf",
           role = c("aut", "cre"),
           email = "selina.baldauf@fu-berlin.de")
Description: Analysis of the temperature data collected in Berlin from our
    research project xyz presented in paper abc.
License: `use_mit_license()`, `use_gpl3_license()` or friends to
    pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
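
DESCRIPTION is a plain text file of "Field: value" pairs, so you can also inspect it from R. The sketch below writes a small stand-in DESCRIPTION to a temporary folder (so your real file is not touched) and reads it back with base R's read.dcf(). The version number is an assumed placeholder.

```r
# Write a minimal stand-in DESCRIPTION file to a temporary location
desc_file <- file.path(tempdir(), "DESCRIPTION")
writeLines(c(
  "Package: temperatureBerlin",
  "Title: Analyse Temperature Data From Berlin",
  "Version: 0.0.0.9000"  # assumed placeholder version
), desc_file)

# read.dcf() returns a character matrix with one column per field
fields <- read.dcf(desc_file)
print(colnames(fields))
print(fields[1, "Package"])
```

In your own compendium you would point read.dcf() at the DESCRIPTION file in the project root instead.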

Add a license

Use the usethis package to add a license, e.g. for the GNU General Public License 3 (gpl-3) license you can use

usethis::use_gpl3_license()

This adds a file LICENSE.md with your name (or the name(s) specified in the DESCRIPTION file) in it to the project and automatically updates the license field in DESCRIPTION.

Add data

The correct folder for datasets you want to make available in your research compendium is the data/ folder. However, the convention of R packages is that you store the data there as binary .rda files (R's own data format, also known as .RData). All data in other formats (e.g. csv files) can go into the data-raw/ folder, where it can be pre-processed and turned into .rda files.

Create a dataset with R

We first create a data-raw/ folder either manually or with

usethis::use_data_raw("temperature")

This will create a temperature.R file in data-raw/ which you can use to pre-process your raw data and then add it to the data/ folder as an .rda file.

In this R script, you can either create a fake data set using only R code, or you can read a csv file (or other data file), that you put into the data-raw/ folder before.

At the end of the script, you call the usethis::use_data() function to compress your data and store it as an .rda file in your package.

So, you can just put something like this in the temperature.R file and then run the script:

# create a dummy dataset for temperature measurements
temperature <- data.frame(id = 1:10, temp = rnorm(10, mean = 15, sd = 2))
# or alternatively, if you have temperature data as csv in data-raw/,
# read it instead (uncomment the next line)
# temperature <- readr::read_csv("data-raw/temperature.csv")

# compress and store the data as .rda file in data/
usethis::use_data(temperature, overwrite = TRUE)

Now have a look at the data/ folder in your project, where you will find a temperature.rda version of the data.
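
To get a feeling for what happens to the data, here is a rough base R sketch of the same idea: saving a data frame as a compressed .rda file and loading it back. It writes to a temporary folder rather than to data/, and it is an illustration only, not the actual implementation of usethis::use_data().

```r
# Dummy dataset as in the script above (seed set only for reproducibility)
set.seed(42)
temperature <- data.frame(id = 1:10, temp = rnorm(10, mean = 15, sd = 2))

# Save the object under its own name in a compressed .rda file
rda_file <- file.path(tempdir(), "temperature.rda")
save(temperature, file = rda_file, compress = "bzip2")

# Load it into a separate environment to check the round trip
restored <- new.env()
load(rda_file, envir = restored)
print(identical(restored$temperature, temperature))
```

The file stores the object together with its name, which is why the dataset is later available as temperature when the package is loaded.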

Add an analysis function

All R functions for your analysis go in the R/ folder. You can group related functions together in the same script but unrelated functions should go into separate scripts.

Here, we create a simple function that prints the mean of a vector. We save this function in a new R script called printMean.R. To create this script in the correct location, you can conveniently run

usethis::use_r("printMean")

This will create, save and open the script for you.

Now add the following simple function to the script:

# a simple printer function for the mean of a vector
printMean <- function(x){
  print(paste0("The mean value is ", mean(x, na.rm = TRUE), "!"))
}
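
As a quick illustration of how the function behaves, the sketch below repeats the definition so that it runs on its own and calls it on a vector with a missing value. Because of na.rm = TRUE, the NA is ignored when the mean is calculated.

```r
# Same function as above, repeated so this example is self-contained
printMean <- function(x) {
  print(paste0("The mean value is ", mean(x, na.rm = TRUE), "!"))
}

# The NA is dropped, so the mean is taken over 1, 2 and 3.
# print() returns its argument, so the sentence is also stored in `result`.
result <- printMean(c(1, 2, 3, NA))
```

The returned string is what makes it possible to check the function in a unit test later on.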

If you forgot to do it, you can run devtools::check() now to see if everything is ok in your package. You will get a warning that you did not document your data and your function yet. This is what we will do in the next step.

Document function and data with `roxygen2`

You can document your functions and data using the package roxygen2. Roxygen2 will automatically generate .Rd documentation files in the man/ folder and update the NAMESPACE if you import or export functions. Dependencies on other packages still have to be declared in the DESCRIPTION file, for example with usethis::use_package().

The basic workflow is:

insert a roxygen skeleton for your function, either via the RStudio menu Code | Insert Roxygen Skeleton or with the keyboard shortcut Ctrl + Alt + Shift + R (on Windows)
call devtools::document() or use Ctrl/Cmd + Shift + D to generate the documentation in the man/ folder and update the NAMESPACE

To create a documentation for our simple printMean function, navigate to the respective R script, click inside the function and add a roxygen skeleton.

Fill out the skeleton by adding all the information about the function using the roxygen tags. E.g. your roxygen description could look like the following:

#' @title Print mean values in a sentence
#' @description This function calculates the mean of its input and prints it 
#'    into a sentence. The function is called in the analysis.Rmd file
#' @export
#' @param x A vector of numeric values of which to calculate the mean
#' 
#' @return The output of \code{\link{print}} after calculating the 
#'    \code{\link{mean}} and pasting it into a sentence
#'
#' @examples
#' printMean(c(1,2,3))
#'

To document datasets in your compendium, create a new R script with the same name as your dataset in the R/ directory (in our case temperature.R, which you can create with usethis::use_r("temperature")). In this script, describe your dataset in a roxygen block, followed by the name of the dataset as a string. As a bare minimum, this could look like the following:

#' @title Temperature data from place X
#' @description Temperature measurements (2m height) measured on March 21st, 
#'     2009 at place x
#' @format A \code{data.frame} with 2 columns, which are:
#' \describe{
#' \item{id}{id of the sensor}
#' \item{temp}{temperature in °C}
#'}
"temperature"

Unfortunately, for data sets you can’t automatically insert a preset roxygen skeleton.

Tip

If you are too lazy to document your dataset by hand, you can use the sinew package which is very convenient with package documentation using roxygen2. Install the package with

# install.packages("remotes")
remotes::install_github('yonicd/sinew')

and then run

sinew::makeOxygen(obj = temperature)

This will print you a perfect, pre-made skeleton for the data to the console, and you can just copy paste it to your temperature.R file and then fill out the placeholders.

If you now run devtools::document() or rebuild your package, you will find the automatically generated read-only files printMean.Rd and temperature.Rd in the man/ folder of your research compendium. If you now call ?printMean or ?temperature (after installing and restarting), the help file opens just like for any other R function you know.

Please also have a look at the conventions and guidelines on roxygen2-documentation and the quick reference of the most common tags.

Add an analysis script

We will now add an analysis script in the form of a Quarto document (could also be an R Script instead). Here we write some analysis using the data and functions that we just defined.

Add an analysis/ folder with a Quarto document report.qmd to your compendium. This can for example look like this:

---
title: "Temperature Analysis"
author: "Selina Baldauf"
date: "18 November, 2025"
---

```{r setup, include=FALSE}
devtools::load_all(".")
```

# This is the temperature analysis

```{r}
printMean(temperature$temp)
```

With the devtools::load_all(".") command, we tell the document to first load all functions included in the package before running the code chunks.

Step 4: Version control with Git and GitHub

Commit all changes

It’s time to commit all the changes to our package. There are many options how to do this. If you already use Git, use your personal favorite.

One simple option is to look for the Git pane (next to Build) in RStudio, stage all created files (by ticking their checkboxes) and then click Commit. You are asked to enter a commit message and then you can click on ‘Commit’ to actually make the commit.

Alternatively, you could do this in the terminal:

git add .
git commit -m "First package version"

Make the compendium a GitHub repository

To create a GitHub repository for your package, you can conveniently use the usethis package:

usethis::use_github()

Follow the prompts in the console and this will initialize a new GitHub repository from your compendium and add the URL to DESCRIPTION.

If you do this for the first time, you may run into some issues if your setup is not complete yet. But the usethis package will help you through all the steps with its error messages. One thing you definitely need for this step to work is a PAT (personal access token) for GitHub that R can find. Check whether you have one with usethis::gh_token_help() or create one with usethis::create_github_token(). You can find a description of how to get one in the beginning of this how-to.

Step 5: Add a README to your GitHub project

Add a README.md to your package that will be rendered on the webpage of your GitHub repository.

usethis::use_readme_md()

In there, you can describe your package and add information on how to download and get started with it. Now you have to commit and push your changes to GitHub for them to take effect. You can use the same commit strategy as described above for RStudio. Then you can click on the green push button to publish your changes to GitHub.

Now your package is published to GitHub and can be downloaded and installed by others 🎉

Below you find some extra steps you can take to include more functionality in your package.

Extras

How to include large external data sets with `piggyback`

If you use an R package as a research compendium, you might also want to share large datasets along with it. The problem is that you cannot manage those large files by tracking them with Git and committing them directly to GitHub. However, GitHub allows you to attach files of up to 2 GB each to every release of your project. The piggyback package takes advantage of this and allows you to create GitHub releases and to upload and download large datasets associated with your compendium.

First, we simulate a large data file by creating a dataset in our data-raw/ folder (if you have really large files, you would not save them as .txt but rather compress them first).

largeDataSet <- data.frame(id = 1:1e6, temp = rnorm(1e6,15,2))
write.table(largeDataSet, "data-raw/largeDataSet.txt")
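
If you want to compress such a file before attaching it to a release, one option in base R is to write it through a gzfile() connection. The following is a minimal sketch under that assumption; it uses a small stand-in dataset and a temporary folder so that it runs quickly and does not touch your project.

```r
# Small stand-in for the large dataset (1000 rows instead of 1e6)
largeDataSet <- data.frame(id = 1:1000, temp = rnorm(1000, 15, 2))

# Write the table through a gzip connection to get a compressed text file
gz_file <- file.path(tempdir(), "largeDataSet.txt.gz")
con <- gzfile(gz_file, "w")
write.table(largeDataSet, con)
close(con)

# read.table() can read the gzip-compressed file directly
restored <- read.table(gz_file)
print(dim(restored))
```

The compressed file can then be uploaded in the same way as the plain .txt file.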

To use the piggyback package, you first need to install it on your machine with install.packages("piggyback"). To upload our dataset, we first need to create a new release on GitHub. The first release has to be created on the GitHub website, by clicking on Create a new release, giving it a Tag (e.g. v0.0.1) and a title. You will see that the release contains your source code as zip and as a tarball (.tar.gz).

With piggyback you can always list the files attached to your existing releases by running

piggyback::pb_list()

Now that you created your first release, you can create new releases directly from R with:

piggyback::pb_new_release(tag = "v0.0.2")

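To track your dataset (which means that every time the dataset changes you can create a new release), older versions of piggyback offer the call below. Note that pb_track() has been deprecated, so it may not be available in your installed version; in that case, skip this step.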

piggyback::pb_track("data-raw/largeDataSet.txt")

To piggyback your dataset to your package, you now simply have to run

piggyback::pb_upload("data-raw/largeDataSet.txt")

Now you can find it attached to your latest release.

If you want to use the large dataset in your package for the analysis or to generate the report, you can write a function to download the data using piggyback::pb_download. Just create a new file loadData.R in your R/ directory and add the following function.

loadData <- function(){
  piggyback::pb_download(repo = "selinaZitrone/temperture")
}

When this function is called, it will automatically download all the assets attached to the latest release of the repository and put them in data-raw/. Now the data can be accessed by all the package functions. Have a look at ?piggyback::pb_download to see which parameters you can specify. You don’t have to download all the assets and you can also choose which version of the release should be used for the download.

You can also create new releases directly from piggyback and you can track large datasets by creating new releases if the datasets change. This will not be covered here, but if you are interested have a look at the piggyback documentation.

Unit testing

With unit testing, you can test the individual functions in your package to make them more reliable. If you want to read about unit testing, you can check out the testing chapter in the Rpkg book.

First, create the infrastructure needed by the testthat package.

usethis::use_testthat()

This creates a tests/ folder which contains a testthat/ folder and an R script testthat.R. You do not need to change anything in the testthat.R script. It is called e.g. by rcmdcheck::rcmdcheck() to run all the tests you created in your package. The testthat/ folder is the place to store the R scripts with your actual tests. Also, the testthat package is automatically added to the Suggests field in DESCRIPTION.

Now you can add a test script that tests your simple printer function. To create a test script for a specific function, open the R-script with the function and then initialize a basic test file for the script by calling

usethis::use_test()

The test file can be found in tests/testthat/. Test files have to be named test-xyz.R in order to be found (usethis::use_test() does the naming for you - so you don’t have to worry about it).

We can write a very simple test for our printMean function that checks if it works correctly:

test_that("printer function works", {
  expect_equal(printMean(c(1,2,3)), "The mean value is 2!")
})
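
A second test could cover missing values, since printMean() calls mean() with na.rm = TRUE. The sketch below assumes that testthat is installed and repeats the function definition so that it runs on its own; inside tests/testthat/ the package function is already available and the copy is not needed.

```r
library(testthat)

# Copy of printMean() so that this sketch is self-contained
printMean <- function(x) {
  print(paste0("The mean value is ", mean(x, na.rm = TRUE), "!"))
}

test_that("printMean ignores missing values", {
  # the sentence is printed to the console ...
  expect_output(printMean(c(1, NA, 3)), "The mean value is 2!", fixed = TRUE)
  # ... and also returned as a string
  expect_equal(printMean(c(1, NA, 3)), "The mean value is 2!")
})
```

Testing both the printed output and the return value documents the two things the function does.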

If you run the test and you don’t get any errors in the console, then it means that your test was successful. If not you will get a helpful error message telling you what went wrong.

If you want to run all the tests from your package at once to see if they pass, you can run

devtools::test()

Implement a continuous integration workflow with GitHub Actions

GitHub Actions allows you to run code on GitHub servers every time you push something there. This can e.g. be used to run R CMD check on your package. This way you can automatically make sure that your package builds without errors. If it does not, you will get notified by GitHub.

Before you implement this, commit all the recent changes and push them to GitHub.

Then create the basic file structure needed to set up a workflow by calling

# in older usethis versions: usethis::use_github_actions()
usethis::use_github_action("check-release")

This creates a .github folder, in which you find a .gitignore file and a workflows/ folder. In this folder you can store all your GitHub actions in .yaml file format. usethis automatically added a first GitHub action called R-CMD-check.yaml. This action is triggered on every push or pull request on the master/main branch and checks your package to make sure that new code does not break the package.

If you want to add other actions that already are available (e.g. from the GitHub repository r-lib/actions) you can do the following:

usethis::use_github_action("render-readme.yaml")

This adds the respective yaml-file to .github/workflows. Of course you can also write your own actions and simply add them to the workflows/ folder.

Add an RStudio server with `holepunch`

There is a nice intro for holepunch: Getting started with holepunch. The code below is copied’n’pasted from the holepunch repo’s readme.

remotes::install_github("karthik/holepunch")

holepunch::write_dockerfile(maintainer = "your_name") 
# To write a Dockerfile. It will automatically pick the date of the last 
# modified file, match it to that version of R and add it here. You can 
# override this by passing r_date to some arbitrary date
# (but one for which a R version exists).

holepunch::generate_badge() # This generates a badge for your readme.

# ----------------------------------------------
# At this time 🙌 push the code to GitHub 🙌
# ----------------------------------------------

# And click on the badge or use the function below to get the build 
# ready ahead of time.
holepunch::build_binder()
# 🤞🚀

Some more tips

You can use devtools::spell_check() to test your entire package for spelling errors. This is very useful if you have reports or other text-intensive files in there. By default, the spell check uses American English. However, you can change the language by adding a Language field to your DESCRIPTION file.
If you want to write a new function and generate the corresponding R script in the R/ directory, you can conveniently call usethis::use_r("functionName"). This will create a file functionName.R in the R/ directory and open it for you.
Use the goodpractice package to further test your package for parts which can be improved. Running goodpractice::gp() will tell you where your comment lines are too long, where you should consider renaming variables, if something is wrong with your NAMESPACE, …

Links and further reading

The Turing Way website is a very useful guide to reproducible data science

About research compendia

A guide to making your data analysis more reproducible: Nice talks and other helpful links and information on research compendia by Karthik Ram
The Turing Way guide’s section on research compendia
Resource collection on research compendia

R packages

Comprehensive book on how to create an R package by Hadley Wickham and Jennifer Bryan
usethis: workflow package to automate tasks like project setup etc.
pkgdown: build a quick and easy website for your package
goodpractice: gives you advice about good practices when building R packages

Git and GitHub

Book on using Git and GitHub with R and RStudio
Getting started with GitHub Actions
Talk by Jim Hester on GitHub Actions with R
GitHub Actions examples with R

Reproducible environments

The Holepunch package for R and Binder (adds a live, on-demand RStudio server to your R project on GitHub/GitLab)
Introduction to Docker for R

Scientific publications

James, Justin. 2012. “10 Classic Mistakes That Plague Software Development Projects.” 2012. https://www.techrepublic.com/blog/10-things/10-classic-mistakes-that-plague-software-development-projects/.

Scheller, Robert M, Brian R Sturtevant, Eric J Gustafson, Brendan C Ward, and David J Mladenoff. 2010. “Increasing the Reliability of Ecological Models Using Modern Software Engineering Techniques.” Frontiers in Ecology and the Environment 8 (5): 253–60. https://doi.org/10.1890/080141.

Baxter, Susan M., Steven W. Day, Jacquelyn S. Fetrow, and Stephanie J. Reisinger. 2006. “Scientific Software Development Is Not an Oxymoron.” PLoS Computational Biology 2 (9): 0975–78. https://doi.org/10.1371/journal.pcbi.0020087.

Merali, Zeeya. 2010. “Computational Science: …Error.” Nature 467 (7317): 775–77. https://doi.org/10.1038/467775a.

Storer, Tim. 2017. “Bridging the Chasm: A Survey of Software Engineering Practice in Scientific Programming.” ACM Computing Surveys 50 (4). https://doi.org/10.1145/3084225.

Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” American Statistician 72 (1): 80–88. https://doi.org/10.1080/00031305.2017.1375986.