Before we start

Before you start the how-to, please check that you have the following things set up and running:

GitHub Account
download and install git on your machine
install the following R-packages (if you don’t have them already): devtools, usethis, testthat
add a PAT (personal access token) for GitHub to your .Renviron file.
- find a good step by step guide here
- or run usethis::browse_github_token() in the console for a step by step guide

Step 1: Create an empty R-package

There are different ways to create a package in R. We based this How-To on the second method using the usethis package.

Option 1: Using the RStudio GUI

Go to File | New Project | New Directory | R Package.

Give your package a name that is accepted as an R-package name:

only ASCII letters, numbers and ‘.’
at least two characters
start with a letter
not end with ‘.’

Make sure to check the Create a git repository checkmark to initialize the package with a git repository and a .gitignore file.

If you accidentally created the package without a git repository you can either use usethis::use_git() or you can initialize an empty git repository from the terminal git init.

Option 2: Using `usethis::create_package()`

Another possibility is to create a package by running usethis::create_package("path/to/package/packageName"). If you leave all the other function parameters at their default value, this command will create a new directory (if it doesn’t already exist) and initialize an R package in that directory. Compared to the GUI method, this R package by default is already configured to using the roxygen2 package for package documentation (see later) and will not have any dummy files in it.

However, with this method you have to seperately initialize a git repository by running usethis::use_git() or git init from the terminal.

During package development

Checkout the options that you have in the Build tab in RStudio. You can do everything described below via the GUI or by calling R functions.

During the entire package building process the devtools package is your friend and you can (and should)

frequentyl check your package by hitting the Check button in the Build panel or by calling devtools::check() (Keyboard shortcut Ctrl/Cmd + Shift + E)
test your functions after running devtools::load_all()
- this simulates installing and loading the package
- much better than testing your functions in your current environment
build the documentation with devtools::document() (Ctrl/Cmd + Shift + D)
- you have to do this once inthe beginning, afterwards devtools::check() will also build the documentation for you
install and load the package and restart R with Ctrl/Cmd + Shift + B

Step 2: Create a research compendium

Fill out the DESCRIPTION file

The DESCRIPTION file contains important metadata about your package and describes which other packages it needs to run correctly. If an R project contains a DESCRIPTION file, RStudio and devtools will consider it an R package.

Fill out the fields of the description file, e.g.

Package: YoMos2020
Type: Package
Title: Analyse Temperature Data From Somewhere
Authors@R: 
    person(given = "Selina",
           family = "Baldauf",
           role = c("aut", "cre"),
           email = "selina.baldauf@fu-berlin.de")
Description: Analysis of the temperature data collected in Berlin from our
    research project xyz presented in paper abc.
License: `use_mit_license()`, `use_gpl3_license()` or friends to
    pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.0

There are a few conventions on how to fill out the fields of the DESCRIPTION file. Please have a look in the R packages book to find out more about this file. Also have a look at how usethis created the template for you. This will already give you a hint on how to fill out the fields.

Add a license

If you want to add a specific license to your project you can use different functions from the usethis package, e.g. for the GNU General Public License 3 (gpl-3) license you can use

usethis::use_gpl3_license()

This adds LICENSE.md with your name (or the name(s) specified in the DESCRIPTION file) in it to the project and automatically updates the license field in DESCRIPTION.

Add data

A common place for datasets that you want to make available in your research compendium is the /data folder. The convention of R packages is that you should only story .RData files in there. Everything that is not in .RData format can go in the /data-raw folder.

Create a dataset with R

First create a /data-raw folder by calling usethis::use_data_raw(). This will also create a DATASET.R file in which you can create your data and then add it to the /data folder as an .RData file. By default these files are compressed using bzip2 but you can also chose other compressions by changing the compress parameter of the function. Just put the following lines in the DATASET.R file and run them:

# create a dummy dataset for temperature measurements
dummyTempData <- data.frame(id = 1:10, temp=rnorm(10, mean = 15, sd = 2))
usethis::use_data(dummyTempData, overwrite = TRUE)

Add an analysis function

Save all the functions you need for your analysis in the R/ folder. You can group related functions together in the same R-script but unrelated functions should go in separate scripts. Here, we create a simple printer function that prints the mean of a vector. We save this function in a new R-Script called printMean.R. To create the script, you can conveniently use usethis::use_r("printMean") which will create, save and open the script for you. Add the following simple function:

# a simple printer function for the mean of a vector
printMean <- function(x){
  print(paste0("The mean value is ", mean(x, na.rm = T), "!"))
}

Document function and data with roxygen2

You can document your functions and data using the package roxygen2. This allows you to describe your functions in the same place where you define them. Roxygen2 will automatically generate .Rd files in the /man folder, update the NAMESPACE if you import or export functions and update the DESCRIPTION file. The basic workflow is:

insert a roxygen skeleton above your function
- either via the RStudio GUI Code| Insert Roxygen Skeleton or with the keyboard shortcut Ctrl + Alt + Shift + R (on Windows)
call devtools::document() to generate the documentation in the /man folder and update the NAMESPACE

[If you created your package via the GUI and not with usethis::create_package() you need to configure the build tools to use roxygen2. In order to do this navigate to Build -> Configure Build Tools and hit the checkmark at Generate documentation with Roxygen. Click Configure and make sure all the checkmarks are checked so that roxygen generates all help files and the documentation is rebuild everytime your install and restart the package. Before this works correctly, you have to delete all alld documenation files that were not generated by roxygen.If you forget to do this, you will get a warning when your rebuild or check your package, so be careful to read the warnings if there are any. ]

To create a documentation for our simple printMean function, navigate to the respective R script and add a roxygen skeleton.

Now you can fill out the skeleton by adding all the information about the function that you want to appear in the help file for that function using tags. E.g. your roxygen description could look like the following:

#' @title Print mean values in a sentence
#' @description This function calculates the mean of its input and prints it into a sentence. The function
#' is called in the analysis.Rmd file
#' @export
#' @param x A vector of numeric values of which to calculate the mean
#' 
#' @return The output of \code{\link{print}} after calculating the \code{\link{mean}} and pasting it
#' into a sentence
#'
#' @examples
#' printMean(c(1,2,3))
#'

To document datasets in your compendium open up a new R script with the name of your dataset in the /R directory (in our case dummyTempData.R) (remember usethis::use_r("dummyTempData")?) and describe your data set before specifying the name of your dataset as a string. As a bare minimum this could look like the following:

#' @title Temperature data from place X
#' @description Temperature measurements (2m height) measured on March 21st, 2009 at place x
#' @format A \code{data.frame} with 2 columns, which are:
#' \describe{
#' \item{id}{id of the sensor}
#' \item{temp}{temperature in °C}
#'}
"dummyTempData"

Unfortunately, here you can’t insert a preset roxygen skeleton, because a data set is not a function. However, if you are too lazy to document your dataset by hand, you can use the sinew package which is very convenient with package documentation using roxygen2. To get a pre-made skeleton for the data set just run sinew::makeOxygen(obj = dummyTempData) and you will get a perfect skeleton printed to the console, that you can copy paste to your dummyTempData.R.

If you now run devtools::document() or rebuild your package, you will find an automatically generated read only file printMean.Rd and dummyTempData.Rd in the /man folder of your research compendium. If you call ?printMean or ?dummyTempData (after installing and restarting) you open the help file just like with any other R function you know.

Please also have a look at the conventions and guidelines on roxygen2-documentation the quick reference of the most common tags.

Add an analysis script

Add an /analysis folder with an rmarkdown document report.Rmd to your compendium. Depending on the type of report you want to share you have to edit the yaml header of your report. It can e.g. look something like this:

---
title: "Temperature Analyis"
author: "Selina Baldauf"
date: "28 June, 2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
devtools::load_all(".")
```

# This is the temperature analysis

```{r}
printMean(dummyTempData$temp)
```

Step 3: Version control with git and GitHub

Commit all changes

In the Git tab (next to Build) you can first stage all created files and then commit them or you can do it in the terminal:

git add .
git commit -m "First package version"

Make the compendium a GitHub repository

For this, you can conveniently use the usethis package:

usethis::use_github()

This will initialize a new GitHub repository from your compendium and will add the url to DESCRIPTION.

If you do this for the first time it is possible that you run into some issues because you might not have set up everything correctly, yet. But the usethis package will help you through all the steps with its error messages. One thing you definitely need for this step to work is a PAT (personal acces token) for GitHub added to your .Renviron file. Check if you have one or create one with usethis::browse_github_token().

Add a readme to your github project

Add a README.md to your package that will be rendered on the webpage of your GitHub repository.

usethis::use_readme_md()

In there, you can describe your package and add some information on how to download it and how to get started with the package. You can also do this on the GitHub webpage of your repository.

Brief digression: Include large external data sets with `piggyback`

If you use an R package as a research compendium, you might also want to share large datasets along with it. The problem is, that you cannot manage those large files anymore by tracking them with git and commiting directly to GitHub. However, GitHub allows you to attach up to 2 GB of files to each release of your project. The piggyback package takes advantage of this and allows you to create GitHub releases, upload and download large datasets associated with your compendium.

First, we simulate a large datafile by creating a dataset in our data-raw folder (if you have really large file you would not save them as .txt but rather compress them first).

largeDataSet <- data.frame(id = 1:1e6, temp = rnorm(1e6,15,2))
write.table(largeDataSet, "data-raw/largeDataSet.txt")

To use the piggyback package, you first need to install it on your machine with install.packages("piggyback"). To upload our dataset, we first need to create a new release on GitHub. The first release has to be created on GitHub, by clicking on Create a new release, giving it a Tag (e.g. v0.0.1) and a title. You will see that the release contains your source code as zip and as a tarball (.tar.gz).

With piggyback you can always list your existing releases by running

piggyback::pb_list()

You can also create a new release directly from R by calling

piggyback:pb_new_release("v0.0.2")

To track your dataset (which means that everytime the dataset changes you can create a new release), call

piggyback::pb_track("data-raw/largeDataSet.txt")

To piggyback your dataset to your package, you now simply have to run

piggyback::pb_upload("data-raw/largeDataSet.txt")

Now you can find it attached to your latest release.

If you want to use the large dataset in your package for the analysis or to generate the report, you can write a function to download the data using piggyback::pb_download. Just create a new file loadData.R in your /R directory and add the following function.

loadData <- function(){
  piggyback::pb_download(repo = "selinaZitrone/YoMos2020")
}

If this function is called, it will automatically download all the assets attached to the latest release of the repository and put them in a /data-raw folder. Now the data can be accessed by all the package functions. Have a look at ?piggyback::pb_download to see which parameters you can specify. You don’t have to download all the assets and you can also chose which version of the release should be used for the download.

You can also create new releases directly from piggyback and you can track large datasets by creating new releases if the datasets change. This will not be covered here, but if you are interested have a look at the piggyback documentation.

Step 4: Unit testing

First, create the infrastructure needed by the thestthat package.

usethis::use_testthat()

This creates a /tests folder which contains a /testthat folder and an R-script called testthat.R. You do not need to change anything in the testthat.R script. This is called e.g. by rcmdcheck::rcmdcheck() to run all the tests you created in your package. The /testthat folder is the place to store the R-scripts with the actual tests. Also, the testthat package ist now added to the Suggests field in DESCRIPTION.

Now you can add a test script that tests your simple printer function. To create a test script for a specific function, open the R-script with the function and then initialize a basic test file for the script by calling

usethis::use_test()

The testfile can be found in /tests/testthat. Testfiles have to be named test-xyz.R in order to be found (usethis::use_test() does the naming for you - so you don’t have to worry about it).

We can write a very simple test function for our printMean function that checks if it works correctly

test_that("printer function works", {
  expect_equal(printMean(c(1,2,3)), "The mean value is 2!")
})

If you run the test and you don’t get any errors in the console, then it means that your test was successful. If not you will get a helpful error message telling you what went wrong.

If you want to run all the tests from your package at once to see if they pass your can run

devtools::test()

Step 5: CI with GitHub Actions

Before we start, it’s time to commit all the changes to our package. Use the Git panel or run

git add .
git commit -m "Add unit tests for printMean function"

Then create the basic file structure needed to set up a workflow by calling

usethis::use_github_actions()

This creates a .github folder, in which you find a .gitignore file and a /workflows folder. In this folder you can store all your GitHub actions in .yaml file format. The usethis::use_github_actions() function automatically added a first GitHub action called R-CMD-check.yaml. This action is triggered on every push or pull request on the master branch and checks your package to make sure that new code does not break the package.

If you want to add other actions that already are available (e.g. from the GitHub repository r-lib/actions) you can do the following:

usethis::use_github_action("render-readme.yaml")

This adds the respective yaml-file to ./github/workflows. Of course you can also write your own actions and simply add them to the /workflows folder.

Add RStudio server with `holepunch`

There is a nice intro for holepunch: Getting started with holepunch. The code below is copied’n’pasted from the holepunch repo’s readme.

remotes::install_github("karthik/holepunch")

write_dockerfile(maintainer = "your_name") 
# To write a Dockerfile. It will automatically pick the date of the last 
# modified file, match it to that version of R and add it here. You can 
# override this by passing r_date to some arbitrary date
# (but one for which a R version exists).

generate_badge() # This generates a badge for your readme.

# ----------------------------------------------
# At this time 🙌 push the code to GitHub 🙌
# ----------------------------------------------

# And click on the badge or use the function below to get the build 
# ready ahead of time.
build_binder()
# 🤞🚀

[Be carful here: Since October 2020, GitHub has renamed its default branch for repositories from master to main. If you have a repository with main instead of master as default branch, holepunch will throw an error when trying to connect Binder to your GitHub repository. If this happens make sure that your replace master with main in the .binder/Dockerfile URL and in the URL used in the Readme badge]

Some more tips

You can use devtools::spell_check() to test your entire package for spelling errors. This is very useful if you have reports or other text intensive files in there. By default, the spell check will be for American English. However, you can change the langauage by adding a Language field to your DESCRIPTION file.
If you want to write a new function and generate the corresponding R script in the /R directory, you can conveniently call usethis::use_r("functionName"). This will create a file functionName.R ind the /R directory and open it for you.
Use the goodpractice package to further test your package for parts which can be improved. Running goodpractice::gp().{R} will tell you where your comment lines are too long, where you should consider renaming variables, if something is wrong with your NAMESPACE, …

Links and further reading

The Turing Way a very useful guide to reproducible data science

https://the-turing-way.netlify.app/introduction/introduction.html

Ropensci’s reproducibility guide

https://ropensci.github.io/reproducibility-guide/

Reasearch compendia

A guide to making your data analysis more reproducible (nice talk and more by Karthik Ram)

https://github.com/karthik/rstudio2019

The Turing Way guide’s section on reseach compendia

https://the-turing-way.netlify.app/research_compendia/research_compendia.html

Ressources collection on research compendia

https://research-compendium.science/

R packages

Comprehensive book on how to create an R package

https://r-pkgs.org/

Git and GitHub

CI with github actions

Jim Hester on GitHub Actions with R
Getting started with GitHub Actions
Examples for Actions with R

How to use an .Rmd document as Readme in github and render it automatically

https://www.r-bloggers.com/rendering-your-readme-with-github-actions/ https://www.edwardthomson.com/blog/github_actions_advent_calendar.html

Unit testing

Intro to unit testing in R
Testing with testthat

Holepunch package for R and Binder (adds a live, on-demand RStudio server to your R project on GitHub/GitLab)

https://karthik.github.io/holepunch/

Introduction to Docker for R

https://colinfay.me/docker-r-reproducibility/

Scientific publications

James, Justin. 2012. “10 Classic Mistakes That Plague Software Development Projects.” 2012. https://www.techrepublic.com/blog/10-things/10-classic-mistakes-that-plague-software-development-projects/.

Scheller, Robert M, Brian R Sturtevant, Eric J Gustafson, Brendan C Ward, and David J Mladenoff. 2010. “Increasing the Reliability of Ecological Models Using Modern Software Engineering Techniques.” Frontiers in Ecology and the Environment 8 (5): 253–60. https://doi.org/10.1890/080141.

Baxter, Susan M., Steven W. Day, Jacquelyn S. Fetrow, and Stephanie J. Reisinger. 2006. “Scientific Software Development Is Not an Oxymoron.” PLoS Computational Biology 2 (9): 0975–78. https://doi.org/10.1371/journal.pcbi.0020087.

Merali, Zeeya. 2010. “Computational Science: …Error.” Nature 467 (7317): 775–77. https://doi.org/10.1038/467775a.

Passig, Kathrin, and Johannes Jander. 2013. Weniger Schlecht Programmieren. O’Reilly.

Storer, Tim. 2017. “Bridging the Chasm: A Survey of Software Engineering Practice in Scientific Programming.” ACM Computing Surveys 50 (4). https://doi.org/10.1145/3084225.

Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” American Statistician 72 (1): 80–88. https://doi.org/10.1080/00031305.2017.1375986.

How to create an R package with your data analysis