packages_essential <- c("devtools", "usethis", "roxygen2")
# install the essential packages
install.packages(setdiff(packages_essential, rownames(installed.packages())))
packages_optional <- c("goodpractice", "testthat", "piggyback")
# install the optional packages
install.packages(setdiff(packages_optional, rownames(installed.packages())))
Research compendia as R packages
2023-07-20
Description
A research compendium is a collection of all the digital parts of your research project (data, code, documents), which is often required to be published alongside a scientific paper. The structure and tools of R packages provide a standard workflow for building such reproducible research compendia. In this lecture, you will learn how to build, test and publish your research compendia in the form of an R package. This lecture is mainly aimed at people who have some experience with R, RStudio and Git. But even if you are new to these topics, you will still get a valuable overview of key concepts for reproducible research compendia and tools to build them.
How to build your own research compendium - Step by step
If you find any issues or problems in this how-to, feel free to let me know via email or create an issue on GitHub. Thanks!
Preparation
Before you start with the how-to, please check that you have the following things set up and running.
R packages needed
First, you need to install some packages. Some of them are essential for the core workflow, others are optional for some more advanced features that are described later in the how-to.
You can run the following code to install the packages that you still need:
Set up Git and Github
You only need this if you want to use version control with your research compendium, publish on Github and use Github actions to do automated testing of your package. Otherwise, you can skip this step and all the steps in the following guide that involve Git and Github.
To set everything up
- Download and install Git on your machine
- Create a GitHub Account
Personal Access token to use with R
Now, you can store a personal access token (PAT) in your Git credential store. This way, you can be identified to your GitHub account when you communicate with GitHub from R.
To set up your PAT, call:
usethis::create_github_token()
This will open GitHub in the browser (you might need to enter your password).
On top, enter a note on what you want to use the token for (e.g. RStudio on my work computer). Then select an appropriate expiration date (90 days is fine in most cases - GitHub will send you an email when the token is about to expire so you can renew it).
Scroll over the scopes, select "repo", "user", "workflow" (usually this is already the default setting) and then click on "Generate token" at the bottom.
Leave the browser page open, copy the generated token to your clipboard and go back to RStudio. Add your PAT to the credential store (don't do this if you share your computer with other people or they can access your GitHub account) by calling:
gitcreds::gitcreds_set()
Follow the prompt in the console and enter your PAT. Now you should be able to work with GitHub.
Install Rtools (only on Windows)
If you are on a Windows machine and you want to build R packages, you need to install Rtools, which contains the toolchains for building packages.
You can check if you already have Rtools installed with
# Returns TRUE if Rtools is correctly installed
devtools::find_rtools()
If you need to install it, you can download Rtools from the Rtools page on CRAN. Choose the appropriate version for your version of R and follow the installation process.
Step 1: Create an empty R package
Think of a name for your R package that respects the following rules:
- only ASCII letters, numbers and "."
- at least two characters
- starts with a letter
- does not end with "."
In this example, I chose temperatureBerlin because my example package contains some temperature analysis.
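The naming rules above can be checked with a single regular expression. The following is only a sketch: isValidPackageName() is a hypothetical helper and not part of usethis or devtools. It encodes just the four rules listed here and does not check whether a name is already taken on CRAN.

```r
# Hypothetical helper: TRUE if `name` follows the four naming rules above.
isValidPackageName <- function(name) {
  # first character: a letter; middle: letters, digits or ".";
  # last character: a letter or digit (so at least two characters in total)
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", name, perl = TRUE)
}

isValidPackageName("temperatureBerlin")   # TRUE
isValidPackageName("temperature_berlin")  # FALSE: "_" is not allowed
isValidPackageName("2fast")               # FALSE: starts with a number
isValidPackageName("temp.")               # FALSE: ends with "."
```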
Create your package by running
usethis::create_package("path/to/package/temperatureBerlin")
This will create a new folder, initialize an R package project in there and open the project in a new RStudio session.
Have a look at your files pane (usually bottom right in RStudio) to see which files were already added.
During package development
Check out the options that you have in the Build tab in RStudio (you might have to restart RStudio to see it). You can do everything described below either via the buttons in RStudio or by calling functions. During the entire package building process the devtools package is your friend and you can (and should)
- frequently check your package by clicking the Check button or by calling devtools::check() (keyboard shortcut Ctrl/Cmd + Shift + E)
- test your functions after running devtools::load_all()
  - this simulates installing and loading the package
  - much better than testing your functions in your current environment
- build the documentation with devtools::document() (Ctrl/Cmd + Shift + D)
  - you have to do this once in the beginning, afterwards devtools::check() will also build the documentation for you
- install and load the package and restart R with Ctrl/Cmd + Shift + B or with the Install button
You can already try this out now, because your project is already an (empty) R package.
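If you prefer the console over the buttons, the load, document and check steps can be bundled into one call. This is only a sketch under the assumption that devtools is installed; devCycle() is a hypothetical convenience wrapper and not a function provided by devtools.

```r
# Hypothetical convenience wrapper around the devtools calls described above.
devCycle <- function(path = ".") {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("Please install devtools first.")
  }
  devtools::load_all(path)  # simulate installing and loading the package
  devtools::document(path)  # regenerate man/ and NAMESPACE
  devtools::check(path)     # run R CMD check and return its results
}

# Run from the package root, e.g. in the RStudio console:
# devCycle()
```

Keeping the three calls in this order means that the check always runs against freshly generated documentation.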
Step 2: Initialize a Git repository
To initialize a Git repository for your package, call
usethis::use_git()
Follow the prompts in the console, commit your first files with the commit message "Initial commit" and restart RStudio.
Step 3: Create a research compendium
Fill out the DESCRIPTION file
The DESCRIPTION file contains important metadata about your package and describes which other packages it needs to run correctly. If an R project contains a DESCRIPTION file, RStudio and devtools will consider it an R package.
There are a few conventions on how to fill out the fields of the DESCRIPTION file. Have a look at the R packages book to find out more about this.
For now, just look at how usethis already created a template for you. Fill out the dummy fields of the description file, e.g.
Package: temperatureBerlin
Type: Package
Title: Analyse Temperature Data From Berlin
Authors@R:
    person(given = "Selina",
           family = "Baldauf",
           role = c("aut", "cre"),
           email = "selina.baldauf@fu-berlin.de")
Description: Analysis of the temperature data collected in Berlin from our
    research project xyz presented in paper abc.
License: `use_mit_license()`, `use_gpl3_license()` or friends to
    pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Add a license
Use the usethis package to add a license, e.g. for the GNU General Public License 3 (GPL-3) you can use
usethis::use_gpl3_license()
This adds a file LICENSE.md with your name (or the name(s) specified in the DESCRIPTION file) in it to the project and automatically updates the license field in DESCRIPTION.
Add data
The correct folder for datasets you want to make available in your research compendium is the data/ folder. However, the convention of R packages is that you only store binary .rda files (also known as .RData files) in there. All data that is not in this format (e.g. csv files) can go into the data-raw/ folder, where it can be pre-processed and turned into .rda files.
Create a dataset with R
We first create a data-raw/ folder either manually or with
usethis::use_data_raw("temperature")
This will create a temperature.R file in data-raw/ which you can use to pre-process your raw data and then add it to the data/ folder as an .rda file.
In this R script, you can either create a fake data set using only R code, or you can read a csv file (or other data file) that you put into the data-raw/ folder before.
At the end of the script, you call the usethis::use_data() function to compress your data and store it as an .rda file in your package.
So, you can just put something like this in the temperature.R file and then run the script:
# create a dummy dataset for temperature measurements
temperature <- data.frame(id = 1:10, temp = rnorm(10, mean = 15, sd = 2))
# or alternatively, if you have temperature data as csv in data-raw/
# temperature <- readr::read_csv("data-raw/temperature.csv")

# compress and store the data as .rda file
usethis::use_data(temperature, overwrite = TRUE)
Now have a look at the data/ folder in your project, where you will find a temperature.rda version of the data.
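To get a feeling for what this binary file is, you can reproduce the idea with base R. As far as I know, usethis::use_data() essentially wraps save(); the sketch below is only an illustration under that assumption, not the actual implementation, and it writes to a temporary folder instead of data/.

```r
# Dummy dataset as in data-raw/temperature.R
temperature <- data.frame(id = 1:10, temp = rnorm(10, mean = 15, sd = 2))

# Save it as a compressed binary file (a temporary folder stands in for data/)
rda_file <- file.path(tempdir(), "temperature.rda")
save(temperature, file = rda_file, compress = "bzip2")

# Load the file into a fresh environment and check what it contains
loaded <- new.env()
load(rda_file, envir = loaded)
ls(loaded)                                  # name of the stored object
identical(loaded$temperature, temperature)  # same data as before saving
```

The file stores the object together with its name, which is why the dataset is later available as temperature once the package is loaded.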
Add an analysis function
All R functions for your analysis go in the R/ folder. You can group related functions together in the same script but unrelated functions should go into separate scripts.
Here, we create a simple function that prints the mean of a vector. We save this function in a new R script called printMean.R. To create this script in the correct location, you can conveniently run
usethis::use_r("printMean")
This will create, save and open the script for you.
Now add the following simple function to the script:
# a simple printer function for the mean of a vector
printMean <- function(x){
  print(paste0("The mean value is ", mean(x, na.rm = TRUE), "!"))
}
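A quick usage sketch shows what the function does with and without missing values. The function is repeated here only so that the example runs on its own.

```r
# Same function as above, repeated so that the example is self-contained
printMean <- function(x) {
  print(paste0("The mean value is ", mean(x, na.rm = TRUE), "!"))
}

printMean(c(1, 2, 3))     # prints [1] "The mean value is 2!"
printMean(c(10, NA, 20))  # NA is dropped, prints [1] "The mean value is 15!"

# print() returns its argument invisibly, so the sentence can also be stored
sentence <- printMean(c(1, 2, 3))
```

Because the sentence is returned as well as printed, the function is easy to check in a unit test later on.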
If you forgot to do it, you can run devtools::check() now to see if everything is ok in your package. You will get a warning that you did not document your data and your function yet. This is what we will do in the next step.
Document function and data with roxygen2
You can document your functions and data using the package roxygen2. Roxygen2 will automatically generate .Rd documentation files in the man/ folder, update the NAMESPACE if you import or export functions and update the DESCRIPTION file if you include any functions from other packages.
The basic workflow is:
- insert a roxygen skeleton for your function
  - either use the RStudio GUI Code | Insert Roxygen Skeleton or the keyboard shortcut Ctrl + Alt + Shift + R (on Windows)
- call devtools::document() or use Ctrl/Cmd + Shift + D to generate the documentation in the man/ folder and update the NAMESPACE
To create documentation for our simple printMean function, navigate to the respective R script, click inside the function and add a roxygen skeleton.
Fill out the skeleton by adding all the information about the function using the roxygen tags. E.g. your roxygen description could look like the following:
#' @title Print mean values in a sentence
#' @description This function calculates the mean of its input and prints it
#' into a sentence. The function is called in the analysis.Rmd file
#' @export
#' @param x A vector of numeric values of which to calculate the mean
#'
#' @return The output of \code{\link{print}} after calculating the
#' \code{\link{mean}} and pasting it into a sentence
#'
#' @examples
#' printMean(c(1,2,3))
#'
To document datasets in your compendium, create a new R script with the same name as your dataset in the R/ directory (in our case temperature.R; remember usethis::use_r("temperature")?) and describe your data set before specifying the name of your dataset as a string. As a bare minimum this could look like the following:
#' @title Temperature data from place X
#' @description Temperature measurements (2m height) measured on March 21st,
#' 2009 at place x
#' @format A \code{data.frame} with 2 columns, which are:
#' \describe{
#' \item{id}{id of the sensor}
#' \item{temp}{temperature in °C}
#'}
"temperature"
Unfortunately, for data sets you can't automatically insert a preset roxygen skeleton.
If you are too lazy to document your dataset by hand, you can use the sinew package, which makes package documentation with roxygen2 very convenient. Install the package with
# install.packages("remotes")
remotes::install_github('yonicd/sinew')
and then run
sinew::makeOxygen(obj = temperature)
This will print a pre-made skeleton for the data to the console, and you can just copy and paste it into your temperature.R file and then fill out the placeholders.
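If you do not want to install an extra package, a rough skeleton can also be generated with a few lines of base R. The sketch below uses a hypothetical helper, dataSkeleton(), which is not part of roxygen2 or sinew and whose output format is only an assumption modelled on the minimal example above.

```r
# Hypothetical helper (not part of roxygen2 or sinew): build a roxygen
# skeleton for a data frame from its column names and classes.
dataSkeleton <- function(df, name) {
  items <- sprintf(
    "#'   \\item{%s}{%s, DESCRIPTION}",
    names(df),
    vapply(df, function(col) class(col)[1], character(1))
  )
  c(
    "#' @title TITLE",
    "#' @description DESCRIPTION",
    sprintf("#' @format A \\code{data.frame} with %d rows and %d columns:",
            nrow(df), ncol(df)),
    "#' \\describe{",
    items,
    "#' }",
    sprintf("\"%s\"", name)
  )
}

temperature <- data.frame(id = 1:10, temp = rnorm(10, mean = 15, sd = 2))
# Print the skeleton to the console, ready to copy into R/temperature.R
writeLines(dataSkeleton(temperature, "temperature"))
```

The placeholders TITLE and DESCRIPTION still have to be filled in by hand, just like with the skeleton printed by sinew.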
If you now run devtools::document() or rebuild your package, you will find the automatically generated read-only files printMean.Rd and temperature.Rd in the man/ folder of your research compendium. If you now call ?printMean or ?temperature (after installing and restarting) you open the help file just like with any other R function you know.
Please also have a look at the conventions and guidelines on roxygen2-documentation and the quick reference of the most common tags.
Add an analysis script
We will now add an analysis script in the form of a Quarto document (could also be an R Script instead). Here we write some analysis using the data and functions that we just defined.
Add an analysis/ folder with a Quarto document report.qmd to your compendium. This can for example look like this:
---
title: "Temperature Analysis"
author: "Selina Baldauf"
date: "15 February, 2024"
---
```{r setup, include=FALSE}
devtools::load_all(".")
```
# This is the temperature analysis
```{r}
printMean(temperature$temp)
```
With the devtools::load_all(".") command, we tell the document to first load all functions included in the package before running the code chunks.
Step 4: Version control with Git and GitHub
Commit all changes
It's time to commit all the changes to our package. There are many options for how to do this. If you already use Git, use your personal favorite.
One simple option is to look for the Git pane (next to Build) in RStudio, then stage all created files (by selecting them in the checkbox) and then click commit. You are asked to enter a commit message and then you can click on "Commit" to actually make the commit.
Alternatively, you could do this in the terminal:
git add .
git commit -m "First package version"
Make the compendium a GitHub repository
To create a GitHub repository for your package, you can conveniently use the usethis package:
usethis::use_github()
Follow the prompts in the console and this will initialize a new GitHub repository from your compendium and add the URL to DESCRIPTION.
If you do this for the first time, it is possible that you run into some issues if you don't have a correct setup yet. But the usethis package will help you through all the steps with its error messages. One thing you definitely need for this step to work is a PAT (personal access token) for GitHub that R can find. If you don't have one yet, create one with usethis::create_github_token(). You can find a description of how to get one in the beginning of this how-to.
Step 5: Add a readme to your GitHub project
Add a README.md to your package that will be rendered on the webpage of your GitHub repository.
usethis::use_readme_md()
In there, you can describe your package and add information on how to download and get started with it. Now you have to commit and push your changes to GitHub for them to be effective. You can use the same commit strategy as described above for RStudio. Then you can click on the green push button to publish your changes to GitHub.
Now your package is published to GitHub and can be downloaded and installed by others!
Below you find some extra steps you can take to include more functionality in your package.
Extras
How to include large external data sets with piggyback
If you use an R package as a research compendium, you might also want to share large datasets along with it. The problem is that you cannot manage those large files anymore by tracking them with Git and committing directly to GitHub. However, GitHub allows you to attach up to 2 GB of files to each release of your project. The piggyback package takes advantage of this and allows you to create GitHub releases, upload and download large datasets associated with your compendium.
First, we simulate a large data file by creating a dataset in our data-raw folder (if you have really large files you would not save them as .txt but rather compress them first).
largeDataSet <- data.frame(id = 1:1e6, temp = rnorm(1e6, 15, 2))
write.table(largeDataSet, "data-raw/largeDataSet.txt")
To use the piggyback package, you first need to install it on your machine with install.packages("piggyback"). To upload our dataset, we first need to create a new release on GitHub. The first release has to be created on the GitHub website, by clicking on Create a new release, giving it a tag (e.g. v0.0.1) and a title. You will see that the release contains your source code as zip and as a tarball (.tar.gz).
With piggyback you can always list your existing releases by running
piggyback::pb_list()
Now that you created your first release, you can create new releases directly from R with:
piggyback::pb_new_release("v0.0.2")
To track your dataset (which means that every time the dataset changes you can create a new release), call
piggyback::pb_track("data-raw/largeDataSet.txt")
To piggyback your dataset to your package, you now simply have to run
piggyback::pb_upload("data-raw/largeDataSet.txt")
Now you can find it attached to your latest release.
If you want to use the large dataset in your package for the analysis or to generate the report, you can write a function to download the data using piggyback::pb_download. Just create a new file loadData.R in your R/ directory and add the following function.
loadData <- function(){
  piggyback::pb_download(repo = "selinaZitrone/temperture")
}
When this function is called, it will automatically download all the assets attached to the latest release of the repository and put them in data-raw/. Now the data can be accessed by all the package functions. Have a look at ?piggyback::pb_download to see which parameters you can specify. You don't have to download all the assets and you can also choose which version of the release should be used for the download.
You can also create new releases directly from piggyback and you can track large datasets by creating new releases if the datasets change. This will not be covered here, but if you are interested have a look at the piggyback documentation.
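Downloading a large file on every call is wasteful, so one possible variation is to download only when the file is not there yet and then read it. This is only a sketch: loadLargeData() is a hypothetical helper, and it assumes that the release asset is called largeDataSet.txt and that the file, dest, repo and tag arguments of piggyback::pb_download() behave as described in its help page.

```r
# Hypothetical helper: download the large data set only if it is not
# already available locally, then read it into a data frame.
loadLargeData <- function(file = "largeDataSet.txt",
                          dest = "data-raw",
                          repo = "selinaZitrone/temperture",
                          tag = "latest") {
  path <- file.path(dest, file)
  if (!file.exists(path)) {
    # assumption: the asset attached to the release has exactly this name
    piggyback::pb_download(file = file, dest = dest, repo = repo, tag = tag)
  }
  # the file was written with write.table(), so read.table() detects the
  # header and the row names from the file layout
  read.table(path)
}

# Usage (needs internet access the first time):
# largeDataSet <- loadLargeData()
```

Passing a fixed tag instead of "latest" would pin the analysis to one specific version of the data.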
Unit testing
With unit testing, you can test the individual functions in your package to make them more reliable. If you want to read about unit testing, you can check out the testing chapter in the R packages book.
First, create the infrastructure needed by the testthat package.
usethis::use_testthat()
This creates a tests/ folder which contains a testthat/ folder and an R script testthat.R. You do not need to change anything in the testthat.R script. It is called e.g. by rcmdcheck::rcmdcheck() to run all the tests you created in your package. The testthat/ folder is the place to store the R scripts with your actual tests. Also, the testthat package is automatically added to the Suggests field in DESCRIPTION.
Now you can add a test script that tests your simple printer function. To create a test script for a specific function, open the R script with the function and then initialize a basic test file for the script by calling
usethis::use_test()
The test file can be found in tests/testthat. Test files have to be named test-xyz.R in order to be found (usethis::use_test() does the naming for you - so you don't have to worry about it).
We can write a very simple test for our printMean function that checks if it works correctly
test_that("printer function works", {
  expect_equal(printMean(c(1,2,3)), "The mean value is 2!")
})
If you run the test and you don't get any errors in the console, then it means that your test was successful. If not, you will get a helpful error message telling you what went wrong.
If you want to run all the tests from your package at once to see if they pass, you can run
devtools::test()
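Beyond the single test above, it can be worth covering edge cases such as missing values and the printed output itself. The sketch below assumes that testthat is installed and repeats printMean() only so that it runs outside the package; in tests/testthat/test-printMean.R you would leave the function definition out.

```r
library(testthat)

# Repeated only so that the sketch runs on its own
printMean <- function(x) {
  print(paste0("The mean value is ", mean(x, na.rm = TRUE), "!"))
}

test_that("printMean ignores missing values", {
  # the NA is dropped, so the mean of 10 and 20 is reported
  expect_equal(printMean(c(10, NA, 20)), "The mean value is 15!")
})

test_that("printMean prints the sentence to the console", {
  # fixed = TRUE matches the sentence literally instead of as a regex
  expect_output(printMean(c(1, 2, 3)), "The mean value is 2!", fixed = TRUE)
})
```

The first test checks the returned value, the second one checks what actually appears in the console.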
Implement a continuous integration workflow with GitHub Actions
GitHub Actions allows you to run code on GitHub servers every time you push something there. This can e.g. be used to run R CMD check on your package. This way you can automatically make sure that your package builds without errors. Otherwise you will get notified by GitHub.
Before you implement this, commit all the recent changes and push them to GitHub.
Then create the basic file structure needed to set up a workflow by calling
usethis::use_github_actions()
This creates a .github folder, in which you find a .gitignore file and a workflows/ folder. In this folder you can store all your GitHub actions in .yaml file format. usethis automatically added a first GitHub action called R-CMD-check.yaml. This action is triggered on every push or pull request on the master/main branch and checks your package to make sure that new code does not break the package.
If you want to add other actions that already are available (e.g. from the GitHub repository r-lib/actions) you can do the following:
usethis::use_github_action("render-readme.yaml")
This adds the respective yaml-file to .github/workflows. Of course you can also write your own actions and simply add them to the workflows/ folder.
Add an RStudio server with holepunch
There is a nice intro for holepunch: Getting started with holepunch. The code below is copied and pasted from the holepunch repo's readme.
remotes::install_github("karthik/holepunch")

holepunch::write_dockerfile(maintainer = "your_name")
# To write a Dockerfile. It will automatically pick the date of the last
# modified file, match it to that version of R and add it here. You can
# override this by passing r_date to some arbitrary date
# (but one for which a R version exists).

holepunch::generate_badge() # This generates a badge for your readme.

# ----------------------------------------------
# At this time, push the code to GitHub
# ----------------------------------------------
# And click on the badge or use the function below to get the build
# ready ahead of time.
holepunch::build_binder()
Some more tips
- You can use devtools::spell_check() to test your entire package for spelling errors. This is very useful if you have reports or other text intensive files in there. By default, the spell check will be for American English. However, you can change the language by adding a Language field to your DESCRIPTION file.
- If you want to write a new function and generate the corresponding R script in the R/ directory, you can conveniently call usethis::use_r("functionName"). This will create a file functionName.R in the R/ directory and open it for you.
- Use the goodpractice package to further test your package for parts which can be improved. Running goodpractice::gp() will tell you where your comment lines are too long, where you should consider renaming variables, if something is wrong with your NAMESPACE, ...
Links and further reading
- The Turing Way website is a very useful guide to reproducible data science
About research compendia
- A guide to making your data analysis more reproducible: Nice talks and other helpful links and information on research compendia by Karthik Ram
R packages
- Comprehensive book on how to create an R package by Hadley Wickham and Jennifer Bryan
- usethis: workflow package to automate tasks like project setup etc.
- pkgdown: build a quick and easy website for your package
- goodpractice: gives you advice about good practices when building R packages
Git and GitHub
Reproducible environments
- The Holepunch package for R and Binder (adds a live, on-demand RStudio server to your R project on GitHub/GitLab)
Scientific publications
James, Justin. 2012. "10 Classic Mistakes That Plague Software Development Projects." 2012. https://www.techrepublic.com/blog/10-things/10-classic-mistakes-that-plague-software-development-projects/.
Scheller, Robert M, Brian R Sturtevant, Eric J Gustafson, Brendan C Ward, and David J Mladenoff. 2010. "Increasing the Reliability of Ecological Models Using Modern Software Engineering Techniques." Frontiers in Ecology and the Environment 8 (5): 253-60. https://doi.org/10.1890/080141.
Baxter, Susan M., Steven W. Day, Jacquelyn S. Fetrow, and Stephanie J. Reisinger. 2006. "Scientific Software Development Is Not an Oxymoron." PLoS Computational Biology 2 (9): 0975-78. https://doi.org/10.1371/journal.pcbi.0020087.
Merali, Zeeya. 2010. "Computational Science: ...Error." Nature 467 (7317): 775-77. https://doi.org/10.1038/467775a.
Storer, Tim. 2017. "Bridging the Chasm: A Survey of Software Engineering Practice in Scientific Programming." ACM Computing Surveys 50 (4). https://doi.org/10.1145/3084225.
Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. "Packaging Data Analytical Work Reproducibly Using R (and Friends)." American Statistician 72 (1): 80-88. https://doi.org/10.1080/00031305.2017.1375986.