Efficient R

Scientific workflows: Tools and Tips 🛠️

Dr. Selina Baldauf

2023-11-16

What is this lecture series?

Scientific workflows: Tools and Tips 🛠️

📅 Every 3rd Thursday 🕓 4-5 p.m. 📍 Webex

Main reference

Efficient R book by Gillespie and Lovelace, read it here

What is efficiency?

\[ \textsf{efficiency} = \frac{\textsf{work done}}{\textsf{unit of effort}} \]

Computational efficiency

💻 Computation time
💾 Memory usage

Programmer efficiency

๐Ÿง How long does it take to

  • write code?
  • maintain code?
  • read and understand the code?

Trade-offs and synergies between these two types of efficiency

Today

Principles and tools to make R programming more efficient for 💻

Check out my talk “What they forgot to teach you about R” for the first two levels

Efficient R code and optimization

How can I make my R code faster?

Is R slow?

  • R is slow compared to other programming languages (e.g. C++).

    • R is designed to make programming easy, not fast
  • R is not designed to be memory efficient

  • But: R is fast and memory efficient enough for most tasks.

Should I optimize?

It’s easy to get caught up in trying to remove all bottlenecks. Don’t! Your time is valuable and is better spent analysing your data, not eliminating possible inefficiencies in your code. Be pragmatic: don’t spend hours of your time to save seconds of computer time.
(Hadley Wickham in Advanced R)

Think about

  • How much time do I save by optimizing?
  • How often do I run the code?
  • How much time do I spend optimizing?

Often: Trade-off between readability and efficiency

Should I optimize?

If your code is too slow for you, you can go through these steps:

  1. If possible, run the code somewhere else

Run the code somewhere else

  • For this, RStudio has background jobs

  • Or: run it on a cluster (e.g. FU Curta)

Should I optimize?

If your code is too slow for you, you can go through these steps:

  1. If possible, run the code somewhere else
  2. Identify the critical (slow) parts of your code
  3. Then optimize only the bottlenecks

Identify critical parts of your code

Profiling & Benchmarking to measure the speed and memory use of your code

Profiling R code

What are the speed & memory bottlenecks in my code?

Profiling R code

You can profile a section of code like this:

# install.packages("profvis")
library(profvis)

# Create a data frame with 150 columns and 400000 rows
df <- data.frame(matrix(rnorm(150 * 400000), nrow = 400000))

profvis({
  # Calculate mean of each column and put it in a vector
  means <- apply(df, 2, mean)

  # Subtract mean from each value in the table
  for (i in seq_along(means)) {
    df[, i] <- df[, i] - means[i]
  }
})

Profiling R code

Profvis flame graph shows time and memory spent in each line of code.

Profiling R code

Profvis data view for details on time spent in each function in the call stack.

Profiling R code

You can also interactively profile code in RStudio:

  • Go to Profile -> Start profiling
  • Now interactively run the code you want to profile
  • Go to Profile -> Stop profiling to see the results

Benchmarking R code

Which version of the code is faster?

# Fill a data frame in a loop
f1 <- function() {
  x <- data.frame(a = numeric(), b = numeric())
  for (i in 1:1e4) {
    x[i, ] <- c(i, i)
  }
}

# Fill a data frame directly with vectors
f2 <- function() {
  x <- data.frame(a = 1:1e4, b = 1:1e4)
}

Benchmarking R code

Use the microbenchmark package to compare the functions:

# install.packages("microbenchmark")
library(microbenchmark)

compare_functions <- microbenchmark(
  old = f1(),
  new = f2(),
  times = 10 # default is 100
)

compare_functions
#> Unit: microseconds
#>  expr         min          lq        mean      median          uq         max
#>   old 1194240.102 2346931.900 2355219.671 2453898.601 2562554.701 2710909.900
#>   new      78.201     209.501     455.951     296.601     513.601    1797.301
#>  neval cld
#>     10  a 
#>     10   b

We can look at benchmarking results using ggplot

library(ggplot2)
autoplot(compare_functions)

Benchmarking R code

Optimize your code

  • Basic principles
  • Data analysis bottlenecks
  • Advanced optimization: Parallelization and C++

Basic principles

Vectorize your code

  • Vectors are central to R programming
  • R is optimized for vectorized code
    • Implemented directly in C/Fortran
  • Vector operations can often replace for-loops in R
  • If there is a vectorized version of a function: Use it

Vectorize your code

Example: Calculate the log of every value in a vector and sum up the result

# A vector with 1 million values
x <- 1:1e6

microbenchmark(
  for_loop = {
    log_sum <- 0
    for (i in 1:length(x)) {
      log_sum <- log_sum + log(x[i])
    }
  },
  sum = sum(log(x)),
  times = 10
)
#> Unit: milliseconds
#>      expr     min      lq     mean   median      uq      max neval cld
#>  for_loop 66.0959 76.7144 86.57094 85.79545 96.1255 106.1095    10  a 
#>       sum 23.4454 28.5556 41.80719 37.01450 48.5042  75.6973    10   b

For-loops in R

  • For-loops are relatively slow and it is easy to make them even slower with bad design

  • Often they are used when vectorized code would be better

  • For loops can often be replaced, e.g. by

    • Functions from the apply family (e.g. apply, lapply, …)
    • Vectorized functions (e.g. sum, colMeans, …)
    • Vectorized functions from the purrr package (e.g. map)

But: For-loops are not necessarily bad; sometimes they are the best solution and more readable than vectorized code.
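As a small sketch of these replacements (the example list here is made up), the same per-element computation can be written as a loop or with the apply family:

```r
x <- list(a = 1:10, b = 11:20)

# For-loop version: pre-allocate the result, then fill it
means_loop <- numeric(length(x))
for (i in seq_along(x)) {
  means_loop[i] <- mean(x[[i]])
}

# apply-family version: one call, no bookkeeping
means_sapply <- sapply(x, mean)

# purrr version (requires the purrr package):
# means_purrr <- purrr::map_dbl(x, mean)

# Both give c(5.5, 15.5)
```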

Cache variables

If you use a value multiple times, store it in a variable to avoid re-calculation

Example: Calculate column means and normalize them by the standard deviation

# A matrix with 1000 columns
x <- matrix(rnorm(10000), ncol = 1000)

microbenchmark(
  no_cache = apply(x, 2, function(i) mean(i) / sd(x)),
  cache = {
    sd_x <- sd(x)
    apply(x, 2, function(i) mean(i) / sd_x)
  }
)
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq      max neval cld
#>  no_cache 43.516000 74.304451 91.975095 87.630301 109.62345 282.6212   100  a 
#>     cache  3.277201  5.801001  8.487946  7.329301  10.62315  19.3381   100   b

Efficient data analysis

Efficient workflow

  • Prepare the data to be clean and concise for analysis
    • Helps to avoid unnecessary calculations
  • Save intermediate results
    • Donโ€™t re-run time consuming steps if not necessary
  • Use the right packages and functions
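The "save intermediate results" step above can be a simple cache built from saveRDS()/readRDS(). A minimal sketch (the file name and the slow step are made up):

```r
cache_file <- file.path(tempdir(), "expensive_result.rds")

if (file.exists(cache_file)) {
  # A cached result exists: load it instead of re-computing
  result <- readRDS(cache_file)
} else {
  # Time-consuming step: compute once, then save to disk
  result <- colMeans(matrix(rnorm(1e6), ncol = 100))
  saveRDS(result, cache_file)
}
```

On the next run of the script, the expensive step is skipped and the saved result is loaded instead.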

Read data

Example: Read csv data on worldwide emissions of greenhouse gases (~14000 rows, 7 cols).

  • Base-R functions to read csv files are:
    • read.table
    • read.csv
  • There are many alternatives to read data, e.g.:
    • read_csv from the readr package (tidyverse)
    • fread from the data.table package

Read data

Compare some alternative reading functions

file_path_csv <- here::here("slides/data/ghg_ems_large.csv")

compare_input <- microbenchmark::microbenchmark(
  read.csv = read.csv(file_path_csv),
  read_csv = readr::read_csv(
    file_path_csv,
    progress = FALSE,
    show_col_types = FALSE
  ),
  fread = data.table::fread(file_path_csv, showProgress = FALSE),
  times = 10
)

autoplot(compare_input)

Read data

Use plain text data

Reading plain text is faster than excel files

file_path_xlsx <- here::here("slides/data/ghg_ems_large.xlsx")

compare_excel <- microbenchmark(
  read_csv = readr::read_csv(
    file_path_csv,
    progress = FALSE,
    show_col_types = FALSE
  ),
  read_excel = readxl::read_excel(file_path_xlsx),
  times = 10
)

autoplot(compare_excel)

Use plain text data

Write data

  • Base-R functions to write files are:
    • write.table
    • write.csv
  • Faster alternatives are:
    • write_csv from the readr package (tidyverse)
    • fwrite from the data.table package
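Analogous to the reading benchmark, the writing functions could be compared like this (a sketch; the data frame is made up and the readr, data.table, and microbenchmark packages are assumed to be installed):

```r
# Example data to write out
df <- data.frame(a = rnorm(1e5), b = rnorm(1e5))

compare_output <- microbenchmark::microbenchmark(
  write.csv = write.csv(df, file.path(tempdir(), "base.csv"), row.names = FALSE),
  write_csv = readr::write_csv(df, file.path(tempdir(), "readr.csv")),
  fwrite = data.table::fwrite(df, file.path(tempdir(), "dt.csv")),
  times = 10
)

compare_output
```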

Write data

Efficient data manipulation

Different packages offer fast and efficient data manipulation and analysis:

  • dplyr package has a C++ backend and is often faster than base R
  • data.table package is fast and memory efficient
    • Syntax is quite different from base R and tidyverse
  • collapse package is C++ based and specifically developed for fast data analysis
    • Works together with both tidyverse and data.table workflows
    • Many functions are similar to base R or dplyr, just with the prefix “f” (e.g. fselect, fmean, …)

Summarize data by group

Example: Summarize mean carbon emissions from Electricity by Country

library(data.table)
library(dplyr)
library(collapse)

Summarize data by group

Example: Summarize mean carbon emissions from Electricity by Country

# 1. The data table way
# Convert the data to a data.table
setDT(ghg_ems)
summarize_dt <- function() {
  ghg_ems[, mean(Electricity, na.rm = TRUE), by = Country]
}

# 2. The dplyr way
summarize_dplyr <- function() {
  ghg_ems |>
    group_by(Country) |>
    summarize(mean_e = mean(Electricity, na.rm = TRUE))
}

# 3. The collapse way
summarize_collapse <- function() {
  ghg_ems |>
    fgroup_by(Country) |>
    fsummarise(mean_e = fmean(Electricity))
}

Summarize data by group

Example: Summarize mean carbon emissions from Electricity by Country

# compare the speed of all versions
microbenchmark(
  dplyr = summarize_dplyr(),
  data_table = summarize_dt(),
  collapse = summarize_collapse(),
  times = 10
)
#> Unit: microseconds
#>        expr      min       lq      mean   median       uq       max neval cld
#>       dplyr 5126.601 5452.801 9086.5510 5790.001 7519.301 31347.801    10  a 
#>  data_table 1267.202 1635.500 2690.5109 1741.601 2564.101  8412.702    10   b
#>    collapse  204.000  253.902  711.5509  513.201  781.902  1916.601    10   b

Select columns

Example: Select columns Country, Year, Electricity, Transportation

microbenchmark(
  dplyr = select(ghg_ems, Country, Year, Electricity, Transportation),
  data_table = ghg_ems[, .(Country, Year, Electricity, Transportation)],
  collapse = fselect(ghg_ems, Country, Year, Electricity, Transportation),
  times = 10
)
#> Unit: microseconds
#>        expr      min       lq      mean   median       uq      max neval cld
#>       dplyr 1409.801 1628.102 2282.1010 1765.351 2694.201 4544.301    10 a  
#>  data_table  696.001 1067.101 1399.4208 1277.101 1698.501 2612.501    10  b 
#>    collapse    5.701   11.301   27.9211   21.751   37.900   80.401    10   c

Advanced optimization

Parallelization and C++

Parallelization

By default, R works on a single core, but modern CPUs have multiple cores

# Find out how many cores you have with the parallel package
# install.packages("parallel")
parallel::detectCores()
#> [1] 14

Sequential vs. parallel execution of tasks across cores (diagram)

Parallelization with the futureverse

  • future is a framework to help you parallelize existing R code
    • Parallel versions of base R apply family
    • Parallel versions of purrr functions
    • Parallel versions of foreach loops
  • Find more details here
  • Find a tutorial for different use cases here

A slow example

Letโ€™s create a very slow square root function

slow_sqrt <- function(x) {
  Sys.sleep(1) # simulate 1 second of computation time
  sqrt(x)
}


Before you run anything in parallel, tell R how many cores to use:

library(future)
# Plan parallel session with 6 cores
plan(multisession, workers = 6)

Parallel apply functions

To run the function on a vector of numbers we could use

The sequential version

# to measure the runtime
library(tictoc)

# create a vector of 10 numbers
x <- 1:10

tic()
result <- lapply(x, slow_sqrt)
toc()
#> 10.2 sec elapsed

The parallel version

# Load future.apply package
library(future.apply)

tic()
result <- future_lapply(x, slow_sqrt)
toc()
#> 2.19 sec elapsed

Parallel apply functions

Selected base R apply functions and their future versions:

base future.apply
lapply future_lapply
sapply future_sapply
vapply future_vapply
mapply future_mapply
tapply future_tapply
apply future_apply
Map future_Map
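For example, the sapply replacement from the table is a drop-in substitute (a small sketch, assuming the future and future.apply packages are installed):

```r
library(future)
library(future.apply)

# Plan a parallel session with 2 workers
plan(multisession, workers = 2)

# Drop-in parallel replacement for sapply(): returns a numeric vector
result <- future_sapply(1:4, function(x) x^2)
result
#> [1]  1  4  9 16

# Close the workers again
plan(sequential)
```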

Parallel for loops

A normal for loop:

z <- list()
for (i in 1:10) {
  z[[i]] <- slow_sqrt(i)
}

Use foreach to write the same loop

library(foreach)
z <- foreach(i = 1:10) %do% {
  slow_sqrt(i)
}

Parallel for loops

Use doFuture and foreach package to parallelize for loops

The sequential version

library(foreach)

tic()
z <- foreach(i = 1:10) %do% {
  slow_sqrt(i)
}
toc()
#> 10.2 sec elapsed

The parallel version

library(doFuture)

tic()
z <- foreach(i = 1:10) %dofuture% {
  slow_sqrt(i)
}
toc()
#> 2.36 sec elapsed

Close multisession

When you are done working in parallel, explicitly close your multisession:

# close the multisession plan
plan(sequential)

Replace slow code with C++

  • Use the Rcpp package to re-write R functions in C++
  • Rcpp is also used internally by many R packages to make them faster
  • Requirements:
    • C++ compiler installed
    • Some knowledge of C++
  • See this book chapter and the online documentation for more info

Rewrite a function in C++

Example: R function to calculate Fibonacci numbers

# A function to calculate Fibonacci numbers
fibonacci_r <- function(n) {
  if (n < 2) {
    return(n)
  } else {
    return(fibonacci_r(n - 1) + fibonacci_r(n - 2))
  }
}


# Calculate the 30th Fibonacci number
fibonacci_r(30)
#> [1] 832040

Rewrite a function in C++

Use cppFunction to rewrite the function in C++:

library(Rcpp)

# Rewrite the fibonacci_r function in C++
fibonacci_cpp <- cppFunction(
  'int fibonacci_cpp(int n){
    if (n < 2){
      return(n);
    } else {
      return(fibonacci_cpp(n - 1) + fibonacci_cpp(n - 2));
    }
  }'
)


# calculate the 30th Fibonacci number
fibonacci_cpp(30)
#> [1] 832040

Rewrite a function in C++

You can also source C++ functions from C++ scripts.

C++ script fibonacci.cpp:

#include <Rcpp.h>

// [[Rcpp::export]]
int fibonacci_cpp(const int x) {
   if (x < 2) return(x);
   return(fibonacci_cpp(x - 1) + fibonacci_cpp(x - 2));
}

Then source the function in your R script using sourceCpp:

sourceCpp("fibonacci.cpp")

# Use the function in your R script like you are used to
fibonacci_cpp(30)

How much faster is C++?

microbenchmark(
  r = fibonacci_r(30),
  rcpp = fibonacci_cpp(30),
  times = 10
)
#> Unit: microseconds
#>  expr      min       lq      mean    median       uq      max neval cld
#>     r 568264.4 739476.7 773932.16 780669.50 858716.4 912403.6    10  a 
#>  rcpp    957.7   1030.6   2088.05   1462.25   2233.2   4997.2    10   b

Summary

Next lecture

Topic t.b.a.


📅 18th January 🕓 4-5 p.m. 📍 Webex

🔔 Subscribe to the mailing list

📧 For topic suggestions and/or feedback, send me an email

Thank you for your attention :)

Questions?