This short post shows some different options for speeding up matrix decompositions in R, which can in turn speed up a wide variety of common statistical functions.
For many statistical algorithms, the computational bottleneck is a matrix decomposition, such as a QR decomposition or a spectral decomposition. That’s the case for the generalized replication method that is implemented in the {svrep} package in R. Almost all of the computation time is taken up with a single spectral decomposition where we retrieve all of the eigenvalues and a full matrix of eigenvectors.
It turns out that there are many ways to perform matrix decompositions in R, and there are multiple approaches one can take to speed them up. You can use better hardware, use multithreading, or try different computational algorithms. This post outlines a few options.
Option 1. Use the Built-in LAPACK / BLAS Implementation
Modern high-level programming languages like R, Python, or Julia all rely on the same basic low-level linear algebra libraries to do their matrix computations. These low-level libraries are called BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), and they are fundamental components of most scientific computing software.
BLAS and LAPACK have been implemented in multiple low-level languages, but their standard reference implementation is in Fortran. The Fortran reference implementation is reliable, widely-used, and works on all the major operating systems. When you install R, the standard reference implementation is installed too. This is what does the heavy lifting when you run R code like the following, which computes a spectral decomposition of a matrix.
```r
# Create an example symmetric matrix
X <- cov(matrix(rnorm(n = 100 * 5), nrow = 100, ncol = 5))

# Compute an eigenvalue decomposition
eigen(X)
```
By extension, the Fortran BLAS / LAPACK implementation is also what does the heavy lifting for many of the computations in standard statistical functions like lm() or glm().
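Relatedly, if you want to see which BLAS / LAPACK libraries your R installation is actually using, recent versions of R can tell you. Here's a quick sketch (the paths shown will vary by machine):

```r
# The 'BLAS' and 'LAPACK' fields show which libraries R is linked against
sessionInfo()

# In recent versions of R, these can also be queried directly
extSoftVersion()["BLAS"]
La_library()
```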
Option 2. Install an Optimized LAPACK / BLAS Version
The standard BLAS / LAPACK version in Fortran is stable and popular, but it leaves a lot of speed on the table. There are other widely-used BLAS / LAPACK versions which can be much faster, especially ones tailored to your computer's hardware and operating system. For example, Intel MKL can be much faster than the reference Fortran version, as long as you're using a computer with an Intel CPU.
Julia and Python (specifically, the NumPy package) come with different BLAS / LAPACK libraries when they're installed. The default Julia installation includes OpenBLAS, while NumPy tries to install a version suited to your operating system.
In R, you have to install an alternate BLAS / LAPACK version yourself. This is easier on macOS or Linux machines, but it's not too difficult on Windows. Here are a few links for other Windows users who want to install OpenBLAS or Intel MKL.
This installation process is very much worth your time. It can make a huge difference in the performance of many different computations in R. And once it’s installed, you don’t have to change anything about your R code. Just use the same R code you were already using.
By default, the matrix operations are single-threaded. However, if your computer has multiple cores available (it almost certainly does), it can sometimes be faster to use multithreaded computations. This can be done by setting environment variables on your computer that tell OpenBLAS, Intel MKL, etc. how many threads they can use. For example, on my personal laptop, I set the environment variable OPENBLAS_NUM_THREADS to 12, so that matrix computations with BLAS / LAPACK use 12 threads. Multithreading can actually reduce performance for small matrices, but it can noticeably improve computation time for large ones.
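Here's roughly what that looks like for OpenBLAS. One caveat: OpenBLAS typically reads this variable when it first loads, so it's best set in a startup file like `~/.Renviron` rather than from within an already-running session.

```r
# In ~/.Renviron (read when R starts, before the BLAS library loads),
# add a line like:
#   OPENBLAS_NUM_THREADS=12

# Then, from within R, confirm the value was picked up
Sys.getenv("OPENBLAS_NUM_THREADS")
```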
Option 3. Use the ‘torch’ Package
The {torch} package in R is similar to Python’s PyTorch, and represents an excellent tool for demanding applications like deep learning or optimization. It provides an interface to performant C++ libraries and is typically the most reliable interface to GPUs when working in R, provided that you have an NVIDIA GPU.
To use {torch}, you first have to go through a couple of installation steps above and beyond simply running install.packages('torch'). Here's more information on that.
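In broad strokes, the process looks something like this (see the {torch} documentation for the details, especially for GPU setups):

```r
# Install the R package itself
install.packages("torch")

# Then download and install the LibTorch backend and related dependencies
# ({torch} will also offer to do this the first time the package is used)
torch::install_torch()
```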
To compute a matrix decomposition using {torch}, we first create a tensor object out of the matrix we want to decompose. The term 'tensor' here simply refers to arrays such as vectors or matrices. By using the device argument and the function torch_device(), we can specify whether we want to allocate the object on the CPU or the GPU. The example code below keeps the data on the CPU and computes an eigendecomposition.
```r
library(torch)

# Create a torch 'tensor' object on the CPU
X_tensor <- X |>
  torch::torch_tensor(
    device = torch::torch_device(type = "cpu")
  )

# Compute the spectral decomposition
# NOTE: `linalg_eigh()` assumes the matrix is Hermitian;
# `linalg_eig()` can be used more generally
torch_eigen_results <- X_tensor |>
  torch::linalg_eigh()

print(torch_eigen_results)
```
Alternatively, if we have an NVIDIA GPU on our computer with CUDA enabled, we can allocate the data to the GPU by specifying torch::torch_device(type = "cuda"). This lets us use the GPU for potentially faster computations.
```r
library(torch)

# Create a torch 'tensor' object on the GPU
X_tensor <- X |>
  torch::torch_tensor(
    device = torch::torch_device(type = "cuda")
  )

# Compute the spectral decomposition
# NOTE: `linalg_eigh()` assumes the matrix is Hermitian;
# `linalg_eig()` can be used more generally
torch_eigen_results <- X_tensor |>
  torch::linalg_eigh()

print(torch_eigen_results)
```
The ability to use the GPU here can really help for large matrices. And whether we use the CPU or the GPU, we can control multithreading by simply calling a {torch} function.
```r
# Use 12 threads
torch_set_num_threads(12)

# Compute the spectral decomposition
torch_eigen_results <- X_tensor |>
  torch::linalg_eigh()

print(torch_eigen_results)
```
Option 4. Instead of BLAS / LAPACK, Use the Rust Library ‘faer’
The ‘faer’ library in Rust is an alternative to BLAS / LAPACK: it implements basic matrix computations entirely in Rust, sometimes using different default algorithms than in BLAS / LAPACK. You can use it from R to implement many different matrix computations. I’ll show an example of that, but first it’s important to note some key pros and cons of using the Rust ‘faer’ library from the perspective of someone using R for statistical work.
Pros of using ‘faer’:
Works nicely if you’re already doing things in Rust.
Rust is both safer and much nicer to work with than C++ or C in my limited experience. The tooling around it is much newer and user-friendly. Unlike with C++, you don’t have to muck around with frustrating, archaic tools like Valgrind.
The community of R developers for Rust is friendly and helpful, and there are lively conversations on the R-Rust Discord server.
Can be much faster than BLAS / LAPACK implementations.
Cons of using ‘faer’:
It’s relatively new and has much less adoption so far compared to older BLAS / LAPACK libraries, and so it may be harder to trust simply because it’s had fewer eyes on it looking for bugs.
The package’s documentation is overwhelming and can be hard to read.
It is 99% the work of one (brilliant and prolific) developer and so the library has a bus factor of approximately one. That’s OK for many open source projects, but for one of this complexity it’s a little concerning.
The cons ought to improve over time, but they’re important to be aware of. Anyway, on to the code.
Rust code can be made available in R using the {rextendr} package (described in a nice recent JOSS paper), which is analogous to the {cpp11} or {Rcpp} R packages that work with C++. You can write Rust code and immediately get an R function that runs that code.
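As a toy illustration (unrelated to the decomposition example below), rust_function() compiles a snippet of Rust and immediately exposes it as an R function:

```r
library(rextendr)

# Compile a one-line Rust function and make it callable from R
rust_function("fn add(a: f64, b: f64) -> f64 { a + b }")

add(1, 2)
```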
To illustrate, here’s Rust code that implements a spectral decomposition using ‘faer’.
```rust
use extendr_api::prelude::*;
use faer::{mat, Mat, MatRef, Col};

#[extendr]
fn eigen_faer(x: Mat<f64>) -> List {
    // Eigendecomposition
    let evd = x.selfadjoint_eigendecomposition(faer::Side::Lower);

    // Extract matrix of eigenvectors and column matrix of eigenvalues
    // (Note that 'faer' uses ascending order of eigenvalues,
    //  so we rearrange the output to be in descending order,
    //  for consistency with R)
    let eigenvectors: Mat<f64> = evd.u().reverse_cols().to_owned();
    let eigenvalues: Col<f64> = evd.s().column_vector().reverse_rows().to_owned();

    // Reformat eigenvalues to a vector instead of a one-column matrix
    let eigenvalues: Vec<f64> = (0..eigenvalues.nrows())
        .map(|i| eigenvalues.read(i))
        .collect();

    list!(values = eigenvalues, vectors = eigenvectors)
}
```
This code is saved in a file named "r-eigendecomposition.rs", which we can refer to from R. The following R code compiles and loads the Rust function, named eigen_faer(), so that we can call it from R.
```r
library(rextendr)

# Compile the Rust code and make it available as a function in R
rust_source(
  code = readLines("r-eigendecomposition.rs"),
  dependencies = list("faer" = "*"),
  features = "faer",
  use_dev_extendr = TRUE,
  profile = "release"
)
```
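Before benchmarking, it's worth a quick sanity check that the compiled function agrees with base R's eigen(). Since eigenvectors are only identified up to sign, comparing the eigenvalues is the simpler check:

```r
# Compare eigenvalues from the Rust implementation against base R
rust_result <- eigen_faer(X)
base_result <- eigen(X)

all.equal(rust_result$values, base_result$values)
```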
So far I've found Rust to be much nicer to work with than C++. With Rust, you don't run into dreaded memory errors, and you get much nicer tooling, including far easier-to-read messages from the compiler when you inevitably make mistakes. The R interface package {rextendr} works quite nicely. The main drawbacks are the steep learning curves of Rust and of the 'faer' library. And if you're a package developer, there are currently some unnecessary headaches with keeping a Rust-based package on CRAN while the CRAN team irons out its process for running automated checks on packages that use Rust. But those problems should get better over time.
What I found particularly appealing is that the community of R developers working with Rust is quite helpful and friendly. The Discord server for R and Rust developers is an excellent resource, and I'm grateful for helpful suggestions I got there from Josiah Parry and Mossa Merhi Reimert while learning how to use {rextendr} and 'faer'.
Some Informal Benchmarks
Just to get a rough sense of how these different options compare, let’s do a little informal benchmarking. This is done on my personal laptop, which has an AMD Ryzen processor and an NVIDIA GPU. We’ll conduct a spectral decomposition of a symmetric matrix with 2,000 rows and columns.
Here’s the example matrix.
```r
# Create an example dataset
X <- cov(matrix(rnorm(2000^2), nrow = 2000, ncol = 2000))
```
Here’s how long it takes for the spectral decomposition using the base BLAS / LAPACK version that came installed with R: about 5-6 seconds.
```r
microbenchmark('Default BLAS / LAPACK' = eigen(X), times = 10)
#> Unit: milliseconds
#>                   expr       min        lq      mean    median        uq       max neval
#>  Default BLAS / LAPACK 5546.7500 5579.2213 5658.1628 5685.0989 5711.3217 5769.0550    10
```
Now let's re-run after replacing the default BLAS / LAPACK installation with OpenBLAS. We'll also run the computations using {torch}.
```r
# Check the number of threads used by OpenBLAS
Sys.getenv("OPENBLAS_NUM_THREADS")
#> [1] "1"

# Tell 'torch' to use 1 thread
torch_set_num_threads(1)

# Benchmark different single-threaded computation approaches
library(microbenchmark)
microbenchmark(
  'OpenBLAS - 1 thread' = eigen(X),
  'torch CPU - 1 thread' = X |>
    torch_tensor(device = torch_device("cpu")) |>
    linalg_eigh(),
  'torch GPU - 1 thread' = X |>
    torch_tensor(device = torch_device("cuda")) |>
    linalg_eigh(),
  times = 10
)
#> Unit: milliseconds
#>                  expr      min       lq     mean   median       uq      max neval
#>   OpenBLAS - 1 thread 832.5615 845.6398 867.3071 849.2613 854.4560 951.0436    10
#>  torch CPU - 1 thread 752.0325 758.6254 767.7870 768.4964 776.0672 779.7245    10
#>  torch GPU - 1 thread  88.7192  89.8873 150.1024 102.8186 182.8322 335.6845    10
```
We can see that using OpenBLAS drastically improved the speed of the computations: this runs more than six times faster than the default BLAS / LAPACK version. Using {torch} with the CPU was only slightly faster, but {torch} with the GPU was much faster.
Next, we’ll compare the speed of three multithreaded options: OpenBLAS, {torch} with the data on the CPU, and {torch} with the data on the GPU.
```r
library(microbenchmark)

# Check the number of threads used by OpenBLAS
Sys.getenv("OPENBLAS_NUM_THREADS")
#> [1] "12"

# Tell 'torch' to use 12 threads
torch_set_num_threads(12)

# Benchmark different multithreaded computation approaches
microbenchmark(
  'OpenBLAS - 12 threads' = eigen(X),
  'torch CPU - 12 threads' = torch_tensor(X, device = torch_device("cpu")) |>
    linalg_eigh(),
  'torch GPU - 12 threads' = torch_tensor(X, device = torch_device("cuda")) |>
    linalg_eigh(),
  times = 10
)
#> Unit: milliseconds
#>                    expr      min       lq     mean   median       uq      max neval
#>   OpenBLAS - 12 threads 620.7494 660.2177 677.7333 670.1151 702.8616 763.3497    10
#>  torch CPU - 12 threads 244.9438 251.1112 260.9905 262.8976 269.5531 273.9076    10
#>  torch GPU - 12 threads  68.4853  94.1227 151.9435 109.2608 179.2842 342.6218    10
```
Multithreading only made a substantial improvement when we used {torch} with the data on the CPU.
Now let’s see how fast the Rust implementation is.
```r
rust_benchmark <- microbenchmark('Rust (faer)' = eigen_faer(X), times = 10)
print(rust_benchmark)
#> Unit: milliseconds
#>         expr      min       lq    mean   median       uq      max neval
#>  Rust (faer) 579.6631 602.2368 677.074 710.0773 717.8803 727.6706    10
```
This performance is comparable to OpenBLAS, maybe a bit faster.
Below, we can see how all these various methods compare in terms of computation time on my computer for this application.
```r
library(gt)
library(gtExtras)
library(dplyr)

# Load collected data from the microbenchmarks
benchmark_medians <- tibble::tribble(
  ~Method,                  ~Milliseconds,
  "Default BLAS / LAPACK",  5685.0989,
  "Rust (faer)",            710.0773,
  "OpenBLAS - 1 thread",    849.2613,
  "torch CPU - 1 thread",   768.4964,
  "torch GPU - 1 thread",   102.8186,
  "OpenBLAS - 12 threads",  670.1151,
  "torch CPU - 12 threads", 262.8976,
  "torch GPU - 12 threads", 109.2608
)

# Use the `gt`/`gtExtras` packages for visualization
benchmark_medians |>
  mutate(
    ` ` = Milliseconds,
    Milliseconds = scales::comma(Milliseconds)
  ) |>
  gt() |>
  gt_plt_bar(
    column = ` `,
    keep_column = FALSE,
    width = 35,
    color = schneidr::schneidr_purple()
  )
```
| Method                 | Milliseconds |
|------------------------|--------------|
| Default BLAS / LAPACK  | 5,685.1      |
| Rust (faer)            | 710.1        |
| OpenBLAS - 1 thread    | 849.3        |
| torch CPU - 1 thread   | 768.5        |
| torch GPU - 1 thread   | 102.8        |
| OpenBLAS - 12 threads  | 670.1        |
| torch CPU - 12 threads | 262.9        |
| torch GPU - 12 threads | 109.3        |
Takeaways
The main takeaway here is that you should take some time to update your R installation to use a better BLAS / LAPACK version than the one that comes installed with R by default. If you’ve never done it before, it might take you 15 to 20 minutes. But if you’ve done it before, it only takes about five minutes, and it makes a huge difference in performance for many applications.
OpenBLAS is a nice default choice of BLAS / LAPACK to install, but you might get better results with Accelerate on a macOS machine or with Intel MKL if you have an Intel CPU.
If your computer has an NVIDIA GPU, the 'torch' GPU option can be worthwhile, although you may get slight differences in results for computations conducted on the GPU versus the CPU. So if you go that route, you'll want to consider the context of your work and determine whether those differences are acceptable; one rough way to check is sketched below.
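Here's that check, assuming an available CUDA device: compute the decomposition on both devices and compare the eigenvalues after moving the GPU results back to the CPU.

```r
library(torch)

# Compute the decomposition on both the CPU and the GPU
cpu_result <- linalg_eigh(torch_tensor(X, device = torch_device("cpu")))
gpu_result <- linalg_eigh(torch_tensor(X, device = torch_device("cuda")))

# Move the GPU eigenvalues back to the CPU and compare
cpu_values <- as_array(cpu_result[[1]])
gpu_values <- as_array(gpu_result[[1]]$cpu())
max(abs(cpu_values - gpu_values))
```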
If you're developing complex, high-performance software, you can consider trying out a low-level language like C++ or Rust, using R interface packages like {rextendr} or {cpp11}. That can be very effective, albeit at the cost of a steep learning curve.
So, nu, if you haven’t yet, go ahead and update your BLAS / LAPACK version, why don’t you.