Making R fly with Rcpp
What should we do when we need to make our script run faster? Usually the first idea we have is optimizing the code: reducing the amount of loops, making data structures smaller, using parallel computing, etc… But when it comes to R, we can speed up our code almost without changing its structure.
In this post I’ll give you a basic overview of the Rcpp
package, a tool that allows us
to run C++ code from within R.
The basics
C++ is a very famous and versatile programming language. As Wikipedia points out, “it has imperative, object-oriented and generic programming features, while also providing facilities for low-level memory manipulation”.
An interesting characteristic of C++ is that it is extremely fast. Unlike R, it is a compiled language, with static typing and that doesn’t supply the user with too many abstractions, allowing the code to execute with impressive efficiency.
To explore the benefits that C++ can bring to your R code, install and load the Rcpp
package with the commands below:
install.packages("Rcpp")
library(Rcpp)
Now let’s take a look at a simple example of how to call C++ code from within R. The easiest
way to do this is with the cppFunction()
: it takes a string that will be interpreted as
a C++ function.
add_r <- function(x, y, z) {
sum = x + y + z
return(sum)
}
cppFunction(
"int add_c(int x, int y, int z) {
int sum = x + y + z;
return sum;
}")
add_r(1, 2, 3)
#> [1] 6
add_c(1, 2, 3)
#> [1] 6
As we can see in the example above, both functions have the same behavior despite some
superficial differences. Note how we always have to declare the types of the variables in
C++! Using the int
keyword we make it clear to the compiler that a variable will be of
type integer and even that a function will return a value of type integer. Another important
thing to remember is that we need to add a semi-colon to the end of every C++ command.
Vectors
Normally C++ would have enormous differences in relation to R regarding how they deal with
vectors, but (lucky us!) Rcpp
has a library of structures that abstract R’s behavior. In
the next example we have a function that takes a number and a numeric vector, computes the
euclidean distance between the value and the vector and returns a numeric vector as output.
dist_r <- function(x, ys) {
sqrt((x - ys) ^ 2)
}
cppFunction(
"NumericVector dist_c(double x, NumericVector ys) {
int n = ys.size();
NumericVector out(n);
for(int i = 0; i < n; i++) {
out[i] = sqrt(pow(ys[i] - x, 2.0));
}
return out;
}")
dist_r(10, 20:25)
#> [1] 10 11 12 13 14 15
dist_c(10, 20:25)
#> [1] 10 11 12 13 14 15
The NumericVector
structure abstracts a numeric vector from R, allowing us to work with
it in a more familiar manner. With the .size()
method we get its length and can declare a
new one with the NumericVector name(length);
constructor. The only fundamentally different
thing between C++ and R is that the first doesn’t have properly vectorized operations,
meaning that we have to use loops for every iteration.
Maximum speed
Certain aspects of R’s philosophy make it an extremely versatile language, but this has its price. Some areas where R’s performance is a bit lacking are non-vectorizable loops (when an iteration depends on the previous one), recursive functions, and complex data structures.
In these and many other situations, using C++ can be very advantageous. In the following example we’ll see the difference in performance between a loop written in C++ and one in R; note that this isn’t even one of the 3 situations mentioned above and that the C++ code is still 6 times faster.
sum_r <- function(v) {
total <- 0
for (e in v) {
if (e < 0) { total = total - e }
else if (e > 0.75) { total = total + e/2 }
else { total = total + e }
}
return(total)
}
cppFunction(
"double sum_c(NumericVector v) {
double total = 0;
for (int i = 0; i < v.size(); i++) {
if (v[i] < 0) { total -= v[i]; }
else if (v[i] > 0.75) { total += v[i]/2; }
else { total += v[i]; }
}
return(total);
}")
v <- runif(100000, -1, 1)
microbenchmark::microbenchmark(sum_r(v), sum_c(v))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> soma_r(v) 6.105048 6.436608 6.911819 6.718456 7.183266 11.610624 100
#> soma_c(v) 1.045805 1.063956 1.161585 1.097920 1.210052 1.955702 100
Obs.: The symbols +=
and -=
stand for a = a +/- b
, while ++
stands for
a = a + 1
.
Conclusion
With the Rcpp
package we can run C++ code from within R. With this technique we can
optimize our code or even have access to complex data structures from C++.
To learn more about this subject take a look at the tutorial written by Hadley Wickham in the book Advanced R. I also recommend Rcpp
’s
own web page and its extensive gallery of examples.
P.S.: If you want the complete code for this tutorial, I’ve made it available as a Gist.