For loop is several times faster in R than in Python using the rpy2 library



The following simple for loop takes about 3 seconds to complete in R:

library(MASS)
nruns <- 2000
nelems <- 50
maxX <- 1
maxY <- 1
for(i in 1:nruns) {
    dataX <- runif(nelems, 0, maxX)
    dataY <- runif(nelems, 0, maxY)
    kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY) )
}

The same code run from Python through the rpy2 library takes 4 to 5 times longer:

from rpy2.robjects import r
from rpy2.robjects.packages import importr
importr('MASS')

nruns = 2000
r.assign('nelems', 50)
r.assign('maxX', 1)
r.assign('maxY', 1)
for _ in range(nruns):
    r('dataX <- runif(nelems, 0, maxX)')
    r('dataY <- runif(nelems, 0, maxY)')
    r('kde2dmap <- kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY))')

Is this just because I’m using the rpy2 library to communicate with R or is there something else at play? Can this be improved in any way (while still running the code in Python)?

Answer

A 4 to 5 times slowdown seems like a lot, but it could happen if you are using custom conversion (rpy2 can convert R objects to arbitrary Python objects on the fly; see the documentation).

Or maybe you are on an HPC system where Python and its packages are installed on slow-ish NFS storage while R sits on faster local disks (this could make a big difference to the startup time).

Otherwise, one can also keep the loop in R to assess whether this changes the running time:

from rpy2.robjects import r
from rpy2.robjects.packages import importr

# importr('MASS')
# Calling 'importr' performs quite a bit of work behind the
# scenes. That work allows a more intuitive/pythonic use of the
# content of the R library "MASS", but if you are just passing
# a string to R for evaluation you can skip it and
# replace it with the following:
r('library("MASS")')

nruns = 2000
r.assign('nelems', 50)
r.assign('maxX', 1)
r.assign('maxY', 1)
r.assign('nruns', nruns)
r("""
for(i in 1:nruns) {
  dataX <- runif(nelems, 0, maxX)
  dataY <- runif(nelems, 0, maxY)
  kde2dmap <- kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY) )
}
""")

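To check whether moving the loop into R actually helps on your machine, each variant can be wrapped in a function and timed the same way. The following is only a minimal sketch using the standard library; `time_call`, `summarize` and `busy_loop` are hypothetical names introduced here, and `busy_loop` is a stand-in workload that you would replace with a function running one of the rpy2 variants.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to its minimum and median."""
    return {"min": min(durations), "median": statistics.median(durations)}


def busy_loop():
    # Stand-in workload (assumption): replace with a function that runs
    # the per-iteration r('...') version or the single r(""" ... """) version.
    total = 0
    for i in range(100_000):
        total += i * i
    return total


if __name__ == "__main__":
    print(summarize(time_call(busy_loop)))
```

Timing the whole function several times and looking at the minimum or median tends to be less sensitive to one-off costs such as R startup or package loading than a single run.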
Speed improvements will come from the following:

  • The code in your question passes Python strings to R in each iteration. Each time, that string has to be parsed (by R) before it can be evaluated. This can add up to noticeable overhead with long loops. In the code I am providing, the parsing is performed only once.

  • The code in @Parfait’s answer leverages the fact that importr() creates Python proxy objects for the R functions you want to use. However, there remains an overhead when creating the mapping with importr() (mappings are created for all objects in an R package), and at each iteration when going from Python to R (object checks and conversion, building of an R expression to be evaluated). Profiling the code would give you an exact breakdown of where the time is spent. There are ways to keep some of the pythonic aspects while retaining more performance. For example:

     import rpy2.rinterface as ri
     ri.initr()
     ri.baseenv['library']("MASS")
     # early bindings for R functions:
     runif = ri.globalenv.find('runif')
     kde2d = ri.globalenv.find('kde2d')
     # number of iterations (same value as in the question)
     nruns = 2000
     # create values that are constant in the loop as R objects
     # (low-level integer vectors from rpy2.rinterface)
     maxX = ri.IntSexpVector((1, ))
     maxY = ri.IntSexpVector((1, ))
     nelems = ri.IntSexpVector((50, ))
     # grid size for kde2d; kept separate from the sample size
     ngrid = ri.IntSexpVector((50, ))
     zero = ri.IntSexpVector((0, ))
     limits = ri.IntSexpVector((0, maxX[0], 0, maxY[0]))
     for i in range(nruns):
         dataX = runif(nelems, zero, maxX)
         dataY = runif(nelems, zero, maxY)
         kde2dmap = kde2d(dataX, dataY, n=ngrid, lims=limits)
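For the profiling mentioned above, Python's built-in cProfile is one option. This is only a sketch: `profile_call` and `stand_in_workload` are hypothetical names introduced here, and the workload is a placeholder for a function containing one of the rpy2 loops. Note that cProfile only sees Python-level calls, so time spent inside R itself is attributed to whichever Python call crosses into R.

```python
import cProfile
import io
import pstats


def profile_call(fn, top=10):
    """Profile a single call to fn() and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def stand_in_workload():
    # Placeholder (assumption): replace the body with the rpy2 loop to inspect.
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    print(profile_call(stand_in_workload))
```

Comparing the reports for the string-based loop, the importr()-based loop and the rinterface-based loop should show how much time goes to parsing, conversion and call setup on the Python side.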
    

An additional comment about performance: rpy2’s transition from a C extension to cffi has led to significant improvements in the structure of the code managing the dialog with R’s C API (and with that a number of tricky bugs were fixed), but at the temporary cost of performance here and there. Optimizations for speed are being progressively reintroduced.



Source: stackoverflow