Conversion

rpy2 uses a conversion system in which rules for sharing objects between Python and R can be specified (see the rpy2 documentation about conversion).

Rules for Arrow data structures can be specified, and the package rpy2-arrow already implements the bulk of what is required. The only remaining part is combining them for your own needs.
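
As a rough mental model, conversion rules can be pictured as functions registered per Python type, with the rule for the most specific registered type being the one applied. The sketch below illustrates that idea with functools.singledispatch from the standard library. The names py2r_sketch and Celsius are hypothetical, the returned tuples are placeholders rather than R objects, and none of this is rpy2's own API; only the decorator form is meant to resemble the @conv.py2rpy.register(pd.DataFrame) call used later on this page.

```python
from functools import singledispatch


# Hypothetical stand-in for a set of Python-to-R conversion rules.
# The names in this sketch are illustrative and are not part of rpy2.
@singledispatch
def py2r_sketch(obj):
    # Fallback rule: reached when no rule is registered for the type.
    raise NotImplementedError(
        f"No conversion rule for {type(obj).__name__}"
    )


@py2r_sketch.register(int)
def _(obj):
    # Rule for Python integers (placeholder result, not an R object).
    return ("integer vector", [obj])


@py2r_sketch.register(str)
def _(obj):
    # Rule for Python strings (placeholder result, not an R object).
    return ("character vector", [obj])


class Celsius(int):
    """A subclass of int with no conversion rule of its own."""


print(py2r_sketch(3))
print(py2r_sketch("abc"))
# No rule is registered for Celsius, so the rule for int applies.
print(py2r_sketch(Celsius(21)))
```

Registering a rule for a more specific type, the way this page later does for pd.DataFrame, is then enough to change how objects of that type are handled while every other rule stays in place.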

Faster pandas-R conversions

This example of custom conversion with jupyter and the “R magic” in rpy2.ipython demonstrates how Arrow can greatly improve performance when moving data between Python and R.

Note

This section is a jupyter notebook. It can be downloaded as a notebook from here.

We create a test pandas.DataFrame. The size is set to show a noticeable effect without waiting too long for the slowest conversion on the laptop the notebook ran on. Feel free to change the variable _N to whatever best suits your hardware and your patience.

In [1]: import pandas as pd
   ...: # Number of rows in the DataFrame (keep it even: 'y' has 2 * (_N//2) values).
   ...: _N = 500000
   ...: pd_dataf = pd.DataFrame({'x': range(_N),
   ...:                          'y': ['abc', 'def'] * (_N//2)})
   ...: 

Next we load the ipython/jupyter extension in rpy2 to communicate with R in a (Python) notebook.

In [2]: %load_ext rpy2.ipython

With the extension loaded, the DataFrame can be imported into an R cell (declared with %%R) using the argument -i. On the machine where the notebook was written, it takes a few seconds for the conversion system to create a copy of it in R.

In [3]: %%time
   ...: %%R -i pd_dataf
   ...: print(head(pd_dataf))
   ...: rm(pd_dataf)
   ...: 
  x   y
0 0 abc
1 1 def
2 2 abc
3 3 def
4 4 abc
5 5 def
CPU times: user 2.75 s, sys: 34.2 ms, total: 2.78 s
Wall time: 2.8 s

From pandas.DataFrame to R data.frame through an Arrow Table

The conversion of a pandas.DataFrame can be accelerated by using Apache Arrow as an intermediate step. The package pyarrow uses compiled code to go efficiently from a pandas.DataFrame to an Arrow data structure, and the R package arrow can do the same from an Arrow data structure to an R data.frame.

The package rpy2-arrow can help manage the conversion between Python wrappers to Arrow data structures (Python package pyarrow) and R wrappers to Arrow data structures (R package arrow). Creating a custom converter for rpy2 is done in a few lines of code.

In [4]: import pyarrow
   ...: from rpy2.robjects.packages import importr
   ...: import rpy2.robjects.conversion
   ...: import rpy2.rinterface
   ...: import rpy2_arrow.arrow as pyra
   ...: 
   ...: base = importr('base')
   ...: 
   ...: # We use the converter included in rpy2-arrow as template.
   ...: conv = rpy2.robjects.conversion.Converter(
   ...:     'Pandas to data.frame',
   ...:     template=pyra.converter)
   ...: 
   ...: @conv.py2rpy.register(pd.DataFrame)
   ...: def py2rpy_pandas(dataf):
   ...:     pa_tbl = pyarrow.Table.from_pandas(dataf)
   ...:     # pa_tbl is a pyarrow table, and this is something
   ...:     # that the converter shipping with rpy2-arrow knows
   ...:     # how to handle.
   ...:     return base.as_data_frame(pa_tbl)
   ...: 
   ...: # We build a custom converter that is the default converter
   ...: # for ipython/jupyter shipping with rpy2, to which we add
   ...: # rules for Arrow + pandas we just made.
   ...: conv = rpy2.ipython.rmagic.converter + conv
   ...: 

Our custom converter conv can be specified as a parameter to %%R:

In [5]: %%time
   ...: %%R -i pd_dataf -c conv
   ...: print(class(pd_dataf))
   ...: print(head(pd_dataf))
   ...: rm(pd_dataf)
   ...: 
[1] "data.frame"
  x   y
1 0 abc
2 1 def
3 2 abc
4 3 def
5 4 abc
6 5 def
CPU times: user 32.8 ms, sys: 3.68 ms, total: 36.4 ms
Wall time: 40.2 ms

The conversion is much faster: about 40 ms of wall time instead of 2.8 s, roughly a 70-fold improvement on this machine.

From pandas.DataFrame to an Arrow Table visible to R

It is also possible to only convert to an Arrow data structure.

In [6]: conv2 = rpy2.robjects.conversion.Converter(
   ...:     'Pandas to pyarrow',
   ...:     template=pyra.converter)
   ...: 
   ...: @conv2.py2rpy.register(pd.DataFrame)
   ...: def py2rpy_pandas(dataf):
   ...:     pa_tbl = pyarrow.Table.from_pandas(dataf)
   ...:     return pyra.converter.py2rpy(pa_tbl)
   ...: 
   ...: conv2 = rpy2.ipython.rmagic.converter + conv2
   ...: 
In [7]: %%time
   ...: %%R -i pd_dataf -c conv2
   ...: print(head(pd_dataf))
   ...: rm(pd_dataf)
   ...: 
Table
6 rows x 2 columns
$x <int64>
$y <string>

See $metadata for additional Schema metadata
CPU times: user 70.5 ms, sys: 3.85 ms, total: 74.3 ms
Wall time: 74.2 ms

This time the conversion is about as fast (still tens of milliseconds) but likely requires less memory. When casting the Arrow table into an R data.frame, I believe there is a moment where copies of the data coexist in the Python DataFrame, in the Arrow table, and in the R data.frame. This is transient though; with conv, the Arrow table only exists within the scope of py2rpy_pandas. With conv2, the data is only copied once: it coexists in the Python DataFrame and in the Arrow table (the content of which will be shared between Python and R, if I understand it right).

The R package arrow implements methods for Arrow data structures that make their behavior close to that of data.frame objects. This can make Arrow tables work with R functions designed for data frames, and bring very significant performance gains. Combined with rpy2-arrow, this means that Arrow tables accessed or created from Python can be used with R code without the performance penalty of copying data, and with the possible performance gain that the R package arrow may bring for such data structures. For example, with the R package dplyr:

In [8]: %%R
   ...: suppressMessages(require(dplyr))
   ...: 

pyarrow.lib.Table shared across Python and R

An even more performant solution is to share an Arrow Table between Python and R. The package rpy2-arrow has a converter (rpy2_arrow.arrow.converter, imported above as pyra.converter) to do just that.

In [9]: tbl = pyarrow.lib.Table.from_pandas(pd_dataf)
In [10]: %%time
   ....: %%R -i tbl -c pyra.converter
   ....: print(head(tbl))
   ....: rm(tbl)
   ....: 
Table
6 rows x 2 columns
$x <int64>
$y <string>

See $metadata for additional Schema metadata
CPU times: user 5.53 ms, sys: 0 ns, total: 5.53 ms
Wall time: 5.54 ms

At the time of writing, and based on the wall times shown on this page (2.8 s versus 5.54 ms), this is roughly 500 times faster than the pandas.DataFrame to R data.frame conversion performed without Arrow presented at the beginning of this page.
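
Speed-ups can be recomputed from the wall times printed by the cells on this page. The sketch below copies those four values by hand into a dictionary (wall_times_s, with labels made up for the illustration) and divides the baseline by each of them. Timings vary between runs and machines, so the ratios are indicative only.

```python
# Wall times in seconds, copied from the cell outputs on this page.
wall_times_s = {
    'default (no Arrow)': 2.8,
    'conv (Arrow, then R data.frame)': 40.2e-3,
    'conv2 (Arrow Table in R)': 74.2e-3,
    'pyra.converter (shared Arrow Table)': 5.54e-3,
}

baseline = wall_times_s['default (no Arrow)']

# Speed-up of each approach relative to the conversion without Arrow.
speedups = {name: baseline / t for name, t in wall_times_s.items()}

for name, ratio in speedups.items():
    print(f"{name}: about {ratio:.0f}x")
```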