Conversion
rpy2 uses a conversion system in which the rules for sharing objects
between Python and R can be specified (see the rpy2 documentation about
conversion). Rules for Arrow data structures can be specified, and the
package rpy2-arrow already implements the bulk of what is required.
The only remaining part is combining them to suit your own needs.
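The idea of type-based conversion rules that can be combined can be sketched in plain Python, without rpy2 or R. This is only an illustration under stated assumptions: the helper names `make_converter` and `combine` are hypothetical, the returned strings stand in for R objects, and the precedence chosen here (rules from the second operand win) is a choice made for this sketch, not a statement about how rpy2 resolves conflicts.

```python
from functools import singledispatch


def make_converter():
    """Return a fresh dispatcher whose default rule rejects unknown types."""
    @singledispatch
    def convert(obj):
        raise NotImplementedError(
            f"no conversion rule for {type(obj).__name__}")
    return convert


def combine(first, second):
    """Build a new dispatcher with the rules of both (hypothetical helper).

    In this sketch, rules from `second` override rules from `first`.
    """
    combined = make_converter()
    for source in (first, second):
        for cls, rule in source.registry.items():
            if cls is not object:  # keep the default rule of `combined`
                combined.register(cls, rule)
    return combined


# Strings stand in for R objects; nothing here talks to R.
base_rules = make_converter()
base_rules.register(int, lambda obj: f"R integer {obj}")
base_rules.register(str, lambda obj: f"R character '{obj}'")

extra_rules = make_converter()
extra_rules.register(list, lambda obj: f"R list of length {len(obj)}")
extra_rules.register(str, lambda obj: f"R factor level '{obj}'")

conv_demo = combine(base_rules, extra_rules)
print(conv_demo(3))       # rule inherited from base_rules
print(conv_demo("abc"))   # rule overridden by extra_rules
print(conv_demo([1, 2]))  # rule added by extra_rules
```

The rule that runs is selected from the Python type of the object, which is why registering a new rule for one type is enough to change how objects of that type are handled.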
Faster pandas-R conversions
This example of custom conversion with Jupyter and the “R magic” in
rpy2.ipython demonstrates how Arrow can greatly improve performance
when moving data between Python and R.
Note
This section is a jupyter notebook. It can be downloaded as a notebook from here.
We create a test pandas.DataFrame. The size is set to show a
noticeable effect without waiting too long for the slowest
conversion on the laptop the notebook ran on. Feel free to change the
variable _N to whatever best suits your hardware and your patience.
In [1]: import pandas as pd
...: # Number of rows in the DataFrame.
...: _N = 500000
...: pd_dataf = pd.DataFrame({'x': range(_N),
...:                          'y': ['abc', 'def'] * (_N//2)})
...:
Next we load the IPython/Jupyter extension shipped with rpy2 to communicate with R from a (Python) notebook.
In [2]: %load_ext rpy2.ipython
With the extension loaded, the DataFrame can be imported into an R cell
(declared with %%R) using the argument -i. On the machine where the
notebook was written, it takes a few seconds for the conversion system
to create a copy of it in R.
In [3]: %%time
...: %%R -i pd_dataf
...: print(head(pd_dataf))
...: rm(pd_dataf)
...:
x y
0 0 abc
1 1 def
2 2 abc
3 3 def
4 4 abc
5 5 def
CPU times: user 2.74 s, sys: 44.3 ms, total: 2.79 s
Wall time: 2.81 s
From pandas.DataFrame to R data.frame through an Arrow Table
The conversion of a pandas.DataFrame can be accelerated by using
Apache Arrow as an intermediate step. The package pyarrow uses
compiled code to go efficiently from a pandas.DataFrame to an Arrow
data structure, and the R package arrow can do the same from an Arrow
data structure to an R data.frame.
The package rpy2-arrow can help manage the conversion between Python
wrappers to Arrow data structures (Python package pyarrow) and R
wrappers to Arrow data structures (R package arrow). Creating a
custom converter for rpy2 takes only a few lines of code.
In [4]: import pyarrow
...: from rpy2.robjects.packages import importr
...: import rpy2.robjects.conversion
...: import rpy2.rinterface
...: import rpy2_arrow.arrow as pyra
...:
...: base = importr('base')
...:
...: # We use the converter included in rpy2-arrow as template.
...: conv = rpy2.robjects.conversion.Converter(
...:     'Pandas to data.frame',
...:     template=pyra.converter)
...:
...: @conv.py2rpy.register(pd.DataFrame)
...: def py2rpy_pandas(dataf):
...:     pa_tbl = pyarrow.Table.from_pandas(dataf)
...:     # pa_tbl is a pyarrow table, and this is something
...:     # that the converter shipping with rpy2-arrow knows
...:     # how to handle.
...:     return base.as_data_frame(pa_tbl)
...:
...: # We build a custom converter that is the default converter
...: # for ipython/jupyter shipping with rpy2, to which we add
...: # rules for Arrow + pandas we just made.
...: conv = rpy2.ipython.rmagic.converter + conv
...:
Our custom converter conv can be passed to %%R with the argument -c:
In [5]: %%time
...: %%R -i pd_dataf -c conv
...: print(class(pd_dataf))
...: print(head(pd_dataf))
...: rm(pd_dataf)
...:
[1] "data.frame"
x y
1 0 abc
2 1 def
3 2 abc
4 3 def
5 4 abc
6 5 def
CPU times: user 36.5 ms, sys: 3.86 ms, total: 40.3 ms
Wall time: 40.3 ms
The conversion is much faster: about 40 ms of wall time instead of
about 2.8 s on the same machine.
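As a quick worked figure, the speedup can be computed from the two wall times reported by the timed cells above. The numbers are copied from this notebook's output and will differ on other hardware; the variable names are only for illustration.

```python
# Wall times reported above, in seconds.
wall_default_s = 2.81    # default pandas conversion (%%R -i pd_dataf)
wall_arrow_s = 40.3e-3   # conversion through Arrow (%%R -i pd_dataf -c conv)

speedup = wall_default_s / wall_arrow_s
print(f"speedup: about {speedup:.0f}x")
```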
From pandas.DataFrame to an Arrow Table visible to R
It is also possible to only convert to an Arrow data structure.
In [6]: conv2 = rpy2.robjects.conversion.Converter(
...:     'Pandas to pyarrow',
...:     template=pyra.converter)
...:
...: @conv2.py2rpy.register(pd.DataFrame)
...: def py2rpy_pandas(dataf):
...:     pa_tbl = pyarrow.Table.from_pandas(dataf)
...:     return pyra.converter.py2rpy(pa_tbl)
...:
...: conv2 = rpy2.ipython.rmagic.converter + conv2
...:
In [7]: %%time
...: %%R -i pd_dataf -c conv2
...: print(head(pd_dataf))
...: rm(pd_dataf)
...:
Table
6 rows x 2 columns
$x <int64>
$y <string>
See $metadata for additional Schema metadata
CPU times: user 34.3 ms, sys: 94 us, total: 34.4 ms
Wall time: 34.2 ms
This time the conversion is about as fast, but it likely requires less
memory. When casting the Arrow table into an R data.frame, I
believe there is a moment when copies of the data coexist
in the Python DataFrame, in the Arrow table, and in the R
data.frame. This is transient though: for conv, the Arrow table only
exists within the scope of py2rpy_pandas. For conv2, the data is only
copied once. It coexists in the Python DataFrame and in the Arrow
table (the content of which is shared between Python and R, if I
understand it right).
The R package arrow implements methods for Arrow data structures that
make their behavior close to that of data.frame objects.
This can make Arrow tables work with R functions designed for data frames,
and bring very significant performance gains. In combination with
rpy2-arrow, this means that Arrow tables accessed or created
from Python can be used with R code without the performance penalty of
copying data, and with the possible performance gain that the R package
arrow may bring for such data structures. For example,
with the R package dplyr:
In [8]: %%R
...: suppressMessages(require(dplyr))
...: