Interoperability with pandas¶
This section of the documentation focuses on the practical use of the conversion helper for pandas. Converting to and from pandas.DataFrame can create non-negligible overhead: the C-level representations of the underlying arrays may differ between Python and R, which creates the need to copy data from one representation to the other. This is the case for arrays of strings, for example. Using a local converter to limit the scope of conversions, as shown here, is recommended.
For more information about the conversion mechanism, check the more general documentation about rpy2.robjects.conversion.
Note
This section is available as a jupyter notebook pandas.ipynb (HTML render: pandas.html)
from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")
R and pandas data frames¶
R data.frame and pandas.DataFrame objects share a lot of conceptual similarities, and pandas chose to name its class DataFrame after the R objects.
In a nutshell, both are sequences of vectors (or arrays) of consistent length or size for the first dimension (the “number of rows”). If you come from the database world, another way to look at them is as column-oriented data tables, or as a data table API.
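The idea of a data frame as a set of named columns of equal length can be sketched in plain Python. This is a toy model only: the class name `ColumnTable` is hypothetical, and the code reflects neither how pandas nor how R stores data frames internally.

```python
class ColumnTable:
    """Toy column-oriented table: named columns of equal length."""

    def __init__(self, columns):
        # columns: mapping of column name -> sequence of values
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError(
                f"columns have inconsistent lengths: {sorted(lengths)}"
            )
        self.columns = {name: list(values) for name, values in columns.items()}
        self.nrows = lengths.pop() if lengths else 0

    def row(self, i):
        # A row is assembled on demand by picking item i from each column.
        return {name: values[i] for name, values in self.columns.items()}


table = ColumnTable({"int_values": [1, 2, 3],
                     "str_values": ["abc", "def", "ghi"]})
print(table.nrows)   # 3
print(table.row(1))  # {'int_values': 2, 'str_values': 'def'}
```

Storing data by column keeps each column homogeneous, which is what makes a column-by-column conversion between the two libraries possible in the first place.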
rpy2 provides an interface between Python and R, and a convenience conversion layer between rpy2.robjects.vectors.DataFrame and pandas.DataFrame objects, implemented in rpy2.robjects.pandas2ri.
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
From pandas to R¶
Pandas data frame:
pd_df = pd.DataFrame({'int_values': [1, 2, 3],
                      'str_values': ['abc', 'def', 'ghi']})
pd_df
 | int_values | str_values |
---|---|---|
0 | 1 | abc |
1 | 2 | def |
2 | 3 | ghi |
R data frame converted from a pandas data frame:
with (ro.default_converter + pandas2ri.converter).context():
    r_from_pd_df = ro.conversion.get_conversion().py2rpy(pd_df)
r_from_pd_df
int_values | str_values |
---|---|
... | ... |
The conversion happens automatically when calling R functions, for example the R function base::summary:
base = importr('base')
with (ro.default_converter + pandas2ri.converter).context():
    df_summary = base.summary(pd_df)
print(df_summary)
int_values str_values
Min. :1.0 Length:3
1st Qu.:1.5 Class :character
Median :2.0 Mode :character
Mean :2.0
3rd Qu.:2.5
Max. :3.0
Note that a context manager is used to limit the scope of the conversion. Without it, rpy2 will not know how to convert a pandas data frame:
try:
    df_summary = base.summary(pd_df)
except NotImplementedError as nie:
    print('NotImplementedError:')
    print(nie)
NotImplementedError:
Conversion 'py2rpy' not defined for objects of type '<class 'pandas.core.frame.DataFrame'>'
From R to pandas¶
Starting from an R data frame this time:
r_df = ro.DataFrame({'int_values': ro.IntVector([1, 2, 3]),
                     'str_values': ro.StrVector(['abc', 'def', 'ghi'])})
r_df
int_values | str_values |
---|---|
... | ... |
It can be converted to a pandas data frame using the same converter:
with (ro.default_converter + pandas2ri.converter).context():
    pd_from_r_df = ro.conversion.get_conversion().rpy2py(r_df)
pd_from_r_df
 | int_values | str_values |
---|---|---|
1 | 1 | abc |
2 | 2 | def |
3 | 3 | ghi |
Date and time objects¶
pd_df = pd.DataFrame({
    'Timestamp': pd.date_range('2017-01-01 00:00:00', periods=10, freq='s')
})
pd_df
 | Timestamp |
---|---|
0 | 2017-01-01 00:00:00 |
1 | 2017-01-01 00:00:01 |
2 | 2017-01-01 00:00:02 |
3 | 2017-01-01 00:00:03 |
4 | 2017-01-01 00:00:04 |
5 | 2017-01-01 00:00:05 |
6 | 2017-01-01 00:00:06 |
7 | 2017-01-01 00:00:07 |
8 | 2017-01-01 00:00:08 |
9 | 2017-01-01 00:00:09 |
with (ro.default_converter + pandas2ri.converter).context():
    r_from_pd_df = ro.conversion.py2rpy(pd_df)
r_from_pd_df
Timestamp |
---|
... |
The time zone used for conversion is the system’s default time zone, unless rpy2.robjects.vectors.default_timezone is specified or the time zone is specified in the original time object:
pd_tz_df = pd.DataFrame({
    'Timestamp': pd.date_range('2017-01-01 00:00:00', periods=10, freq='s',
                               tz='UTC')
})
with (ro.default_converter + pandas2ri.converter).context():
    r_from_pd_tz_df = ro.conversion.py2rpy(pd_tz_df)
r_from_pd_tz_df
Timestamp |
---|
... |
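The difference between timestamps that carry their own time zone and timestamps that do not can be illustrated with the standard library alone. This sketch does not call rpy2: the helper `to_utc` and the variable `assumed_tz` are hypothetical, with `assumed_tz` standing in for whatever default time zone a conversion layer would apply.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default time zone applied to timestamps that lack one (UTC-5).
assumed_tz = timezone(timedelta(hours=-5))

naive = datetime(2017, 1, 1, 0, 0, 0)                       # no time zone
aware = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)  # explicitly UTC


def to_utc(ts, default_tz):
    """Interpret a naive timestamp in `default_tz`; keep aware ones as given."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=default_tz)
    return ts.astimezone(timezone.utc)


print(to_utc(naive, assumed_tz))  # 2017-01-01 05:00:00+00:00
print(to_utc(aware, assumed_tz))  # 2017-01-01 00:00:00+00:00
```

The same wall-clock value maps to two different instants depending on whether a time zone was attached, which is why attaching one explicitly makes a conversion independent of the machine it runs on.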