DataFrames and dplyr

dplyr

Note

This section is available as a jupyter notebook dplyr.ipynb (HTML render: dplyr.html).

See also the section R language, as dplyr makes heavy use of unevaluated R language expressions.

from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")

dplyr in Python

We need 2 things for this:

1- A data frame (using one of R’s demo datasets).

from rpy2.robjects.packages import importr, data
datasets = importr('datasets')
mtcars_env = data(datasets).fetch('mtcars')
mtcars = mtcars_env['mtcars']

In addition to that, and because this tutorial is in a notebook, we initialize HTML rendering for R objects (pretty display of R data frames).

import rpy2.ipython.html
rpy2.ipython.html.init_printing()

2- dplyr

from rpy2.robjects.lib.dplyr import DataFrame
from rpy2.robjects import rl
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/rpy2/robjects/lib/dplyr.py:27: UserWarning: This was designed againt dplyr versions starting with 1.0 but you have 1.1.2
  warnings.warn(

With this we have the choice of chaining method calls (D3-style)

dataf = (
    DataFrame(mtcars)
    .filter(rl('gear>3'))
    .mutate(powertoweight=rl('hp*36/wt'))
    .group_by(rl('gear'))
    .summarize(mean_ptw=rl('mean(powertoweight)'))
)

dataf
DataFrame with 2 rows and 2 columns:
      gear  mean_ptw
0  1  4.0   1237.1266499803169
1  2  5.0   2574.0331639315027

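or with pipes (magrittr style). The pipe-based form is currently not working, so the example below is kept for reference and disabled with an if False guard.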

# currently no longer working
from rpy2.robjects.lib.dplyr import (filter,
                                     mutate,
                                     group_by,
                                     summarize)

if False:
    dataf = (DataFrame(mtcars) >>
             filter(rl('gear>3')) >>
             mutate(powertoweight=rl('hp*36/wt')) >>
             group_by(rl('gear')) >>
             summarize(mean_ptw=rl('mean(powertoweight)')))

    dataf
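
To make the semantics of the four verbs concrete, here is a minimal plain-Python sketch of the same pipeline, with no R involved. The rows are a hand-copied subset of mtcars (all five cars with gear equal to 5, two with gear 4, one with gear 3), which is an assumption of this sketch. Only the gear 5 group is complete, so only its mean should agree with the value shown earlier.

```python
from collections import defaultdict
from statistics import mean

# Hand-copied subset of mtcars as (hp, wt, gear); illustrative only.
rows = [
    (110, 2.620, 4),  # Mazda RX4
    (93, 2.320, 4),   # Datsun 710
    (110, 3.215, 3),  # Hornet 4 Drive
    (91, 2.140, 5),   # Porsche 914-2
    (113, 1.513, 5),  # Lotus Europa
    (264, 3.170, 5),  # Ford Pantera L
    (175, 2.770, 5),  # Ferrari Dino
    (335, 3.570, 5),  # Maserati Bora
]

# filter(gear > 3)
kept = [r for r in rows if r[2] > 3]

# mutate(powertoweight = hp * 36 / wt), then group_by(gear)
groups = defaultdict(list)
for hp, wt, gear in kept:
    groups[gear].append(hp * 36 / wt)

# summarize(mean_ptw = mean(powertoweight))
mean_ptw = {gear: mean(values) for gear, values in sorted(groups.items())}
print(mean_ptw)
```

The dplyr version expresses the same steps declaratively and leaves the evaluation to R.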

The function rl creates unevaluated R language objects, which are then consumed by the dplyr functions, just as they would be when using dplyr in R itself. This means that when writing mean(powertoweight) the R function mean() is used.
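
The idea of an unevaluated expression can be illustrated in pure Python. The sketch below is only an analogy and does not reflect how rpy2 or dplyr are implemented: compile() plays the role of rl by producing code that has not been run yet, and a hypothetical summarize_one helper later evaluates it with the column names bound to data.

```python
from statistics import mean

# Build an expression without evaluating it (rough analogue of `rl`).
expr = compile("mean(powertoweight)", "<expr>", "eval")


def summarize_one(expression, columns):
    """Evaluate `expression` with column names visible as variables.

    Hypothetical helper for this sketch, not part of rpy2.
    """
    scope = {"mean": mean, **columns}
    return eval(expression, {"__builtins__": {}}, scope)


# `powertoweight` does not exist when `expr` is created,
# only when the expression is finally evaluated.
result = summarize_one(expr, {"powertoweight": [10.0, 20.0, 60.0]})
print(result)  # 30.0
```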

Using a Python function instead is not difficult: R can call back into Python. To achieve this, we use the decorator rternalize.

# Define a Python function, and make
# it a function R can use through `rternalize`.
# (Despite its name, `mean_np` uses the standard
# library's `statistics` module rather than numpy.)
import statistics
from rpy2.rinterface import rternalize

@rternalize(signature=False)
def mean_np(x):
    return statistics.mean(x)

# Bind that function to a symbol in R's
# global environment
from rpy2.robjects import globalenv
globalenv['mean_np'] = mean_np

# Write a dplyr chain of operations,
# using our Python function `mean_np`
dataf = (
    DataFrame(mtcars)
    .filter(rl('gear>3'))
    .mutate(powertoweight=rl('hp*36/wt'))
    .group_by(rl('gear'))
    .summarize(mean_ptw=rl('mean(powertoweight)'),
               mean_np_ptw=rl('mean_np(powertoweight)'))
)

dataf
DataFrame with 2 rows and 3 columns:
      gear  mean_ptw            mean_np_ptw
0  1  4.0   1237.1266499803169  1237.1266499803169
1  2  5.0   2574.0331639315027  2574.0331639315027

It is also possible to carry this out without having to place the custom function in R’s global environment, although this is not straightforward.

# First we delete our Python callback from globalenv to
# ensure that the callback is picked up from our own
# environment rather than from the global one.
del globalenv['mean_np']

from rpy2.robjects import Environment
my_env = Environment()
my_env['mean_np'] = mean_np


# Create an rlang "quosure" object within
# a given environment. We use the R package
# rlang used by dplyr.
from rpy2.robjects.lib.dplyr import rlang

myquo = rlang.quo.rcall(
    [(None, rl('mean_np(rlang::enexpr(powertoweight))'))],
    environment=my_env
)

dataf = (
    DataFrame(mtcars)
    .filter(rl('gear>3'))
    .mutate(powertoweight=rl('hp*36/wt'))
    .group_by(rl('gear'))
    .summarize(
        mean_ptw=rl('mean(powertoweight)'),
        mean_np_ptw=myquo)
)

dataf
DataFrame with 2 rows and 3 columns:
      gear  mean_ptw            mean_np_ptw
0  1  4.0   1237.1266499803169  1237.1266499803169
1  2  5.0   2574.0331639315027  2574.0331639315027
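
A quosure pairs an expression with the environment in which its free names should be looked up. As a rough analogy in plain Python (this is not the rlang implementation, and the Quosure class and its evaluate method are hypothetical names for this sketch), the pairing can be modelled as follows.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Quosure:
    """An expression bundled with the environment used to resolve its names."""
    expr: str
    env: dict = field(default_factory=dict)

    def evaluate(self, data):
        # Names are resolved in the data first, then in the captured env.
        return eval(self.expr, {"__builtins__": {}, **self.env}, dict(data))


# The callback lives only in `my_env`, not in any global namespace.
my_env = {"mean_np": mean}
quo = Quosure("mean_np(powertoweight)", my_env)

value = quo.evaluate({"powertoweight": [1.0, 2.0, 6.0]})
print(value)  # 3.0
```

Because the environment travels with the expression, the caller does not need the callback to be reachable from a global namespace.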

Note: rpy2’s interface to dplyr implements a fix for the (non-?)issue 1323 (https://github.com/hadley/dplyr/issues/1323).

dplyr’s seamless translation of transformations to SQL, available whenever the data are in a database table, can be used directly. Since we are reusing the original implementation of dplyr, it just works.

from rpy2.robjects.lib.dplyr import dplyr
# in-memory SQLite database broken in dplyr's src_sqlite
# db = dplyr.src_sqlite(":memory:")
import tempfile
with tempfile.NamedTemporaryFile() as db_fh:
    db = dplyr.src_sqlite(db_fh.name)
    # copy the table to that database
    dataf_db = DataFrame(mtcars).copy_to(db, name="mtcars")
    res = (dataf_db
           .filter(rl('gear>3'))
           .mutate(powertoweight=rl('hp*36/wt'))
           .group_by(rl('gear'))
           .summarize(mean_ptw=rl('mean(powertoweight)')))
    print(res)
#
# Source:   SQL [2 x 2]
# Database: sqlite 3.41.2 [/tmp/tmpxo4an2ls]
   gear mean_ptw
  <dbl>    <dbl>
1     4    1237.
2     5    2574.

Since we are manipulating R objects, anything available to R is also available to us. To see the generated SQL code:

silent = dplyr.show_query(res)
<SQL>
SELECT gear, AVG(powertoweight) AS mean_ptw
FROM (
  SELECT *, (hp * 36.0) / wt AS powertoweight
  FROM mtcars
  WHERE (gear > 3.0)
)
GROUP BY gear
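
To see what this query computes independently of R, it can be run with Python's standard sqlite3 module. This is a minimal sketch under stated assumptions: the table holds a hand-copied subset of mtcars (only hp, wt and gear, with the gear 5 group complete), and an in-memory database is used here for convenience.

```python
import sqlite3

# Hand-copied subset of mtcars as (hp, wt, gear); illustrative only.
rows = [
    (110, 2.620, 4),  # Mazda RX4
    (93, 2.320, 4),   # Datsun 710
    (110, 3.215, 3),  # Hornet 4 Drive
    (91, 2.140, 5),   # Porsche 914-2
    (113, 1.513, 5),  # Lotus Europa
    (264, 3.170, 5),  # Ford Pantera L
    (175, 2.770, 5),  # Ferrari Dino
    (335, 3.570, 5),  # Maserati Bora
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mtcars (hp REAL, wt REAL, gear REAL)")
con.executemany("INSERT INTO mtcars VALUES (?, ?, ?)", rows)

# The query text is the one printed by show_query above.
query = """
SELECT gear, AVG(powertoweight) AS mean_ptw
FROM (
  SELECT *, (hp * 36.0) / wt AS powertoweight
  FROM mtcars
  WHERE (gear > 3.0)
)
GROUP BY gear
"""
result = con.execute(query).fetchall()
con.close()
print(result)
```

The inner SELECT corresponds to the filter and mutate steps, and the outer one to group_by and summarize.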

The conversion rules in rpy2 make the above easily applicable to pandas data frames, completing the “lexical loan” of the dplyr vocabulary from R.

from rpy2.robjects import pandas2ri
from rpy2.robjects import default_converter

# Using a conversion context in which the pandas conversion is
# added to the default conversion rules, the rpy2 object
# `mtcars` (an R data frame) is converted to a pandas data frame.
with (default_converter + pandas2ri.converter).context() as cv:
    pd_mtcars = mtcars_env['mtcars']
print(type(pd_mtcars))
<class 'pandas.core.frame.DataFrame'>

Using a local conversion context lets us also go from the pandas data frame to our dplyr-augmented R data frame and use the dplyr transformations on it.

with (default_converter + pandas2ri.converter).context() as cv:
    dataf = (DataFrame(pd_mtcars)
             .filter(rl('gear>=3'))
             .mutate(powertoweight=rl('hp*36/wt'))
             .group_by(rl('gear'))
             .summarize(mean_ptw=rl('mean(powertoweight)')))

dataf
DataFrame with 1 rows and 1 columns:
      mean_ptw
0  1  1632.0477884748632

Reuse. Get things done. Don’t reimplement.

tidyr

Note

This section is available as a jupyter notebook tidyr.ipynb (HTML render: tidyr.html)

from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")

tidyr in Python

from rpy2.robjects.lib.tidyr import DataFrame
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/rpy2/robjects/lib/dplyr.py:27: UserWarning: This was designed againt dplyr versions starting with 1.0 but you have 1.1.2
  warnings.warn(
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/rpy2/robjects/lib/tidyr.py:12: UserWarning: This was designed againt tidyr versions starting with 1.2. but you have 1.3.0
  warnings.warn(

(note: dplyr is implicitly used by tidyr.)

In addition to that, and because this tutorial is in a notebook, we initialize HTML rendering for R objects (pretty display of R data frames).

import rpy2.ipython.html
rpy2.ipython.html.init_printing()

from collections import OrderedDict
from rpy2.robjects.vectors import (StrVector,
                                   IntVector)
dataf = DataFrame(OrderedDict(x=StrVector(("a", "b", "b")),
                              y=IntVector((3, 4, 5)),
                              z=IntVector((6, 7, 8))))
dataf
DataFrame with 3 rows and 3 columns:
      x  y  z
0  1  a  3  6
1  2  b  4  7
2  3  b  5  8

dataf.spread('x', 'y')
DataFrame with 3 rows and 3 columns:
      z  a            b
0  1  6  3            NA_integer_
1  2  7  NA_integer_  4
2  3  8  NA_integer_  5
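
The reshaping performed by spread can be sketched in plain Python: the distinct values of the key column (x) become new columns, filled from the value column (y), and cells with no matching row stay empty. The spread function below is a hypothetical helper written for this illustration, not the tidyr implementation, and it uses None where R shows NA_integer_.

```python
# Long-format records, matching the data frame above.
records = [
    {"x": "a", "y": 3, "z": 6},
    {"x": "b", "y": 4, "z": 7},
    {"x": "b", "y": 5, "z": 8},
]


def spread(rows, key, value):
    """Turn the distinct values of `key` into columns filled from `value`.

    Rows are grouped by the remaining columns; missing cells are None.
    """
    new_cols = sorted({row[key] for row in rows})
    wide = {}
    for row in rows:
        rest = tuple((k, v) for k, v in row.items() if k not in (key, value))
        out = wide.setdefault(rest, {**dict(rest), **dict.fromkeys(new_cols)})
        out[row[key]] = row[value]
    return list(wide.values())


wide = spread(records, "x", "y")
for row in wide:
    print(row)
```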

Reuse. Get things done. Don’t reimplement.