dplyr in Python

We need 2 things for this:

1- A data frame (using one of R's demo datasets).
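A minimal sketch of this step, assuming the `mtcars` dataset from R's `datasets` package (the choice of dataset and the helper name `load_mtcars` are illustrative, not taken from the text above). The imports sit inside the function so that R is only started when the function is called.

```python
def load_mtcars():
    """Return R's `mtcars` demo dataset as an rpy2 R data frame.

    Requires rpy2 and a working R installation.
    """
    from rpy2.robjects.packages import importr, data

    # Import R's `datasets` package and fetch one of its datasets by name.
    datasets = importr("datasets")
    mtcars_env = data(datasets).fetch("mtcars")
    # `fetch` returns an environment; the data frame is stored under its name.
    return mtcars_env["mtcars"]
```

Calling `mtcars = load_mtcars()` in a notebook cell gives an R data frame to work with.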

In addition to that, and because this tutorial is in a notebook, we initialize HTML rendering for R objects (pretty display of R data frames).
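The HTML rendering can be switched on roughly as follows. This is a sketch that assumes rpy2 is installed together with IPython/Jupyter; the wrapper name `enable_html_display` is hypothetical.

```python
def enable_html_display():
    """Turn on HTML rendering of rpy2 objects in a Jupyter notebook."""
    # Imported lazily: this module needs IPython to be available.
    import rpy2.ipython.html

    # Register the HTML representations so R data frames display as tables.
    rpy2.ipython.html.init_printing()
```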

2- dplyr

With these in place, transformations can be written by chaining method calls (D3-style)
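A sketch of the chained style, assuming the `mtcars` dataset. The filter `gear > 3` and the formula `hp / wt` for `powertoweight` are illustrative assumptions, as is the function name `summarize_by_gear`.

```python
def summarize_by_gear():
    """Chain dplyr verbs as method calls on rpy2's dplyr DataFrame.

    Requires rpy2, R, and the R package dplyr.
    """
    from rpy2.robjects import rl
    from rpy2.robjects.lib.dplyr import DataFrame
    from rpy2.robjects.packages import importr, data

    # Load the demo dataset (an assumption made for this example).
    mtcars = data(importr("datasets")).fetch("mtcars")["mtcars"]

    # Each rl(...) is an unevaluated R expression handed to dplyr.
    return (
        DataFrame(mtcars)
        .filter(rl("gear > 3"))
        .mutate(powertoweight=rl("hp / wt"))
        .group_by(rl("gear"))
        .summarize(mean_ptw=rl("mean(powertoweight)"))
    )
```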

or, alternatively, with pipes (magrittr-style).

The function rl creates unevaluated R language objects, which are then consumed by the dplyr function, just as would happen when using dplyr in R itself. This means that when writing mean(powertoweight), it is the R function mean() that is used.

Using a Python function is not too difficult though. We can just call Python back from R. To achieve this we simply use the decorator rternalize.
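A sketch of that callback, under the assumption that the wrapped function is placed in R's global environment so that R expressions can find it by name. The names `mean_py` and `register_in_r` are hypothetical.

```python
import statistics


def mean_py(x):
    """Plain Python mean, computed with the standard library."""
    return statistics.mean(x)


def register_in_r(name="mean_py"):
    """Expose `mean_py` to R under `name` (requires rpy2 and R)."""
    from rpy2.rinterface import rternalize
    from rpy2.robjects import globalenv

    # rternalize wraps the Python callable as an R function;
    # it can equally be applied with the decorator syntax.
    globalenv[name] = rternalize(mean_py)
```

After `register_in_r()`, an expression such as `rl('mean_py(powertoweight)')` should call back into Python instead of using R's `mean()`.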

It is also possible to carry this out without having to place the custom function in R's global environment, although this is not straightforward.

Note: rpy2's interface to dplyr implements a fix for the (non-?)issue 1323 (https://github.com/hadley/dplyr/issues/1323).

dplyr's seamless translation of transformations to SQL, applied whenever the data are in a database table, can be used directly. Since we are lifting the original implementation of dplyr, it just works.

Since we are manipulating R objects, anything available to R is also available to us. This includes the tools for inspecting the generated SQL code, such as dplyr's R function show_query().

The conversion rules in rpy2 make the above easily applicable to pandas data frames, completing the "lexical loan" of the dplyr vocabulary from R.

Using a local converter lets us also go from the pandas data frame to our dplyr-augmented R data frame and use the dplyr transformations on it.
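One way this could look is sketched below. It is a variant that converts the pandas data frame explicitly inside the local conversion context and then wraps the result, which may differ from the exact code the text refers to. The function name `summarize_pandas` and the `gear` / `hp` / `wt` columns are assumptions.

```python
def summarize_pandas(pd_dataf):
    """Apply dplyr verbs to a pandas data frame via rpy2.

    Requires rpy2, pandas, R, and the R package dplyr. `pd_dataf` is
    assumed to have numeric columns named gear, hp and wt.
    """
    from rpy2.robjects import default_converter, pandas2ri, rl
    from rpy2.robjects.conversion import localconverter
    from rpy2.robjects.lib.dplyr import DataFrame

    # The pandas rules are only active inside this context.
    with localconverter(default_converter + pandas2ri.converter) as cv:
        r_dataf = cv.py2rpy(pd_dataf)

    # Wrap the R data frame to get the dplyr methods, then chain as before.
    return (
        DataFrame(r_dataf)
        .filter(rl("gear > 3"))
        .mutate(powertoweight=rl("hp / wt"))
        .group_by(rl("gear"))
        .summarize(mean_ptw=rl("mean(powertoweight)"))
    )
```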

Reuse. Get things done. Don't reimplement.