.. code:: ipython3
    from functools import partial
    from rpy2.ipython import html
    html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")
dplyr in Python
===============
We need 2 things for this:
1- A data frame (using one of R’s demo datasets).
.. code:: ipython3
    from rpy2.robjects.packages import importr, data
    datasets = importr('datasets')
    mtcars_env = data(datasets).fetch('mtcars')
    mtcars = mtcars_env['mtcars']
In addition to that, and because this tutorial is in a notebook, we
initialize HTML rendering for R objects (pretty display of R data
frames).
.. code:: ipython3
    import rpy2.ipython.html
    rpy2.ipython.html.init_printing()
2- dplyr
.. code:: ipython3
    from rpy2.robjects.lib.dplyr import DataFrame
    from rpy2.robjects import rl
.. parsed-literal::
    /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rpy2/robjects/lib/dplyr.py:27: UserWarning: This was designed against dplyr versions starting with 1.0 but you have 1.1.4
      warnings.warn(
With this we have the choice of chaining (D3-style)
.. code:: ipython3
    dataf = (
        DataFrame(mtcars)
        .filter(rl('gear>3'))
        .mutate(powertoweight=rl('hp*36/wt'))
        .group_by(rl('gear'))
        .summarize(mean_ptw=rl('mean(powertoweight)'))
    )
    
    dataf
.. raw:: html
    
    DataFrame with 2 rows and
      2 columns:
    
      
        
          |  |  | gear | mean_ptw | 
      
      
        
          | 0 | 1 | 4.0 | 1237.1266499803169 | 
        
          | 1 | 2 | 5.0 | 2574.0331639315027 | 
      
    
or with pipes (magrittr style).
.. code:: ipython3
    # currently no longer working
    from rpy2.robjects.lib.dplyr import (filter,
                                         mutate,
                                         group_by,
                                         summarize)
    
    if False:
        dataf = (DataFrame(mtcars) >>
                 filter(rl('gear>3')) >>
                 mutate(powertoweight=rl('hp*36/wt')) >>
                 group_by(rl('gear')) >>
                 summarize(mean_ptw=rl('mean(powertoweight)')))
    
        dataf
The function ``rl`` creates unevaluated R language objects, which are
then consummed by the ``dplyr`` function, just like it would be
happening when using ``dplyr`` in R itself. This means that when writing
``mean(powertoweight)`` the R function ``mean()`` is used.
Using a Python function is not too difficult though. We can just call
Python back from R. To achieve this we simply use the decorator
``rternalize``.
.. code:: ipython3
    # Define a python function, and make
    # it a function R can use through `rternalize`
    from rpy2.rinterface import rternalize
    @rternalize(signature=False)
    def mean_np(x):
        import statistics
        return statistics.mean(x)
    
    # Bind that function to a symbol in R's
    # global environment
    from rpy2.robjects import globalenv
    globalenv['mean_np'] = mean_np
    
    # Write a dplyr chain of operations,
    # using our Python function `mean_np`
    dataf = (
        DataFrame(mtcars)
        .filter(rl('gear>3'))
        .mutate(powertoweight=rl('hp*36/wt'))
        .group_by(rl('gear'))
        .summarize(mean_ptw=rl('mean(powertoweight)'),
                   mean_np_ptw=rl('mean_np(powertoweight)'))
    )
    
    dataf
.. raw:: html
    
    DataFrame with 2 rows and
      3 columns:
    
      
        
          |  |  | gear | mean_ptw | mean_np_ptw | 
      
      
        
          | 0 | 1 | 4.0 | 1237.1266499803169 | 1237.1266499803169 | 
        
          | 1 | 2 | 5.0 | 2574.0331639315027 | 2574.0331639315027 | 
      
    
It is also possible to carry this out without having to place the custom
function in R’s global environment, although this is not
straightforward.
.. code:: ipython3
    # First we delete our Python callback in globalenv to
    # ensure that we are picking up our callback in our
    # specific environment rather than this one.
    del(globalenv['mean_np'])
    
    from rpy2.robjects import Environment
    my_env = Environment()
    my_env['mean_np'] = mean_np
    
    
    # Create an rlang "quosure" object within
    # a given environment. We use the R package
    # rlang used by dplyr.
    from rpy2.robjects.lib.dplyr import rlang
    
    myquo = rlang.quo.rcall(
        [(None, rl('mean_np(rlang::enexpr(powertoweight))'))],
        environment=my_env
    )
    
    dataf = (
        DataFrame(mtcars)
        .filter(rl('gear>3'))
        .mutate(powertoweight=rl('hp*36/wt'))
        .group_by(rl('gear'))
        .summarize(
            mean_ptw=rl('mean(powertoweight)'),
            mean_np_ptw=myquo)
    )
    
    dataf
.. raw:: html
    
    DataFrame with 2 rows and
      3 columns:
    
      
        
          |  |  | gear | mean_ptw | mean_np_ptw | 
      
      
        
          | 0 | 1 | 4.0 | 1237.1266499803169 | 1237.1266499803169 | 
        
          | 1 | 2 | 5.0 | 2574.0331639315027 | 2574.0331639315027 | 
      
    
**note**: rpy2’s interface to dplyr is implementing a fix to the
(non-?)issue 1323 (https://github.com/hadley/dplyr/issues/1323)
The seamless translation of transformations to SQL whenever the data are
in a table can be used directly. Since we are lifting the original
implementation of ``dplyr``, it *just works*.
.. code:: ipython3
    from rpy2.robjects.lib.dplyr import dplyr
    # in-memory SQLite database broken in dplyr's src_sqlite
    # db = dplyr.src_sqlite(":memory:")
    import tempfile
    with tempfile.NamedTemporaryFile() as db_fh:
        db = dplyr.src_sqlite(db_fh.name)
        # copy the table to that database
        dataf_db = DataFrame(mtcars).copy_to(db, name="mtcars")
        res = (dataf_db
               .filter(rl('gear>3'))
               .mutate(powertoweight=rl('hp*36/wt'))
               .group_by(rl('gear'))
               .summarize(mean_ptw=rl('mean(powertoweight)')))
        print(res)
    # 
.. parsed-literal::
    # Source:   SQL [2 x 2]
    # Database: sqlite 3.45.0 [/tmp/tmpcojvog4z]
       gear mean_ptw
          
    1     4    1237.
    2     5    2574.
    
Since we are manipulating R objects, anything available to R is also
available to us. If we want to see the SQL code generated that’s:
.. code:: ipython3
    silent = dplyr.show_query(res)
.. parsed-literal::
    
.. parsed-literal::
    SELECT `gear`, AVG(`powertoweight`) AS `mean_ptw`
    FROM (
      SELECT `mtcars`.*, (`hp` * 36.0) / `wt` AS `powertoweight`
      FROM `mtcars`
      WHERE (`gear` > 3.0)
    ) AS `q01`
    GROUP BY `gear`
The conversion rules in rpy2 make the above easily applicable to pandas
data frames, completing the “lexical loan” of the dplyr vocabulary from
R.
.. code:: ipython3
    from rpy2.robjects import pandas2ri
    from rpy2.robjects import default_converter
    
    # Using a conversion context in which the pandas conversion is
    # added to the default conversion rules, the rpy2 object
    # `mtcars` (an R data frame) is converted to a pandas data frame.
    with (default_converter + pandas2ri.converter).context() as cv:
        pd_mtcars = mtcars_env['mtcars']
    print(type(pd_mtcars))
.. parsed-literal::
    
Using a local conversion context lets us also go from the pandas data
frame to our dplyr-augmented R data frame and use the dplyr
transformations on it.
.. code:: ipython3
    with (default_converter + pandas2ri.converter).context() as cv:
        dataf = (DataFrame(pd_mtcars)
                 .filter(rl('gear>=3'))
                 .mutate(powertoweight=rl('hp*36/wt'))
                 .group_by(rl('gear'))
                 .summarize(mean_ptw=rl('mean(powertoweight)')))
    
    dataf
.. raw:: html
    
    DataFrame with 1 rows and
      1 columns:
    
      
        
          |  |  | mean_ptw | 
      
      
        
          | 0 | 1 | 1632.0477884748632 | 
      
    
**Reuse. Get things done. Don’t reimplement.**