{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from functools import partial\n",
"from rpy2.ipython import html\n",
"html.html_rdataframe=partial(html.html_rdataframe, table_class=\"docutils\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# dplyr in Python\n",
"\n",
"We need 2 things for this:\n",
"\n",
"1- A data frame (using one of R's demo datasets)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from rpy2.robjects.packages import importr, data\n",
"datasets = importr('datasets')\n",
"mtcars_env = data(datasets).fetch('mtcars')\n",
"mtcars = mtcars_env['mtcars']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to that, and because this tutorial is in a notebook,\n",
"we initialize HTML rendering for R objects (pretty display of\n",
"R data frames)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import rpy2.ipython.html\n",
"rpy2.ipython.html.init_printing()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2- dplyr"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from rpy2.robjects.lib.dplyr import DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this we have the choice of chaining (D3-style)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"DataFrame with 2 rows and\n",
" 2 columns:\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" gear | \n",
" mean_ptw | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 4.0 | \n",
" 1237.1266499803169 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 5.0 | \n",
" 2574.0331639315027 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"R object with classes: ('tbl_df', 'tbl', 'data.frame') mapped to:\n",
"[FloatSexpVector, FloatSexpVector]\n",
" gear: \n",
" [RTYPES.REALSXP]\n",
" mean_ptw: \n",
" [RTYPES.REALSXP]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataf = (DataFrame(mtcars).\n",
" filter('gear>3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
"\n",
"dataf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or with pipes (magrittr style)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# currently no longer working\n",
"from rpy2.robjects.lib.dplyr import (filter,\n",
" mutate,\n",
" group_by,\n",
" summarize)\n",
"\n",
"if False:\n",
" dataf = (DataFrame(mtcars) >>\n",
" filter('gear>3') >>\n",
" mutate(powertoweight='hp*36/wt') >>\n",
" group_by('gear') >>\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
"\n",
" dataf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The strings passed to the dplyr function are evaluated as expression,\n",
"just like this is happening when using dplyr in R. This means that\n",
"when writing `mean(powertoweight)` the R function `mean()` is used.\n",
"\n",
"Using a Python function is not too difficult though. We can just\n",
"call Python back from R. To achieve this we simply\n",
"use the decorator `rternalize`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"DataFrame with 2 rows and\n",
" 3 columns:\n",
"\n",
" \n",
" \n",
" | \n",
" | \n",
" gear | \n",
" mean_ptw | \n",
" mean_np_ptw | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 4.0 | \n",
" 1237.1266499803169 | \n",
" 1237.1266499803169 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 5.0 | \n",
" 2574.0331639315027 | \n",
" 2574.0331639315027 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"R object with classes: ('tbl_df', 'tbl', 'data.frame') mapped to:\n",
"[FloatSexpVector, FloatSexpVector, FloatSexpVector]\n",
" gear: \n",
" [RTYPES.REALSXP]\n",
" mean_ptw: \n",
" [RTYPES.REALSXP]\n",
" mean_np_ptw: \n",
" [RTYPES.REALSXP]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define a python function, and make\n",
"# it a function R can use through `rternalize`\n",
"from rpy2.rinterface import rternalize\n",
"@rternalize\n",
"def mean_np(x):\n",
" import statistics\n",
" return statistics.mean(x)\n",
"\n",
"# Bind that function to a symbol in R's\n",
"# global environment\n",
"from rpy2.robjects import globalenv\n",
"globalenv['mean_np'] = mean_np\n",
"\n",
"# Write a dplyr chain of operations,\n",
"# using our Python function `mean_np`\n",
"dataf = (DataFrame(mtcars).\n",
" filter('gear>3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)',\n",
" mean_np_ptw='mean_np(powertoweight)'))\n",
"\n",
"dataf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to carry this out without having to\n",
"place the custom function in R's global environment."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"del(globalenv['mean_np'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"DataFrame with 2 rows and\n",
" 3 columns:\n",
"\n",
" \n",
" \n",
" | \n",
" | \n",
" gear | \n",
" mean_ptw | \n",
" mean_np_ptw | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 4.0 | \n",
" 1237.1266499803169 | \n",
" 1237.1266499803169 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 5.0 | \n",
" 2574.0331639315027 | \n",
" 2574.0331639315027 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"R object with classes: ('tbl_df', 'tbl', 'data.frame') mapped to:\n",
"[FloatSexpVector, FloatSexpVector, FloatSexpVector]\n",
" gear: \n",
" [RTYPES.REALSXP]\n",
" mean_ptw: \n",
" [RTYPES.REALSXP]\n",
" mean_np_ptw: \n",
" [RTYPES.REALSXP]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from rpy2.robjects.lib.dplyr import StringInEnv\n",
"from rpy2.robjects import Environment\n",
"my_env = Environment()\n",
"my_env['mean_np'] = mean_np\n",
"\n",
"dataf = (DataFrame(mtcars).\n",
" filter('gear>3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)',\n",
" mean_np_ptw=StringInEnv('mean_np(powertoweight)',\n",
" my_env)))\n",
"\n",
"dataf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**note**: rpy2's interface to dplyr is implementing a fix to the (non-?)issue 1323\n",
"(https://github.com/hadley/dplyr/issues/1323)\n",
"\n",
"The seamless translation of transformations to SQL whenever the\n",
"data are in a table can be used directly. Since we are lifting\n",
"the original implementation of `dplyr`, it *just works*."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[90m# Source: lazy query [?? x 2]\u001b[39m\n",
"\u001b[90m# Database: sqlite 3.29.0 [/tmp/tmprmrxugij]\u001b[39m\n",
" gear mean_ptw\n",
" \u001b[3m\u001b[90m\u001b[39m\u001b[23m \u001b[3m\u001b[90m\u001b[39m\u001b[23m\n",
"\u001b[90m1\u001b[39m 4 \u001b[4m1\u001b[24m237.\n",
"\u001b[90m2\u001b[39m 5 \u001b[4m2\u001b[24m574.\n",
"\n"
]
}
],
"source": [
"from rpy2.robjects.lib.dplyr import dplyr\n",
"# in-memory SQLite database broken in dplyr's src_sqlite\n",
"# db = dplyr.src_sqlite(\":memory:\")\n",
"import tempfile\n",
"with tempfile.NamedTemporaryFile() as db_fh:\n",
" db = dplyr.src_sqlite(db_fh.name)\n",
" # copy the table to that database\n",
" dataf_db = DataFrame(mtcars).copy_to(db, name=\"mtcars\")\n",
" res = (dataf_db.\n",
" filter('gear>3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
" print(res)\n",
"# "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are manipulating R objects, anything available to R is also available\n",
"to us. If we want to see the SQL code generated that's:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT `gear`, AVG(`powertoweight`) AS `mean_ptw`\n",
"FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `hp` * 36.0 / `wt` AS `powertoweight`\n",
"FROM (SELECT *\n",
"FROM `mtcars`\n",
"WHERE (`gear` > 3.0)))\n",
"GROUP BY `gear`\n"
]
}
],
"source": [
"silent = dplyr.show_query(res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conversion rules in rpy2 make the above easily applicable to pandas data frames,\n",
"completing the \"lexical loan\" of the dplyr vocabulary from R."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"attributes": {
"classes": [
"python "
],
"id": ""
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from rpy2.robjects import pandas2ri\n",
"from rpy2.robjects import default_converter\n",
"from rpy2.robjects.conversion import localconverter\n",
"\n",
"# Using a conversion context in which the pandas conversion is\n",
"# added to the default conversion rules, the rpy2 object\n",
"# `mtcars` (an R data frame) is converted to a pandas data frame.\n",
"with localconverter(default_converter + pandas2ri.converter) as cv:\n",
" pd_mtcars = mtcars_env['mtcars']\n",
"print(type(pd_mtcars))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a local converter lets us also go from the pandas data frame\n",
"to our dplyr-augmented R data frame and use the dplyr transformations\n",
"on it."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"DataFrame with 1 rows and\n",
" 1 columns:\n",
"\n",
" \n",
" \n",
" | \n",
" | \n",
" mean_ptw | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1632.0477884748632 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"R object with classes: ('data.frame',) mapped to:\n",
"[FloatSexpVector]\n",
" mean_ptw: \n",
" [RTYPES.REALSXP]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with localconverter(default_converter + pandas2ri.converter) as cv:\n",
" dataf = (DataFrame(pd_mtcars).\n",
" filter('gear>=3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
"\n",
"dataf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Reuse. Get things done. Don't reimplement.**"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}