Comparison of dplyr and various python approaches

There have been various attempts to bring dplyr’s ease of use to pandas DataFrames. This document attempts a ‘Rosetta stone’ style translation and some characterization of the individual libraries.

Please note that I’m not overly familiar with each of these libraries; pull requests to improve the ‘translations’ are welcome.

Libraries compared

The magic of dplyr

Non-standard evaluation is R’s killer feature: it allows you to write statements that are evaluated ‘later’, in a different context, for example in that of your DataFrame. mutate(df, new_column = old_column * 5), for example, creates a new DataFrame with an additional column, which is set to df$old_column * 5.

Python can approach this with its first-class functions, but the lambda syntax remains cumbersome in comparison.
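For illustration, the mutate example above can be written in plain pandas with `assign` and a lambda (a minimal sketch with a toy DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'old_column': [1, 2, 3]})

# dplyr: mutate(df, new_column = old_column * 5)
# the lambda delays evaluation until assign() supplies the DataFrame
new_df = df.assign(new_column=lambda d: d['old_column'] * 5)
```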

To understand how this piece of code works, you need to understand piping:

It transforms an expression df >> g >> f >> h into the equivalent h(f(g(df)))

This works around a limitation of R’s object model, which always dispatches on functions and accordingly offers no method-chaining fluent interfaces. It combines beautifully with R’s late binding and seamless currying.
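The `df >> g >> f >> h` rewriting can be emulated in Python by overloading `>>` on a small wrapper class (a minimal sketch of the mechanism, not how any particular library implements it):

```python
class Pipe:
    """Wraps a value so that `>>` feeds it through the next function."""
    def __init__(self, value):
        self.value = value

    def __rshift__(self, func):
        # Pipe(x) >> f  becomes  Pipe(f(x))
        return Pipe(func(self.value))

# Pipe(5) >> f >> g is equivalent to g(f(5))
result = Pipe(5) >> (lambda x: x + 1) >> (lambda x: x * 2)
```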

```r
flights_sml %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
```

for example groups a DataFrame of flights by their date, orders a column (descending), turns it into a rank 1..n, and selects those rows from the original data frame that were in the top 10 (i.e. worst delays) on each day.
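Plain pandas can express the same selection with a grouped rank (a sketch using a toy stand-in for flights_sml):

```python
import pandas as pd

# toy stand-in for flights_sml
flights_sml = pd.DataFrame({
    'year':      [2013, 2013, 2013, 2013],
    'month':     [1, 1, 1, 1],
    'day':       [1, 1, 2, 2],
    'arr_delay': [100, 5, 80, 90],
})

# rank arr_delay within each day (descending) and keep the worst delays
worst = flights_sml[
    flights_sml.groupby(['year', 'month', 'day'])['arr_delay']
               .rank(ascending=False) < 10
]
```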

Dplyr is open in the sense that it’s easy to extend both the set of verbs (trivially, by defining functions that take a data frame and return one) and the set of supported objects (for example to database tables; less trivial).
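The same extension pattern carries over to pandas: any function that takes a DataFrame and returns one can act as a verb, and pandas’ own `pipe()` will chain them (a hypothetical example, not tied to any of the libraries below):

```python
import pandas as pd

def add_row_total(df):
    """A custom 'verb': DataFrame in, DataFrame (with a total column) out."""
    return df.assign(total=df.sum(axis=1))

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
result = df.pipe(add_row_total)  # pandas' pipe() chains such verbs
```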

A critique on the existing python dplyr clones (or why dppd)

Most of the dplyr-inspired Python libraries try very hard to reproduce the two core aspects of dplyr: piping and non-standard evaluation. Piping is usually fairly well implemented, but reads unpythonic and is usually accompanied by namespace pollution. Non-standard evaluation is harder to implement correctly, and every implementation so far severely limits the expressiveness that Python offers (e.g. no list comprehensions).

As an example, the following code is the dfply equivalent of the flight-filtering R code above:

```python
import pandas as pd
from dfply import *

(flights_sml
  >> group_by(X.year, X.month, X.day)
  >> filter_by(
       make_symbolic(pd.Series.rank)(X.arr_delay, ascending=False) < 10))
```

The big insight of dppd is that in many cases this is not actually non-standard evaluation that needs to be evaluated later, but simply a ‘variable not available in context’ problem, which can be solved with a proxy variable X that always points to the latest DataFrame created. In the other cases, a fallback to functions/lambdas is no worse than having to wrap the function in a decorator.
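The proxy idea can be sketched in a few lines (a simplified illustration, not dppd’s actual implementation):

```python
class LatestProxy:
    """Forwards attribute access to whatever object was registered last,
    so expressions using it are evaluated immediately — standard evaluation."""
    def __init__(self):
        self._current = None

    def _update(self, obj):
        self._current = obj

    def __getattr__(self, name):
        # X.foo resolves right away against the most recently registered object
        return getattr(self._current, name)

X = LatestProxy()
X._update([3, 1, 2])
pos = X.index(1)  # forwards to list.index on the registered list
```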

Combined with a pythonic method-chaining fluent API, it looks like this:

```python
from dppd import dppd
dp, X = dppd()
dp(flights_sml).filter_by(
    X.groupby(['year', 'month', 'day']).arr_delay.rank(ascending=False) < 10
).pd  # .pd unwraps the result back into a plain pd.DataFrame
```


## Summary notes:

None of the X-marks-the-non-standard-evaluation approaches is complete with regard to Python’s capabilities; they already fail at such basic things as transforming a column by an arbitrary function.

The ‘eval a string’ approach is better in this regard, but still fails, for example, on list comprehensions.

All of them could benefit from a ‘delayed’ row- or value-wise callback function, both for columns and for rows. plydata (patsy?) might actually support list comprehensions!
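Plain pandas already offers this kind of delayed callback, which the clones could wrap (a sketch of the idea):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# value-wise: the lambda is called once per value of the column
df['a_doubled'] = df['a'].map(lambda v: v * 2)

# row-wise: the lambda is called once per row
df['row_sum'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
```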


- could benefit from loc/iloc
- very unclean exports (dataframe)
- no concat
- query supports no Q
- if_else is unnecessary -> replace does the job (case_when ditto?)

- missing loc verb
- unclean exports (warnings)
- the ease of dfpipe must not be underestimated

- unnecessary casting, results are not pd.DataFrame (define `__rrshift__` on every method?)
- unclean exports (types)
- no concat
- select really rudimentary / does not take lists?
- unclear when you can do `df >>` and when you have to do `DplyFrame(df)`