The why of dppd

Undoubtly, in R dplyr is a highly useful library since many of it’s verbs are not available otherwise.

But pandas, which has been moving to support method chaining in the last few releases already does most of dplyr’s verbs, so why is there half a dozen dplyr clones for python, including this one (see <comparison>)?

Part of it is likely to be historic - the clone projects started before pandas DataFrames were as chainable as they are today.

Another part is the power of R’s non-standard-evaluation, which, if unpythonic, has a certain allure.

Dppd brings three things to pandas:
  • the proxy that always points to the latest DataFrame (or object), which ‘fakes’ non-standard-evaluation at the full power of python
  • filtering on groupby()ed DataFrames
  • R like column specifications for selection and sorting.

Proxy X

X is always the latest object:

>>> dp(mtcars).assign(kwh=X.hp * 0.74).filter_by(X.kwh > 100).head(5).pd
                name   mpg  cyl   disp   hp  drat    wt   qsec  vs  am  gear  carb    kwh
4   Hornet Sportabout  18.7    8  360.0  175  3.15  3.44  17.02   0   0     3     2  129.5
6          Duster 360  14.3    8  360.0  245  3.21  3.57  15.84   0   0     3     4  181.3
11         Merc 450SE  16.4    8  275.8  180  3.07  4.07  17.40   0   0     3     3  133.2
12         Merc 450SL  17.3    8  275.8  180  3.07  3.73  17.60   0   0     3     3  133.2
13        Merc 450SLC  15.2    8  275.8  180  3.07  3.78  18.00   0   0     3     3  133.2

Filtering groupbyed DataFrames

Let’s take the example from our Readme, which calculates the highest cars by kwh from the mtcars dataset (allowing for ties):

 >>> from plotnine.data import mtcars
 >>> from dppd import dppd
 >>> dp, X = dppd()
>>> (dp(mtcars)
...       .mutate(kwh = X.hp * 0.74)
...       .groupby('cyl')
...       .filter_by(X.kwh.rank() < 2)
...       .ungroup().pd
...       )

    cyl              name   mpg   disp   hp  drat     wt   qsec  vs  am  gear  carb     kwh
 5     6           Valiant  18.1  225.0  105  2.76  3.460  20.22   1   0     3     1   77.70
 18    4       Honda Civic  30.4   75.7   52  4.93  1.615  18.52   1   1     4     2   38.48
 21    8  Dodge Challenger  15.5  318.0  150  2.76  3.520  16.87   0   0     3     2  111.00
 22    8       AMC Javelin  15.2  304.0  150  3.15  3.435  17.30   0   0     3     2  111.00

And the pandas equivalent:

>>> mtcars = mtcars.assign(kwh = mtcars['hp'] * 0.74)
>>> ranks = mtcars.groupby('cyl').kwh.rank()
>>> mtcars[ranks < 2]
                name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb     kwh
5            Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1   77.70
18       Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2   38.48
21  Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2  111.00
22       AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2  111.00

Column specifications

Selecting columns in pandas is alread powerful, using the df.columns.str.whatever methods. It is verbose though, and sort_values with it’s ‘ascending’ parameter is way to many characters just to invert the sorting order on a column.

Dppd supports a mini language for column specifications - see dppd.column_spec.parse_column_specification() for details:

# drop column name
>>> dp(mtcars).select('-name').head(1).pd
    mpg  cyl   disp   hp  drat    wt   qsec  vs  am  gear  carb   kwh
0  21.0    6  160.0  110   3.9  2.62  16.46   0   1     4     4  81.4

# sort by hp inverted
>>> dp(mtcars).arrange('-hp').head(2).select(['name','cyl','hp']).pd
          name  cyl  hp
18  Honda Civic    4  52
7     Merc 240D    4  62

Single dispatch ‘clean monkey patching’ engine

Dppd internally is in essence a clean monkey-patching single dispatch engine that allows you to wrap types beyond the DataFrame.e