Quickstart

Style 1: Context managers

>>> import pandas as pd
>>> from dppd import dppd
>>> from plotnine.data import mtcars
>>> with dppd(mtcars) as (dp, X):  # note parentheses!
...   dp.select(['name', 'hp', 'cyl'])
...   dp.filter_by(X.hp > 100).head(1)

>>> print(X.head())
        name   hp  cyl
0  Mazda RX4  110    6
>>> print(isinstance(X, pd.DataFrame))
True
>>> type(X)
<class 'dppd.base.DPPDAwareProxy'>
>>> print(len(X))
1
>>>m2 = X.pd
>>>type(m2)
<class 'pandas.core.frame.DataFrame'>

Within the context manager, dp is always the latest Dppd object and X is always the latest intermediate DataFrame. Once the context manager has ended, both variables (dp and X here) point to a proxy of the final DataFrame object.

That proxy should, thanks to wrapt , behave just like DataFrames, except that they have a property ‘.pd’ that returns the real DataFrame object.

Style 2: dp…..pd

>>>import pandas as pd
>>>from dppd import dppd
>>>from plotnine.data import mtcars
>>>dp, X = dppd()

>>> mt2 = (dp(mtcars)
  .select(['name', 'hp', 'cyl'])
  .filter_by(X.hp > 100)
  .head()
  .pd
)
>>> print(mt2.head())
                name   hp  cyl
0          Mazda RX4  110    6
1      Mazda RX4 Wag  110    6
3     Hornet 4 Drive  110    6
4  Hornet Sportabout  175    8
5            Valiant  105    6
>>> print(type(mt2))
<class 'pandas.core.frame.DataFrame'>

The inline-style is more casual, but requires the final call .pd to retrieve the DataFrame object, otherwise you have a dppd.Dppd.

How does it work

dppd follows the old adage that there’s only one problem not solvable by another layer of indirection, and achives it’s pipeline-style method chaining by having a proxy object X that always points to the latest DataFrame created in the pipeline.

This allows for example the following:

>>> with dppd(mtcars) as (dp, X):
...   high_kwh  = dp(mtcars).mutate(kwh = X.hp * 0.74).filter_by(X.kwh > 80).iloc[:2].pd
...
>>> high_kwh
                name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb    kwh
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4   81.4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4   81.4

Note

Note that at this point (X == high_khw).all() and (dp == high_khw).all().

This approach is different to dplyr and other python implementations of the ‘grammar of data manipulation’ - see comparisons.

Dppd also contains a single-dispatch mechanism to avoid monkey patching. See the section on extending dppd

What’s next?

To learn more please refer to the sections on Dpplyr verbs, dppd verbs and grouping.