Dplyr verbs

All dplyr verbs stay ‘in pipeline’: you can chain them together on a Dppd object.

Mutate

Adds new columns.

mutate() takes keyword arguments that are turned into columns on the DataFrame.

Excluding grouping, this is a straightforward wrapper around pandas.DataFrame.assign().

Example:

>>> dp(mtcars).mutate(lower_case_name = X.name.str.lower()).head(1).pd
        name   mpg  cyl   disp   hp  drat    wt   qsec  vs  am  gear  carb lower_case_name
0  Mazda RX4  21.0    6  160.0  110   3.9  2.62  16.46   0   1     4     4       mazda rx4
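
Since mutate() is described as a wrapper around pandas.DataFrame.assign(), the plain-pandas counterpart of the example above can be sketched as follows. This is an illustration only: it does not use dppd, and the two-row `cars` frame is a hypothetical stand-in for mtcars built from names and hp values shown in this section.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: two rows taken from the outputs in this section.
cars = pd.DataFrame({"name": ["Mazda RX4", "Duster 360"], "hp": [110, 245]})

# Each keyword argument to assign() becomes a new column,
# mirroring mutate(lower_case_name=X.name.str.lower()).
result = cars.assign(lower_case_name=lambda df: df["name"].str.lower())
print(result)
```

The callable form is used here so that the new column is computed from the frame being assigned to, which is roughly the role X plays in the dppd pipeline.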

Select

Pick columns, with optional rename.

Example:

>>> dp(mtcars).select('name').head(1).pd
        name
0  Mazda RX4

>>> dp(mtcars).select([X.name, 'hp']).columns.pd
Index(['name', 'hp'], dtype='object')

>>> dp(mtcars).select(X.columns.str.startswith('c')).columns.pd
Index(['cyl', 'carb'], dtype='object')

>>> dp(mtcars).select(['-hp','-cyl','-am']).columns.pd
Index(['name', 'mpg', 'disp', 'drat', 'wt', 'qsec', 'vs', 'gear', 'carb'], dtype='object')

# renaming
>>> dp(mtcars).select({'model': "name"}).columns.pd
Index(['model'], dtype='object')

See select and column_specification for full details.

Note

This verb shadows pandas.DataFrame.select(), which is deprecated.
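
For readers more familiar with pandas, the selection forms shown above correspond roughly to the operations below. This is a sketch in plain pandas, not dppd's implementation; the one-row `cars` frame is a hypothetical stand-in with a subset of the mtcars columns.

```python
import pandas as pd

# Hypothetical stand-in for mtcars with only a few of its columns.
cars = pd.DataFrame(
    {"name": ["Mazda RX4"], "mpg": [21.0], "cyl": [6], "hp": [110], "carb": [4]}
)

# select([X.name, 'hp']): pick columns by name.
by_name = cars[["name", "hp"]]

# select(X.columns.str.startswith('c')): pick columns by a boolean mask.
by_mask = cars.loc[:, cars.columns.str.startswith("c")]

# select(['-hp', '-cyl']): keep everything except the listed columns.
dropped = cars.drop(columns=["hp", "cyl"])

# select({'model': 'name'}): pick a column and rename it.
renamed = cars[["name"]].rename(columns={"name": "model"})

print(list(by_mask.columns))
```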

filter_by

Filter a DataFrame’s rows.

Examples:

# by a comparison / boolean vector
>>> dp(mtcars).filter_by(X.hp > 100).head(2).pd
            name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0      Mazda RX4  21.0    6  160.0  110   3.9  2.620  16.46   0   1     4     4
1  Mazda RX4 Wag  21.0    6  160.0  110   3.9  2.875  17.02   0   1     4     4

# by an existing column
>>> dp(mtcars).filter_by(X.am).head(2).pd
            name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0      Mazda RX4  21.0    6  160.0  110   3.9  2.620  16.46   0   1     4     4
1  Mazda RX4 Wag  21.0    6  160.0  110   3.9  2.875  17.02   0   1     4     4

# by a callback
>>> dp(mtcars).filter_by(lambda X: np.random.rand(len(X)) < 0.5).head(2).pd
        name   mpg  cyl   disp   hp  drat    wt   qsec  vs  am  gear  carb
6  Duster 360  14.3    8  360.0  245  3.21  3.57  15.84   0   0     3     4
7   Merc 240D  24.4    4  146.7   62  3.69  3.19  20.00   1   0     4     2

See filter_by for full details.

Note

This function is not called filter so as not to shadow pandas.DataFrame.filter().
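
The three filtering styles above have close plain-pandas analogues, sketched below. The sketch does not use dppd; the four-row `cars` frame is a hypothetical stand-in built from hp and am values shown in this section, and the explicit astype(bool) is an assumption about how an integer 0/1 column would be treated as a mask.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: values copied from the outputs in this section.
cars = pd.DataFrame(
    {
        "name": ["Mazda RX4", "Mazda RX4 Wag", "Duster 360", "Merc 240D"],
        "hp": [110, 110, 245, 62],
        "am": [1, 1, 0, 0],
    }
)

# By a comparison / boolean vector.
by_comparison = cars[cars["hp"] > 100]

# By an existing column, converted to a boolean mask explicitly.
by_column = cars[cars["am"].astype(bool)]

# By a callback that receives the DataFrame and returns a boolean mask.
by_callback = cars.loc[lambda df: df["hp"] < 100]

print(by_callback)
```

A deterministic callback is used here instead of the random one above so that the result is reproducible.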

arrange

Sort a DataFrame by a column_specification.

>>> dp(mtcars).arrange([X.hp, X.qsec], ascending=[False, True]).select(['name','hp','qsec']).head(5).pd
                 name   hp   qsec
30      Maserati Bora  335  14.60
28     Ford Pantera L  264  14.50
23         Camaro Z28  245  15.41
6          Duster 360  245  15.84
16  Chrysler Imperial  230  17.42
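
The per-column sort directions in the example above behave like the ascending list of pandas.DataFrame.sort_values(). A minimal plain-pandas sketch follows; it is not dppd's implementation, and the `cars` frame is a hypothetical stand-in holding four of the rows shown in the output.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: four rows from the output above, shuffled.
cars = pd.DataFrame(
    {
        "name": ["Duster 360", "Camaro Z28", "Maserati Bora", "Ford Pantera L"],
        "hp": [245, 245, 335, 264],
        "qsec": [15.84, 15.41, 14.60, 14.50],
    }
)

# Sort by hp descending, then break ties by qsec ascending.
ordered = cars.sort_values(["hp", "qsec"], ascending=[False, True])
print(ordered)
```

The two cars with hp 245 show the tie-break: Camaro Z28 (qsec 15.41) sorts before Duster 360 (qsec 15.84).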

summarize

Summarize the columns in a DataFrame with callbacks.

Example:

>>> dp(mtcars).summarize(
...     ('hp', np.min),
...     ('hp', np.max),
...     ('hp', np.mean),
...     ('hp', np.std),
...     ).pd
  hp_amin  hp_amax   hp_mean     hp_std
0       52      335  146.6875  67.483071
>>> dp(mtcars).summarize(
...   ('hp', np.min, 'min(hp)'),
...   ('hp', np.max, 'max(hp)'),
...   ('hp', np.mean, 'mean(hp)'),
...   ('hp', np.std, 'stddev(hp)'),
...   ).pd
  min(hp)  max(hp)  mean(hp)  stddev(hp)
0       52      335  146.6875   67.483071

Summarize is most useful with grouped DataFrames.
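
To illustrate what a grouped summary produces, here is a plain-pandas sketch using named aggregation. It shows the general idea of one summary row per group rather than dppd's own call syntax; the `cars` frame is a hypothetical stand-in, and the output column names are chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: cyl and hp values from rows shown in this section.
cars = pd.DataFrame({"cyl": [6, 6, 8, 4], "hp": [110, 110, 245, 62]})

# One row per cyl group, one column per (column, function) pair.
summary = (
    cars.groupby("cyl")
    .agg(hp_min=("hp", "min"), hp_max=("hp", "max"), hp_mean=("hp", "mean"))
    .reset_index()
)
print(summary)
```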

do

Map each group of a grouped DataFrame to a new DataFrame and concatenate the results. Easier shown than explained:

>>> dp(mtcars).groupby('cyl').add_count().ungroup().sort_index().head(5).select(['name','cyl','count']).pd
                name  cyl  count
0          Mazda RX4    6      7
1      Mazda RX4 Wag    6      7
2         Datsun 710    4     11
3     Hornet 4 Drive    6      7
4  Hornet Sportabout    8     14
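
The idea of mapping each group to its own DataFrame and concatenating the pieces can also be sketched in plain pandas. The code below does not call dppd's do(); the helper `count_and_unique` and the `cars` frame are hypothetical names introduced for this illustration only.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: cyl and hp values from rows shown in this section.
cars = pd.DataFrame({"cyl": [6, 6, 8, 4], "hp": [110, 110, 245, 62]})


def count_and_unique(group):
    """Map one group to a single-row DataFrame (hypothetical helper)."""
    return pd.DataFrame(
        {"count": [len(group)], "unique_hp": [group["hp"].nunique()]}
    )


pieces = []
for cyl, group in cars.groupby("cyl"):
    piece = count_and_unique(group)
    piece.insert(0, "cyl", cyl)  # keep the group key alongside the result
    pieces.append(piece)

# Concatenate the per-group results into one DataFrame.
result = pd.concat(pieces, ignore_index=True)
print(result)
```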