Groups and summaries
Dppd’s grouping is based on pandas.DataFrame.groupby(), which is supported in the fluent API:
>>> dp(mtcars).groupby('cyl').mean().filter_by(X.hp>100).select(['mpg', 'disp', 'hp']).pd
mpg disp hp
cyl
6 19.742857 183.314286 122.285714
8 15.100000 353.100000 209.214286
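For comparison, the same chain can be sketched in plain pandas without dppd. The `cars` frame below is a small hypothetical stand-in for mtcars, so the numbers differ from the output above; this illustrates the idea and is not a description of dppd's internals.

```python
import pandas as pd

# Hypothetical six-row stand-in for mtcars (numeric columns only).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "mpg": [22.8, 24.4, 21.0, 21.4, 18.7, 14.3],
    "disp": [108.0, 146.7, 160.0, 258.0, 360.0, 360.0],
    "hp": [93, 62, 110, 110, 175, 245],
})

# Group means, indexed by cyl.
means = cars.groupby("cyl").mean()

# Keep groups whose mean hp exceeds 100, then pick three columns.
result = means[means["hp"] > 100][["mpg", "disp", "hp"]]
print(result)
```

The fluent version expresses the same three steps (aggregate, filter, select) as one chain.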
Select, mutate and filter_by work on the underlying DataFrame:
>>> dp(mtcars).groupby('cyl').select('name').head(1).pd
name cyl
0 Mazda RX4 6
2 Datsun 710 4
4 Hornet Sportabout 8
# Note how selecting on a DataFrameGroupBy always preserves the grouping columns.
During a mutate, X is the DataFrameGroupBy object, so the ranks are computed per group:
>>> dp(mtcars).groupby('cyl').mutate(hp_rank=X.hp.rank()).ungroup().select(['name', 'cyl', 'hp', 'hp_rank']).pd.head()
name cyl hp hp_rank
0 Mazda RX4 6 110 3.0
1 Mazda RX4 Wag 6 110 3.0
2 Datsun 710 4 93 7.0
3 Hornet 4 Drive 6 110 3.0
4 Hornet Sportabout 8 175 3.5
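A rough plain-pandas counterpart of the per-group rank is shown below. It uses only the five rows printed above as a hypothetical excerpt, so the ranks differ from those computed on the full dataset.

```python
import pandas as pd

# Five-row excerpt; the full mtcars data has more rows per group.
cars = pd.DataFrame({
    "name": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
             "Hornet 4 Drive", "Hornet Sportabout"],
    "cyl": [6, 6, 4, 6, 8],
    "hp": [110, 110, 93, 110, 175],
})

# rank() on a grouped column ranks within each cyl group;
# ties receive the average of their positions (the pandas default).
ranked = cars.assign(hp_rank=cars.groupby("cyl")["hp"].rank())
print(ranked)
```

In this excerpt the three tied six-cylinder cars each get rank 2.0, the average of positions 1 to 3.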
And the same in filter_by:
>>> dp(mtcars).groupby('cyl').filter_by(X.hp.rank() <= 2).ungroup().select(['name', 'cyl', 'hp']).pd
name cyl hp
5 Valiant 6 105
7 Merc 240D 4 62
18 Honda Civic 4 52
21 Dodge Challenger 8 150
22 AMC Javelin 8 150
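The per-group filter can likewise be approximated in plain pandas with a boolean mask built from a grouped rank. The data below is a hypothetical excerpt, and the sketch is not meant to mirror how dppd evaluates `X` internally.

```python
import pandas as pd

# Hypothetical excerpt with two cyl groups.
cars = pd.DataFrame({
    "name": ["Valiant", "Merc 240D", "Honda Civic",
             "Mazda RX4", "Datsun 710", "Fiat 128"],
    "cyl": [6, 4, 4, 6, 4, 4],
    "hp": [105, 62, 52, 110, 93, 66],
})

# The mask is aligned with the original row index, so it can be
# used directly to select rows of the ungrouped frame.
keep = cars.groupby("cyl")["hp"].rank() <= 2
filtered = cars[keep]
print(filtered)
```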
Note that both mutate and filter_by work with callables, which are applied per group, either directly or via pandas.DataFrameGroupBy.apply():
>>> a = dp(mtcars).groupby('cyl').mutate(str_count = lambda x: "%.2i" % len(x)).ungroup().pd
>>> b = dp(mtcars).groupby('cyl').mutate(str_count = X.apply(lambda x: "%.2i" % len(x))).ungroup().pd
>>> (a == b).all().all()
True
>>> a.head()
cyl name mpg disp hp drat wt qsec vs am gear carb str_count
0 6 Mazda RX4 21.0 160.0 110 3.90 2.620 16.46 0 1 4 4 07
1 6 Mazda RX4 Wag 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4 07
2 4 Datsun 710 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1 11
3 6 Hornet 4 Drive 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1 07
4 8 Hornet Sportabout 18.7 360.0 175 3.15 3.440 17.02 0 0 3 2 14
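One way to picture what "applied per group" means is an explicit loop over the groups that writes each group's result back to its rows. This is a plain-pandas illustration under that assumption, not dppd's implementation; `per_group_label` is a hypothetical helper.

```python
import pandas as pd

cars = pd.DataFrame({
    "cyl": [6, 6, 4, 6, 8],
    "hp": [110, 110, 93, 110, 175],
})

def per_group_label(sub_df):
    # Zero-padded group size, same format string as in the example above.
    return "%.2i" % len(sub_df)

# Call the function once per group and broadcast the scalar
# result to every row belonging to that group.
labels = pd.Series(index=cars.index, dtype=object)
for key, sub_df in cars.groupby("cyl"):
    labels.loc[sub_df.index] = per_group_label(sub_df)

result = cars.assign(str_count=labels)
print(result)
```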
Summaries
First off, you can summarize groupby objects with the usual pandas methods, such as pandas.DataFrame.agg(), and stay in the pipe:
>>> dp(mtcars).groupby('cyl').agg([np.mean, np.std]).select(['hp', 'gear']).pd
hp gear
mean std mean std
cyl
4 82.636364 20.934530 4.090909 0.539360
6 122.285714 24.260491 3.857143 0.690066
8 209.214286 50.976886 3.285714 0.726273
# Note the interaction of select and the MultiIndex column names.
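The two-level column names come from pandas itself: aggregating with a list of functions yields MultiIndex columns, and selecting by top-level name keeps every sub-column under it. A minimal plain-pandas sketch on hypothetical data:

```python
import pandas as pd

cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "hp": [93, 62, 110, 105],
    "gear": [4, 4, 4, 3],
    "mpg": [22.8, 24.4, 21.0, 18.1],
})

# A list of aggregations produces (column, function) column pairs.
stats = cars.groupby("cyl").agg(["mean", "std"])

# Selecting by the first level keeps both 'mean' and 'std' per column.
subset = stats[["hp", "gear"]]
print(subset.columns.tolist())
```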
In addition, we have the summarize verb, which takes any number of tuples (column_name, function) or (column_name, function, new_name) as arguments:
>>> (dp(mtcars).groupby('cyl').summarize(('hp', np.mean), ('hp', np.std), ('gear', np.mean), ('gear', np.std)).pd)
cyl hp_mean hp_std gear_mean gear_std
0 4 82.636364 19.960291 4.090909 0.514259
1 6 122.285714 22.460850 3.857143 0.638877
2 8 209.214286 49.122556 3.285714 0.699854
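The std columns here are smaller than those in the agg output further up. The ratios match what one would expect if the agg path used the sample standard deviation (ddof=1, the pandas default) and the summarize path called np.std directly (ddof=0, the NumPy default). That is an inference from the printed numbers, not a documented statement about dppd. The sketch below shows the two conventions on hypothetical data, together with pandas named aggregation, which produces a similarly flat result.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "cyl": [4, 4, 4, 6, 6],
    "hp": [93, 62, 52, 110, 105],
})

# Named aggregation: flat column names, grouping key as a column.
summary = (
    cars.groupby("cyl")
    .agg(hp_mean=("hp", "mean"), hp_std=("hp", "std"))
    .reset_index()
)

hp4 = cars.loc[cars["cyl"] == 4, "hp"].to_numpy()
population_std = np.std(hp4)       # NumPy default, ddof=0
sample_std = np.std(hp4, ddof=1)   # matches the pandas "std" default

print(summary)
print(population_std, sample_std)
```

If the sample standard deviation is wanted inside summarize, passing a function such as `lambda s: np.std(s, ddof=1)` should give it, assuming summarize calls the function on each group's column.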