dppd package

Submodules

dppd.base module

class dppd.base.Dppd(df, dppd_proxy, X, parent)[source]

Bases: object

Dataframe maniPulater maniPulates Dataframes

A DataFrame manipulation object, offering verbs, and each verb returns another Dppd.

All pandas.DataFrame methods have been turned into verbs. Accessors like loc also work.

pd

Return the actual, unproxied DataFrame

class dppd.base.dppd(df=None)[source]

Bases: object

Context manager for Dppd.

Usage:

```
with dppd(mtcars) as (dp, X):
    dp.groupby('cyl')
    dp.arrange(X.hp)
    dp.head(1)
print(X)
```

Both X and dp are proxied DataFrames after the context manager exits. They should work just like a DataFrame; use X.pd to convert back into a true DataFrame.

Alternate usage:

dp, X = dppd()
dp(df).mutate(y=X['column'] * 2, ...).filter(...).select(...).pd

or:

dp(df).mutate(...)
dp.filter()
dp.select()
new_df = dp.pd
dppd.base.register_property(name, types=None)[source]

Register a property/indexed accessor to be forwarded (.something[])

dppd.base.register_type_methods_as_verbs(cls, excluded)[source]
class dppd.base.register_verb(name=None, types=None, pass_dppd=False)[source]

Bases: object

Register a function to act as a Dppd verb. The first parameter of the function must be the DataFrame being worked on. Note that for grouped Dppds, the function gets called once per group.

Example:

register_verb('upper_case_all_columns')(
    lambda df: df.assign(**{
        name: df[name].str.upper() for name in df.columns}))

dppd.column_spec module

dppd.column_spec.parse_column_specification(df, column_spec, return_list=False)[source]

Parse a column specification

Parameters:
  • column_spec (various) –
    • str, [str] - select columns by name (always returns a DataFrame, never a Series)
    • [b, a, -c, True] - select b, a (by name, in order) drop c, then add everything else in alphabetical order
    • pd.Series / np.ndarray, dtype == bool: select columns matching this bool vector, example: select(X.name.str.startswith('c'))
    • pd.Series, [pd.Series] - select columns by series.name
    • "-column_name" or ["-column_name1", "-column_name2"]: drop these columns and keep the rest (or invert the sort order in arrange)
    • pd.Index - interpreted as a list of column names - example: select(X.select_dtypes(int).columns)
    • (regexps_str, ) tuple - run re.search() on each column name
    • (regexps_str, None, regexps_str ) tuple - run re.search() on each level of the column names. Logical and (like DataFrame.xs but more so).
    • {level: regexps_str, …} dict - run re.search() on these levels (logical and)
    • a callable f, which takes a string column name and returns a bool whether to include the column.
    • a type, in which case the request is forwarded to pandas.DataFrame.select_dtypes(include=...). Example: numpy.number
    • None -> all columns
  • return_list (int) –
    • If return_list is falsy, return a boolean vector.
    • If return_list is True, return a list of columns, either in input order (if available), or in df.columns order if not.
    • If return_list is 2, return (forward_list, reverse_list) if the input was a list; otherwise see 'return_list is True'.
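Several of these specs reduce to a boolean mask over the column index. A plain-pandas sketch of the bool-vector case (data is illustrative, not part of dppd):

```python
import pandas as pd

# Illustrative only: a boolean vector over df.columns selects the matching
# columns, which is what a spec like select(X.name.str.startswith('c'))
# resolves to internally.
df = pd.DataFrame({"cyl": [4], "carb": [1], "hp": [110]})
mask = df.columns.str.startswith("c")
selected = df.loc[:, mask]
print(selected.columns.tolist())  # ['cyl', 'carb']
```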
dppd.column_spec.series_and_strings_to_names(columns)[source]

dppd.non_df_verbs module

dppd.non_df_verbs.collection_counter_to_df(counter, key_name='key', count_name='counts')[source]

Turn a collections.Counter into a DataFrame with two columns: key & counts (names configurable via key_name / count_name)
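A hedged pandas-only sketch of the same transformation (the helper name here is illustrative; dppd's actual implementation may differ):

```python
from collections import Counter
import pandas as pd

def counter_to_df(counter, key_name="key", count_name="counts"):
    # Mirror the documented defaults: one row per key, one column of counts.
    return pd.DataFrame({key_name: list(counter.keys()),
                         count_name: list(counter.values())})

df = counter_to_df(Counter("abbccc"))
print(df["key"].tolist())     # ['a', 'b', 'c']
print(df["counts"].tolist())  # [1, 2, 3]
```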

dppd.single_verbs module

dppd.single_verbs.add_count(df)[source]

Verb: Add the cardinality of a row’s group to the row as column ‘count’

dppd.single_verbs.arrange_DataFrame(df, column_spec, kind='quicksort', na_position='last')[source]

Sort DataFrame based on column spec.

Wrapper around sort_values

Parameters:
  • column_spec (column specification) – see dppd.single_verbs.parse_column_specification()
dppd.single_verbs.arrange_DataFrameGroupBy(grp, column_spec, kind='quicksort', na_position='last')[source]
dppd.single_verbs.astype_DataFrame(df, columns, dtype, **kwargs)[source]
dppd.single_verbs.binarize(df, col_spec, drop=True)[source]

Convert categorical columns into ‘regression columns’, i.e. X with values a,b,c becomes three binary columns X-a, X-b, X-c which are True exactly where X was a, etc.
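The effect can be approximated with pandas.get_dummies; the 'X-a' naming from the docstring corresponds to prefix_sep='-' (a sketch, not dppd's actual code):

```python
import pandas as pd

df = pd.DataFrame({"X": ["a", "b", "c", "a"]})
# One binary column per category, named X-a, X-b, X-c as in the docstring;
# each is truthy exactly where X held that value.
out = pd.get_dummies(df, columns=["X"], prefix_sep="-")
print(out.columns.tolist())  # ['X-a', 'X-b', 'X-c']
```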

dppd.single_verbs.categorize_DataFrame(df, columns=None, categories=<object object>, ordered=None)[source]

Turn columns into pandas.Categorical. By default, categories are ordered by their first occurrence in the column. Pass False to let pd.Categorical sort them alphabetically, or 'natsorted' to have them passed through natsort.natsorted.
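The 'ordered by occurrence' default can be sketched with plain pandas (illustrative only):

```python
import pandas as pd

s = pd.Series(["b", "a", "b", "c"])
# Categories in order of first occurrence, as the docstring describes...
by_occurrence = pd.Categorical(s, categories=s.unique())
# ...versus pandas' default, which sorts them alphabetically.
default = pd.Categorical(s)
print(list(by_occurrence.categories))  # ['b', 'a', 'c']
print(list(default.categories))        # ['a', 'b', 'c']
```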

dppd.single_verbs.colspec_DataFrame(df, columns, invert=False)[source]

Return columns as defined by your column specification, so you can use a colspec in set_index etc.

Parameters:
  • columns (column specification) – see dppd.single_verbs.parse_column_specification()
dppd.single_verbs.concat_DataFrame(df, other, axis=0)[source]

Verb: Concat this and one or more others.

Wrapper around pandas.concat().

Parameters:
  • other (df or [df, df, ..]) –
  • axis (int) – join on rows (axis=0) or columns (axis=1)
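Since this wraps pandas.concat(), the axis options behave as below (a plain-pandas sketch with illustrative data):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})
# axis=0 stacks rows; axis=1 would instead align on the index and add columns.
rows = pd.concat([a, b], axis=0)
print(rows["x"].tolist())  # [1, 2, 3]
```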
dppd.single_verbs.distinct_dataframe(df, column_spec=None, keep='first')[source]

Verb: select distinct/unique rows

Parameters:
  • column_spec (column specification) – only consider these columns when deciding on duplication see dppd.single_verbs.parse_column_specification()
  • keep (str) – which instance to keep in case of duplicates (see pandas.DataFrame.duplicated())
Returns:

with possibly fewer rows, but unchanged columns.

Return type:

DataFrame
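In plain pandas terms this corresponds to drop_duplicates with a column subset (a sketch; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 4, 6], "hp": [93, 93, 175]})
# Only 'cyl' decides duplication; keep='first' mirrors the documented default.
out = df.drop_duplicates(subset=["cyl"], keep="first")
print(out["hp"].tolist())  # [93, 175]
```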

dppd.single_verbs.distinct_series(df, keep='first')[source]

Verb: select distinct values from Series

Parameters:keep (str) – which instance to keep in case of duplicates (see pandas.Series.duplicated())
dppd.single_verbs.do(obj, func, *args, **kwargs)[source]

Verb: Do anything to any DataFrame, returning new dataframes

Apply func to each group, collect results, concat them into a new DataFrame with the group information.

Parameters:func (callable) – Should take and return a DataFrame

Example:

>>> def count_and_count_unique(df):
...     return pd.DataFrame({"count": [len(df)], "unique": [(~df.duplicated()).sum()]})
...
>>> dp(mtcars).select(['cyl','hp']).group_by('cyl').do(count_and_count_unique).pd
   cyl  count  unique
0    4     11      10
1    6      7       4
2    8     14       9
dppd.single_verbs.drop_DataFrameGroupBy(grp, *args, **kwargs)[source]
dppd.single_verbs.ends(df, n=5)[source]

Head(n)&Tail(n) at once
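A minimal sketch of ends(n) in plain pandas, assuming it simply concatenates head and tail:

```python
import pandas as pd

df = pd.DataFrame({"x": range(20)})
n = 3
# head(n) and tail(n) at once, assuming simple concatenation.
both_ends = pd.concat([df.head(n), df.tail(n)])
print(both_ends["x"].tolist())  # [0, 1, 2, 17, 18, 19]
```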

dppd.single_verbs.filter_by(obj, filter_arg)[source]

Filter DataFrame

Parameters:filter_arg (Series or array or callable or dict or str) –
  • Series/array, dtype == bool: filter by .loc[filter_arg]
  • callable: expected to return a Series(dtype=bool)
  • str: a column name -> .loc[X[filter_arg].astype(bool)]
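The documented filter_arg forms can be sketched in plain pandas (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"hp": [110, 93, 175], "fast": [0, 0, 1]})
# Series/array of bool: used directly with .loc
by_mask = df.loc[df["hp"] > 100]
# callable: expected to return a boolean Series
by_callable = df.loc[(lambda d: d["hp"] > 100)(df)]
# str column name: the column is cast to bool and used as the mask
by_column = df.loc[df["fast"].astype(bool)]
print(by_mask["hp"].tolist())    # [110, 175]
print(by_column["hp"].tolist())  # [175]
```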
dppd.single_verbs.gather(df, key, value, value_var_column_spec=None)[source]

Verb: Gather multiple columns and collapse them into two.

This used to be called melting; gather is a column-spec-aware wrapper around pd.melt.

Parameter order is as in dplyr.

Parameters:
  • key (str) – name of the new ‘variable’ column
  • value (str) – name of the new ‘value’ column
  • value_var_column_spec (column specification) – which columns contain the values to be mapped into key/value pairs? see dppd.single_verbs.parse_column_specification()

Inverse of dppd.single_verbs.spread.

Example

>>> dp(mtcars).select(['name','hp', 'cyl']).gather('variable', 'value', '-name').pd.head()
                name variable  value
0          Mazda RX4       hp    110
1      Mazda RX4 Wag       hp    110
2         Datsun 710       hp     93
3     Hornet 4 Drive       hp    110
4  Hornet Sportabout       hp    175
dppd.single_verbs.group_extract_params(grp)[source]
dppd.single_verbs.group_variables(grp)[source]
dppd.single_verbs.identity(df)[source]

Verb: No-op.

dppd.single_verbs.iter_tuples_DataFrameGroupBy(grp)[source]
dppd.single_verbs.itergroups_DataFrame(df)[source]
dppd.single_verbs.itergroups_DataFrameGroupBy(grp)[source]
dppd.single_verbs.log2(df)[source]
dppd.single_verbs.mutate_DataFrame(df, **kwargs)[source]

Verb: add columns to a DataFrame defined by kwargs:

Parameters:kwargs (scalar, pd.Series, callable, dict) –
  • scalar, pd.Series -> assign column
  • callable - call callable(df) and assign result
  • dict (None: column) - result of itergroups on non-grouped DF to have parity with mutate_DataFrameGroupBy

Examples

add a rank for one column:

dp(mtcars).mutate(hp_rank = X.hp.rank())

rank all columns:

# dict comprehension for illustrative purposes
dp(mtcars).mutate(**{f"{column}_rank": X[column].rank() for column in X.columns}).pd
# more efficient
dp(mtcars).rank().pd

one rank per group using callback:

dp(mtcars).group_by('cyl').mutate(rank = lambda X: X['hp'].rank()).pd

add_count variant 1 (see dppd.single_verbs.add_count()):

dp(mtcars).group_by('cyl').mutate(count=lambda x: len(x)).pd

add_count variant 2:

dp(mtcars).group_by('cyl').mutate(count={grp: len(sub_df) for (grp, sub_df) in X.itergroups()}).pd
dppd.single_verbs.mutate_DataFrameGroupBy(grp, **kwargs)[source]

Verb: add columns to the DataFrame used in the GroupBy.

Parameters:**kwargs (scalar, pd.Series, callable, dict) –
  • scalar, pd.Series -> assign column
  • callable - call callable once per group (sub_df) and assign result
  • dict {grp_key: scalar_or_series}: assign this (these) value(s) for the group. Use in conjunction with dppd.Dppd.itergroups.
dppd.single_verbs.natsort_DataFrame(df, column)[source]
dppd.single_verbs.norm_0_to_1(df, axis=1)[source]

Normalize a (numeric) data frame so that each row (axis=1) or column (axis=0) ranges from 0 to 1. Useful for PCA, correlation, etc., since the dimensions then become comparable in size.
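A hedged sketch of the per-row (axis=1) case, assuming the usual (x - min) / (max - min) scaling; dppd's exact implementation may differ:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 0.0], "b": [3.0, 10.0]})
# Scale each row to the 0..1 range: (x - row_min) / (row_max - row_min).
row_min = df.min(axis=1)
row_max = df.max(axis=1)
normed = df.sub(row_min, axis=0).div(row_max - row_min, axis=0)
print(normed["a"].tolist())  # [0.0, 0.0]
print(normed["b"].tolist())  # [1.0, 1.0]
```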

dppd.single_verbs.norm_zscore(df, axis=1)[source]

Apply a z-score transform (X - mu) / std via scipy.stats.zscore on the given axis

dppd.single_verbs.pca_dataframe(df, whiten=False, random_state=None)[source]

Perform a 2-component PCA using sklearn.decomposition.PCA. Expects samples in rows! Returns a DataFrame (sample, 1st, 2nd) with an additional explained_variance_ratio_ attribute.

dppd.single_verbs.print_DataFrameGroupBy(grps)[source]
dppd.single_verbs.reset_columns_DataFrame(df, new_columns=None)[source]

Rename all columns in a dataframe (and return a copy). Possible new_columns values:

  • None: df.columns = list(df.columns)
  • List: df.columns = new_columns
  • callable: df.columns = [new_columns(x) for x in df.columns]
  • str && df.shape[1] == 1: df.columns = [new_columns]

new_columns=None is useful when you have transposed a DataFrame with a categorical index and can no longer assign columns. (Arguably a pandas bug.)
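The callable variant can be sketched with plain pandas (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Hello": [1], "World": [2]})
# The callable case: apply new_columns(x) to every existing column name.
df.columns = [x.lower() for x in df.columns]
print(df.columns.tolist())  # ['hello', 'world']
```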

dppd.single_verbs.select_DataFrame(df, columns)[source]

Verb: Pick columns from a DataFrame

Improved variant of df[columns]

Parameters:
  • columns (column specification or dict) – see dppd.single_verbs.parse_column_specification(). For the previous 'rename on dict' behaviour, see select_and_rename.
dppd.single_verbs.select_DataFrameGroupBy(grp, columns)[source]
dppd.single_verbs.select_and_rename_DataFrame(df, columns)[source]

Verb: Pick columns from a DataFrame, and rename them in the process

Parameters:columns (dict) – {new_name: old_name} - select and rename. old_name may be a str or a Series (in which case the .name attribute is used)
dppd.single_verbs.seperate(df, column, new_names, sep='.', remove=False)[source]

Verb: split strings on a separator.

Inverse of unite()

Parameters:
  • column (str or pd.Series) – column to split on (Series.name is used in case of a Series)
  • new_names (list) – list of new column names
  • sep (str) – what to split on (pd.Series.str.split)
  • remove (bool) – whether to drop the original column
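In plain pandas, the split itself looks like this (column names are illustrative; dppd's seperate additionally handles the remove flag):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a.b", "c.d"]})
# Split on the literal separator and expand into the new columns.
df[["left", "right"]] = df["col"].str.split(".", regex=False, expand=True)
print(df["left"].tolist())   # ['a', 'c']
print(df["right"].tolist())  # ['b', 'd']
```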
dppd.single_verbs.sort_values_DataFrameGroupBy(grp, column_spec, kind='quicksort', na_position='last')[source]

Alias for arrange for groupby-objects

dppd.single_verbs.spread(df, key, value)[source]

Verb: Spread a key-value pair across multiple columns

Parameters:
  • key (str or pd.Series) – key column to spread (if series, .name is used)
  • value (str or pd.Series) – value column to spread (if series, .name is used)

Inverse of dppd.single_verbs.gather.

Example

>>> df = pd.DataFrame({'key': ['a','b'] * 5, 'id': ['c','c','d','d','e','e','f','f','g','g'], 'value':np.random.rand(10)})
>>> dp(df).spread('key','value').pd
key id         a         b
0    c  0.650358  0.931324
1    d  0.633024  0.380125
2    e  0.983000  0.367837
3    f  0.989504  0.706933
4    g  0.245418  0.108165
dppd.single_verbs.summarize(obj, *args)[source]

Summarize by group.

Parameters:*args (tuples) – (column_to_use, function_to_call) or (column_to_use, function_to_call, new_column_name)
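A (column_to_use, function_to_call, new_column_name) tuple corresponds to a pandas named aggregation; a sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 4, 6], "hp": [93, 110, 175]})
# The tuple ('hp', 'mean', 'hp_mean') expressed as a pandas named aggregation.
out = df.groupby("cyl").agg(hp_mean=("hp", "mean")).reset_index()
print(out["hp_mean"].tolist())  # [101.5, 175.0]
```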
dppd.single_verbs.to_frame_dict(d, **kwargs)[source]

pd.DataFrame.from_dict(d, **kwargs), so you can say dp({}).to_frame()

dppd.single_verbs.transassign(df, **kwargs)[source]

Verb: Creates a new dataframe from the columns of the old.

This means the index and row count are preserved

dppd.single_verbs.ungroup_DataFrameGroupBy(grp)[source]
dppd.single_verbs.unique_in_order(seq)[source]
dppd.single_verbs.unite(df, column_spec, sep='_')[source]

Verb: string join multiple columns

Parameters:
  • column_spec (column_spec) – which columns to join. see dppd.single_verbs.parse_column_specification()
  • sep (str) – Separator to join on
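The string join itself can be sketched with pandas str.cat (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"]})
# Join the selected columns row-wise with the separator.
joined = df["a"].str.cat(df["b"], sep="_")
print(joined.tolist())  # ['x_1', 'y_2']
```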
dppd.single_verbs.unselect_DataFrame(df, columns)[source]

Verb: Select via an inverted column spec (i.e. everything but these columns)

Parameters:columns (column specification) – see dppd.single_verbs.parse_column_specification()
dppd.single_verbs.unselect_DataFrameGroupBy(grp, columns)[source]

Verb: Select via an inverted column spec (i.e. everything but these columns)

Parameters:columns (column specification) – see dppd.single_verbs.parse_column_specification()

Module contents