dppd package¶
Submodules¶
dppd.base module¶
- class dppd.base.Dppd(df, dppd_proxy, X, parent)[source]¶ Bases: object

Dataframe maniPulator maniPulates Dataframes.

A DataFrame manipulation object: it offers verbs, and each verb returns another Dppd.

All pandas.DataFrame methods have been turned into verbs. Accessors like loc also work.
- pd¶ Return the actual, unproxied DataFrame.
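Since every pandas.DataFrame method is forwarded as a verb, ordinary pandas calls chain through the proxy. A minimal sketch (the toy mtcars frame here is illustrative, using the dppd() entry point documented below):

```
import pandas as pd
from dppd import dppd

mtcars = pd.DataFrame({"name": ["Mazda RX4", "Datsun 710"],
                       "hp": [110, 93], "cyl": [6, 4]})

dp, X = dppd()
# sort_values and head are plain pandas methods, forwarded as verbs;
# .pd at the end returns the real, unproxied DataFrame
result = dp(mtcars).sort_values("hp").head(1).pd
```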
- class dppd.base.dppd(df=None)[source]¶ Bases: object

Context manager for Dppd.

Usage:

```
with dppd(mtcars) as (dp, X):
    dp.groupby('cyl')
    dp.arrange(X.hp)
    dp.head(1)
print(X)
```

After the context manager, both X and dp are proxied DataFrames. They should work just like a DataFrame; use X.pd() to convert one into a true DataFrame.
Alternate usage:

```
dp, X = dppd()
dp(df).mutate(y=X['column'] * 2, ...).filter(...).select(...).pd
```

or:

```
dp(df).mutate(...)
dp.filter()
dp.select()
new_df = dp.pd
```
- dppd.base.register_property(name, types=None)[source]¶ Register a property/indexed accessor to be forwarded (.something[]).
- class dppd.base.register_verb(name=None, types=None, pass_dppd=False)[source]¶ Bases: object

Register a function to act as a Dppd verb. The first parameter of the function must be the DataFrame being worked on. Note that for grouped Dppds, the function gets called once per group.

Example:

```
register_verb('upper_case_all_columns')(
    lambda df: df.assign(**{
        name: df[name].str.upper() for name in df.columns
    })
)
```
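Once registered, the new verb chains like any built-in one. A brief usage sketch, assuming the registration above has been executed and that all columns hold strings:

```
import pandas as pd
from dppd import dppd

dp, X = dppd()
# the freshly registered verb upper-cases every (string) column
df = dp(pd.DataFrame({"a": ["x", "y"]})).upper_case_all_columns().pd
```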
dppd.column_spec module¶
- dppd.column_spec.parse_column_specification(df, column_spec, return_list=False)[source]¶ Parse a column specification.
Parameters:

- column_spec (various) –
  - str, [str]: select columns by name (always returns a DataFrame, never a Series)
  - [b, a, -c, True]: select b, a (by name, in order), drop c, then add everything else in alphabetical order
  - pd.Series / np.ndarray with dtype == bool: select columns matching this boolean vector, example: select(X.name.str.startswith('c'))
  - pd.Series, [pd.Series]: select columns by series.name
  - "-column_name" or ["-column_name1", "-column_name2"]: drop all other columns (or invert the sort order in arrange)
  - pd.Index: interpreted as a list of column names, example: select(X.select_dtypes(int).columns)
  - (regexps_str, ) tuple: run re.search() on each column name
  - (regexps_str, None, regexps_str) tuple: run re.search() on each level of the column names. Logical and (like DataFrame.xs, but more so).
  - {level: regexps_str, …} dict: re.search() on these levels (logical and)
  - a callable f, which takes a string column name and returns a bool whether to include the column
  - a type, in which case the request is forwarded to pandas.DataFrame.select_dtypes(include=…). Example: numpy.number
  - None: all columns
- return_list (int) –
  - If return_list is falsy, return a boolean vector.
  - If return_list is True, return a list of columns, either in input order (if available) or in df.columns order otherwise.
  - If return_list is 2, return (forward_list, reverse_list) if the input was a list; otherwise behave as for return_list is True.
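A few of these forms in action with the select verb; a minimal sketch, assuming mtcars (e.g. from plotnine.data) with its usual columns:

```
import numpy as np
from dppd import dppd
from plotnine.data import mtcars  # any DataFrame works; mtcars is just convenient

dp, X = dppd()
dp(mtcars).select(['name', 'hp']).pd                  # by name, in order
dp(mtcars).select('-name').pd                         # everything except name
dp(mtcars).select(X.columns.str.startswith('c')).pd   # boolean vector over the columns
dp(mtcars).select(np.number).pd                       # by dtype, via select_dtypes
```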
dppd.non_df_verbs module¶
dppd.single_verbs module¶
- dppd.single_verbs.add_count(df)[source]¶ Verb: Add the cardinality of a row's group to the row as the column 'count'.
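A short sketch of the grouped case (assuming add_count can be chained on a grouped Dppd, as the mutate examples further down suggest):

```
from dppd import dppd
from plotnine.data import mtcars  # any DataFrame with a 'cyl' column works

dp, X = dppd()
# every row gains a 'count' column holding the size of its cyl group
counted = dp(mtcars).group_by('cyl').add_count().pd
```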
- dppd.single_verbs.arrange_DataFrame(df, column_spec, kind='quicksort', na_position='last')[source]¶ Sort a DataFrame based on a column spec.

Wrapper around sort_values.

Parameters:

- column_spec (column specification) – see dppd.single_verbs.parse_column_specification()
- further keyword arguments – see pandas.DataFrame.sort_values()
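For instance, a descending sort via the '-column' spec; a minimal sketch:

```
from dppd import dppd
from plotnine.data import mtcars  # any DataFrame with an 'hp' column works

dp, X = dppd()
# '-hp' inverts the sort order, i.e. highest horsepower first
top3 = dp(mtcars).arrange('-hp').head(3).pd
```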
- dppd.single_verbs.arrange_DataFrameGroupBy(grp, column_spec, kind='quicksort', na_position='last')[source]¶
- dppd.single_verbs.binarize(df, col_spec, drop=True)[source]¶ Convert categorical columns into 'regression columns', i.e. a column X with values a, b, c becomes three boolean columns X-a, X-b, X-c, each True exactly where X held the corresponding value.
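A brief sketch of the idea (the output column labels X-a etc. follow the description above; the verb name binarize is assumed from the function name):

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"X": pd.Categorical(["a", "b", "c", "a"])})
dp, _ = dppd()
# yields boolean columns X-a, X-b, X-c; drop=True removes the original column
wide = dp(df).binarize("X").pd
```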
- dppd.single_verbs.categorize_DataFrame(df, columns=None, categories=<object object>, ordered=None)[source]¶ Turn columns into pandas.Categorical. By default, the categories are ordered as they occur in the column. You can pass False, in which case pd.Categorical will sort alphabetically, or 'natsorted', in which case the categories are passed through natsort.natsorted.
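A minimal sketch, assuming the verb is registered as categorize:

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"grade": ["b", "a", "b", "c"]})
dp, _ = dppd()
# categories ordered as they first occur: b, a, c
cat_df = dp(df).categorize("grade").pd
```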
- dppd.single_verbs.colspec_DataFrame(df, columns, invert=False)[source]¶ Return the columns as defined by your column specification, so you can use colspec in set_index etc.

- columns – column specification, see dppd.single_verbs.parse_column_specification()
- dppd.single_verbs.concat_DataFrame(df, other, axis=0)[source]¶ Verb: Concatenate this DataFrame with one or multiple others.

Wrapper around pandas.concat().

Parameters:

- other (df or [df, df, ...]) –
- axis – join on rows (axis=0) or columns (axis=1)
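A quick sketch, assuming the verb is registered as concat:

```
import pandas as pd
from dppd import dppd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
dp, _ = dppd()
# stacked row-wise, like pandas.concat([a, b], axis=0)
combined = dp(a).concat(b).pd
```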
- dppd.single_verbs.distinct_dataframe(df, column_spec=None, keep='first')[source]¶ Verb: select distinct/unique rows.

Parameters:

- column_spec (column specification) – only consider these columns when deciding on duplication; see dppd.single_verbs.parse_column_specification()
- keep (str) – which instance to keep in case of duplicates (see pandas.DataFrame.duplicated())

Returns: a DataFrame with possibly fewer rows, but unchanged columns.
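For example (the verb name distinct is assumed from the function name):

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"cyl": [4, 4, 6], "hp": [93, 93, 110]})
dp, _ = dppd()
# keep='first' retains the first of each duplicated row
unique_rows = dp(df).distinct().pd
```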
- dppd.single_verbs.distinct_series(df, keep='first')[source]¶ Verb: select distinct values from a Series.

Parameters:

- keep – which instance to keep in case of duplicates (see pandas.Series.duplicated())
- dppd.single_verbs.do(obj, func, *args, **kwargs)[source]¶ Verb: Do anything to any DataFrame, returning new DataFrames.

Apply func to each group, collect the results, and concat them into a new DataFrame together with the group information.

Parameters:

- func (callable) – should take and return a DataFrame

Example:

```
>>> def count_and_count_unique(df):
...     return pd.DataFrame({"count": [len(df)], "unique": [(~df.duplicated()).sum()]})
...
>>> dp(mtcars).select(['cyl', 'hp']).group_by('cyl').do(count_and_count_unique).pd
   cyl  count  unique
0    4     11      10
1    6      7       4
2    8     14       9
```
- dppd.single_verbs.filter_by(obj, filter_arg)[source]¶ Filter a DataFrame.

Parameters:

- filter_arg (Series or array or callable or dict or str) –
  - Series/array with dtype == bool: return rows via .loc[filter_arg]
  - callable: expected to return a Series(dtype=bool)
  - str: a column name -> .loc[X[filter_arg].astype(bool)]
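For instance, filtering on a boolean expression built from the proxy; a minimal sketch:

```
from dppd import dppd
from plotnine.data import mtcars  # any DataFrame with an 'hp' column works

dp, X = dppd()
# keep only rows whose horsepower exceeds 100
strong = dp(mtcars).filter_by(X['hp'] > 100).pd
```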
- dppd.single_verbs.gather(df, key, value, value_var_column_spec=None)[source]¶ Verb: Gather multiple columns and collapse them into two.

This used to be called melting; it is a column-spec-aware forward to pd.melt. Parameter order follows dplyr.

Inverse of dppd.single_verbs.spread.

Example:

```
>>> dp(mtcars).select(['name', 'hp', 'cyl']).gather('variable', 'value', '-name').pd.head()
                name variable  value
0          Mazda RX4       hp    110
1      Mazda RX4 Wag       hp    110
2         Datsun 710       hp     93
3     Hornet 4 Drive       hp    110
4  Hornet Sportabout       hp    175
```
- dppd.single_verbs.mutate_DataFrame(df, **kwargs)[source]¶ Verb: add columns to a DataFrame as defined by kwargs.

Parameters:

- kwargs (scalar, pd.Series, callable, dict) –
  - scalar, pd.Series -> assign as column
  - callable -> call callable(df) and assign the result
  - dict {None: column} -> result of itergroups on a non-grouped DataFrame, for parity with mutate_DataFrameGroupBy

Examples

Add a rank for one column:

```
dp(mtcars).mutate(hp_rank=X.hp.rank())
```

Rank all columns:

```
# dict comprehension for illustrative purposes
dp(mtcars).mutate(**{f"{column}_rank": X[column].rank() for column in X.columns}).pd
# more efficient
dp(mtcars).rank().pd
```

One rank per group using a callback:

```
dp(mtcars).group_by('cyl').mutate(rank=lambda X: X['hp'].rank()).pd
```

add_count variant 1 (see dppd.single_verbs.add_count()):

```
dp(mtcars).group_by('cyl').mutate(count=lambda x: len(x)).pd
```

add_count variant 2:

```
dp(mtcars).group_by('cyl').mutate(count={grp: len(sub_df) for (grp, sub_df) in X.itergroups()}).pd
```
- dppd.single_verbs.mutate_DataFrameGroupBy(grp, **kwargs)[source]¶ Verb: add columns to the DataFrame used in the GroupBy.

Parameters:

- **kwargs (scalar, pd.Series, callable, dict) –
  - scalar, pd.Series -> assign as column
  - callable -> call the callable once per group (sub_df) and assign the result
  - dict {grp_key: scalar_or_series} -> assign this value (these values) for the group. Use in conjunction with dppd.Dppd.itergroups.
- dppd.single_verbs.norm_0_to_1(df, axis=1)[source]¶ Normalize a (numeric) DataFrame so that each row (axis=1) or column (axis=0) ranges from 0 to 1. Useful for PCA, correlation, etc., because the dimensions then become comparable in size.
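A tiny sketch of the effect on a purely numeric frame (verb name norm_0_to_1 assumed from the function name):

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
dp, _ = dppd()
# axis=0: each column is rescaled to span exactly [0, 1]
scaled = dp(df).norm_0_to_1(axis=0).pd
```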
- dppd.single_verbs.norm_zscore(df, axis=1)[source]¶ Apply the z-score transform (X - mu) / std via scipy.stats.zscore on the given axis.
- dppd.single_verbs.pca_dataframe(df, whiten=False, random_state=None)[source]¶ Perform a 2-component PCA using sklearn.decomposition.PCA. Expects samples in rows! Returns a DataFrame {sample, 1st, 2nd} carrying an additional explained_variance_ratio_ attribute.
- dppd.single_verbs.reset_columns_DataFrame(df, new_columns=None)[source]¶ Rename all columns in a DataFrame (and return a copy). Possible new_columns values:
- None: df.columns = list(df.columns)
- List: df.columns = new_columns
- callable: df.columns = [new_columns(x) for x in df.columns]
- str && df.shape[1] == 1: df.columns = [new_columns]
new_columns=None is useful when you were transposing categorical indices and now can no longer assign columns. (Arguably a pandas bug)
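For example, the callable form; a short sketch assuming the verb is registered as reset_columns:

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"hp": [110], "cyl": [6]})
dp, _ = dppd()
# upper-case every column name: HP, CYL
renamed = dp(df).reset_columns(str.upper).pd
```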
- dppd.single_verbs.select_DataFrame(df, columns)[source]¶ Verb: Pick columns from a DataFrame.

An improved variant of df[columns].

Parameters:

- columns (column specification or dict) – see dppd.single_verbs.parse_column_specification(). For the previous 'rename on dict' behaviour, see select_and_rename.
- dppd.single_verbs.select_and_rename_DataFrame(df, columns)[source]¶ Verb: Pick columns from a DataFrame and rename them in the process.

Parameters:

- columns (dict {new_name: 'old_name'}) – select and rename. old_name may be a str or a Series (in which case its .name attribute is used).
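A short sketch, assuming the verb is registered as select_and_rename:

```
from dppd import dppd
from plotnine.data import mtcars  # any DataFrame with an 'hp' column works

dp, X = dppd()
# keep only hp, renamed to horsepower
slim = dp(mtcars).select_and_rename({'horsepower': 'hp'}).pd
```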
- dppd.single_verbs.seperate(df, column, new_names, sep='.', remove=False)[source]¶ Verb: split strings on a separator.

Inverse of unite().
Parameters:

- column (str) – the column to split
- new_names (list of str) – names for the resulting columns
- sep (str) – the separator to split on
- remove (bool) – whether to drop the original column
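A brief sketch (the column and new names are illustrative; the verb keeps the function's spelling, seperate):

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"full": ["Mazda.RX4", "Datsun.710"]})
dp, _ = dppd()
# split 'full' on '.' into two new columns, 'make' and 'model'
split = dp(df).seperate("full", ["make", "model"], sep=".").pd
```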
- dppd.single_verbs.sort_values_DataFrameGroupBy(grp, column_spec, kind='quicksort', na_position='last')[source]¶ Alias for arrange for groupby objects.
- dppd.single_verbs.spread(df, key, value)[source]¶ Verb: Spread a key-value pair across multiple columns.

Inverse of dppd.single_verbs.gather.

Example:

```
>>> df = pd.DataFrame({'key': ['a', 'b'] * 5,
...                    'id': ['c', 'c', 'd', 'd', 'e', 'e', 'f', 'f', 'g', 'g'],
...                    'value': np.random.rand(10)})
>>> dp(df).spread('key', 'value')
>>> dp(df).spread('key', 'value').pd
key id         a         b
0    c  0.650358  0.931324
1    d  0.633024  0.380125
2    e  0.983000  0.367837
3    f  0.989504  0.706933
4    g  0.245418  0.108165
```
- dppd.single_verbs.summarize(obj, *args)[source]¶ Summarize by group.

Parameters:

- *args (tuples) – (column_to_use, function_to_call) or (column_to_use, function_to_call, new_column_name)
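A minimal sketch of the tuple form on a grouped Dppd:

```
import numpy as np
from dppd import dppd
from plotnine.data import mtcars  # any DataFrame with 'cyl' and 'hp' columns works

dp, X = dppd()
# one row per cyl group: the maximum hp, stored as 'max_hp'
summary = dp(mtcars).group_by('cyl').summarize(('hp', np.max, 'max_hp')).pd
```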
- dppd.single_verbs.to_frame_dict(d, **kwargs)[source]¶ pd.DataFrame.from_dict(d, **kwargs), so you can say dp({}).to_frame().
- dppd.single_verbs.transassign(df, **kwargs)[source]¶ Verb: Create a new DataFrame from the columns of the old one.

This means that the index and row count are preserved.
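A quick sketch, assuming transassign behaves like a transmute-style assign (only the newly assigned columns survive, while index and row count stay put):

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
dp, X = dppd()
# result holds only column 'b', but keeps the index x, y, z
new_df = dp(df).transassign(b=X['a'] * 2).pd
```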
- dppd.single_verbs.unite(df, column_spec, sep='_')[source]¶ Verb: string-join multiple columns.

Parameters:

- column_spec (column_spec) – which columns to join; see dppd.single_verbs.parse_column_specification()
- sep (str) – separator to join on
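A short sketch, assuming the verb is registered as unite:

```
import pandas as pd
from dppd import dppd

df = pd.DataFrame({"make": ["Mazda", "Datsun"], "model": ["RX4", "710"]})
dp, _ = dppd()
# string-join make and model with '_'
joined = dp(df).unite(["make", "model"], sep="_").pd
```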