column_booleanizer
Functions
booleanize() – Convert given column in input DataFrame from lists to boolean
broadcast_booleanization() – Broadcast two dataframes so that they have the same booleanized columns.
debooleanize() – Inverse operation of booleanize().
get_bool_columns() – Given a prefix and a separator, get all columns that start with {column_prefix}{separator}
- booleanize(input_df: DataFrame, column_names: str | Iterable[str] | None = None, separator: str = '.', **possible_values: set | None) → DataFrame
Convert the given column(s) of the input DataFrame from lists to boolean columns.
This is mainly used when a particular attribute can have multiple possible values at once.
Each possible value is tested for membership in every row’s list, which yields one boolean column per value.
Finally, the original column is dropped and N new boolean columns are created, named in the form {column_name}{separator}{value}.
- Parameters:
input_df – DataFrame on which to perform the booleanization. The operation is not in place.
column_names – column or columns to convert. After conversion, each converted column is dropped. Can be either a single string or an iterable of strings.
separator – character used to separate the original column name and the value. Defaults to ‘.’.
**possible_values – kwargs giving the set of possible values for each column. Each key must match a column name. If the corresponding value is None, the set is deduced from all values occurring in the lists of that column. Defaults to None.
- Raises:
- Returns:
New dataset with multiple boolean columns in the form
{column_name}{separator}{value}.
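The behaviour described above can be illustrated with a minimal pandas-only sketch. It is not the library's implementation: `booleanize_sketch` is a hypothetical helper that handles a single column, and the ordering of the new columns (sorted by value, appended at the end) is an assumption made for the example.

```python
import pandas as pd


def booleanize_sketch(input_df, column_name, possible_values=None, separator="."):
    """Pandas-only illustration of the documented behaviour (hypothetical helper)."""
    result = input_df.copy()  # the operation is not in place
    lists = result[column_name]
    if possible_values is None:
        # Deduce the possible values from every occurrence in the lists.
        possible_values = {value for row in lists for value in row}
    for value in sorted(possible_values):  # sorted order is an assumption
        # One boolean column per value: is the value inside the row's list?
        new_name = f"{column_name}{separator}{value}"
        result[new_name] = lists.apply(lambda row, v=value: v in row)
    # Drop the original list column once the boolean columns exist.
    return result.drop(columns=column_name)


df = pd.DataFrame({"id": [1, 2, 3], "color": [["red"], ["red", "blue"], []]})
out = booleanize_sketch(df, "color")
print(list(out.columns))  # ['id', 'color.blue', 'color.red']
print(out)
```

The input `df` keeps its `color` column, which mirrors the "not in place" note in the parameter description.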
- broadcast_booleanization(df1: DataFrame, df2: DataFrame, booleanized_columns1: Iterable[str] = (), booleanized_columns2: Iterable[str] = (), ignore_index: bool = False, separator: str = '.') → tuple[DataFrame, DataFrame, set[str]]
Broadcast two dataframes so that they have the same booleanized columns.
Booleanized columns from df1 that are not present in df2 will be created and set to False, and vice versa.
Note: if ignore_index is set to False, the overlapping ids will be set to the value in the other dataframe instead of just False.
- Parameters:
df1 – first dataframe to broadcast
df2 – second dataframe to broadcast
booleanized_columns1 – Columns in df1 that are booleanized. Defaults to ().
booleanized_columns2 – Columns in df2 that are booleanized. Defaults to ().
ignore_index – if set to True, will create boolean columns full of False regardless of index overlap between the two dataframes. If set to False, tries to retrieve the boolean value in one dataframe from the other when creating the column. Defaults to False.
separator – Character used to separate column prefix and value. Defaults to “.”.
- Returns:
tuple containing the updated dataframes df1 and df2 with the same booleanized columns
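A minimal pandas-only sketch of the broadcasting rule, including the effect of ignore_index, might look as follows. `broadcast_sketch` is a hypothetical helper, not the library function: it takes full column names rather than prefixes, and it returns only the two dataframes because the third element of the real return value (a `set[str]`) is not described here.

```python
import pandas as pd


def broadcast_sketch(df1, df2, columns1, columns2, ignore_index=False):
    """Pandas-only illustration of the documented behaviour (hypothetical helper)."""
    out1, out2 = df1.copy(), df2.copy()
    # Copy df1's booleanized columns into df2, then df2's into df1.
    for source, target, columns in ((df1, out2, columns1), (df2, out1, columns2)):
        for column in columns:
            if column in target.columns:
                continue  # already present, nothing to broadcast
            target[column] = False  # default for a missing booleanized column
            if not ignore_index:
                # Rows whose index also exists in the other dataframe take its value.
                shared = target.index.intersection(source.index)
                if len(shared) > 0:
                    target.loc[shared, column] = source.loc[shared, column]
    return out1, out2


df1 = pd.DataFrame({"color.red": [False, True]}, index=[1, 2])
df2 = pd.DataFrame({"color.blue": [True, True]}, index=[2, 3])

b1, b2 = broadcast_sketch(df1, df2, ["color.red"], ["color.blue"])
print(b1["color.blue"].tolist())  # [False, True]: id 2 is looked up in df2
print(b2["color.red"].tolist())   # [True, False]: id 2 is looked up in df1

i1, i2 = broadcast_sketch(df1, df2, ["color.red"], ["color.blue"], ignore_index=True)
print(i1["color.blue"].tolist())  # [False, False]
```

Only id 2 exists in both dataframes, so it is the only row whose new column can differ from False.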
- debooleanize(input_df: DataFrame, column_prefixes: str | Iterable[str], separator: str = '.') → DataFrame
Inverse operation of booleanize(). Take all columns that start with {column_prefix}{separator} and, assuming they are all boolean columns, convert them into a single column of list values.
Note: the column order is preserved; the debooleanized column is inserted at the position previously occupied by the booleanized columns.
- Parameters:
input_df – Input DataFrame we will take the columns from.
column_prefixes – Name of the column prefix (or prefixes) used to retrieve the boolean columns. This is also the name of the resulting column (or columns).
separator – Character used to separate column prefix and value. Defaults to “.”.
- Raises:
TypeError – all columns with given prefix must be of boolean dtype
- Returns:
Resulting DataFrame, with all boolean columns whose names correspond to the prefix dropped and a single column of lists added.
- Return type:
pd.DataFrame
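The inverse operation can be sketched in plain pandas as below. `debooleanize_sketch` is a hypothetical helper that handles one prefix and assumes at least one matching column; it illustrates the described behaviour and is not the library's code.

```python
import pandas as pd


def debooleanize_sketch(input_df, column_prefix, separator="."):
    """Pandas-only illustration of the documented behaviour (hypothetical helper)."""
    start = f"{column_prefix}{separator}"
    bool_columns = [c for c in input_df.columns if c.startswith(start)]
    # Assumption: at least one column matches the prefix.
    for column in bool_columns:
        if not pd.api.types.is_bool_dtype(input_df[column]):
            raise TypeError(f"Column {column!r} is not of boolean dtype")
    values = [c[len(start):] for c in bool_columns]
    flags = input_df[bool_columns].to_numpy()
    # For each row, keep the values whose boolean column is True.
    lists = [[v for v, flag in zip(values, row) if flag] for row in flags]
    # Insert the list column where the first booleanized column used to be.
    position = input_df.columns.get_loc(bool_columns[0])
    result = input_df.drop(columns=bool_columns)
    result.insert(position, column_prefix, pd.Series(lists, index=input_df.index))
    return result


df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "color.blue": [False, True, False],
        "color.red": [True, True, False],
        "size": [10, 20, 30],
    }
)
out = debooleanize_sketch(df, "color")
print(list(out.columns))      # ['id', 'color', 'size']
print(out["color"].tolist())  # [['red'], ['blue', 'red'], []]
```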
- get_bool_columns(input_df: DataFrame, column_prefix: str, separator: str = '.') → list[str]
Given a prefix and a separator, get all columns that start with {column_prefix}{separator}.
This is used in e.g. debooleanize().
- Parameters:
input_df – DataFrame to get the columns from
column_prefix – Name of column prefix to retrieve boolean columns.
separator – Character used to separate column prefix and value. Defaults to “.”.
- Raises:
ValueError – Raised when columns following the pattern are not boolean
- Returns:
List of columns that follow the pattern and will be used to construct the list.
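The prefix matching and dtype check can be sketched as follows. `get_bool_columns_sketch` is a hypothetical helper written for illustration only; the error message and the use of `pd.api.types.is_bool_dtype` are assumptions, not details taken from the library.

```python
import pandas as pd


def get_bool_columns_sketch(input_df, column_prefix, separator="."):
    """Pandas-only illustration of the documented behaviour (hypothetical helper)."""
    start = f"{column_prefix}{separator}"
    # Keep only columns named {column_prefix}{separator}<something>.
    matching = [c for c in input_df.columns if str(c).startswith(start)]
    not_bool = [c for c in matching if not pd.api.types.is_bool_dtype(input_df[c])]
    if not_bool:
        raise ValueError(f"Columns matching the pattern are not boolean: {not_bool}")
    return matching


df = pd.DataFrame(
    {
        "color.red": [True, False],
        "color.blue": [False, True],
        "colors": ["a", "b"],  # shares the prefix text but lacks the separator
        "size.cm": [10, 20],   # matches the "size" prefix but is not boolean
    }
)
columns = get_bool_columns_sketch(df, "color")
print(columns)  # ['color.red', 'color.blue']
```

Calling the helper with the prefix `"size"` raises ValueError, since `size.cm` follows the pattern but holds integers.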