column_booleanizer

Functions

booleanize

Convert the given column(s) of the input DataFrame from lists to boolean columns.

broadcast_booleanization

Broadcast two dataframes so that they have the same booleanized columns.

debooleanize

Inverse operation of booleanize().

get_bool_columns

Given a prefix and a separator, get all columns that start with {column_prefix}{separator}

booleanize(input_df: DataFrame, column_names: str | Iterable[str] | None = None, separator: str = '.', **possible_values: set | None) -> DataFrame

Convert the given column(s) of the input DataFrame from lists to boolean columns.

This is mainly used when a particular attribute can have multiple possible values at once.

Each possible value is tested for membership in every row’s list, which yields one boolean column per value.

The original column is then dropped and replaced by N new boolean columns named {column_name}{separator}{value}.

Parameters:
  • input_df – DataFrame to booleanize. The operation is not in place: a new DataFrame is returned.

  • column_names – Column or columns to convert. Each converted column is dropped from the returned DataFrame. Can be either a single string or an iterable of strings.

  • separator – character used to separate original column and value. Defaults to ‘.’

  • **possible_values – Keyword arguments giving the set of possible values for each column. Each key must match a column name. If the corresponding value is None, the possible values are deduced from all values occurring in that column’s lists. Defaults to None.

Raises:
  • KeyError – The given column_name must be in the columns of input_df

  • TypeError – When the possible values of a column have to be deduced, every value in that column must be a non-string iterable.

Returns:

New dataset with multiple boolean columns in the form {column_name}{separator}{value}.
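The behaviour described above can be illustrated with plain pandas. The sketch below is not the library's implementation: `booleanize_sketch` is a hypothetical helper that handles a single column, and appending the new columns at the end in sorted order is an assumption made only for this example.

```python
import pandas as pd


def booleanize_sketch(df, column_name, separator=".", possible_values=None):
    """Illustrative stand-in for booleanize() on one column (hypothetical)."""
    if column_name not in df.columns:
        raise KeyError(column_name)
    if possible_values is None:
        # Deduce the possible values from every list in the column.
        possible_values = set()
        for row in df[column_name]:
            possible_values.update(row)
    # drop() returns a new DataFrame, so the input is left untouched.
    out = df.drop(columns=[column_name])
    for value in sorted(possible_values):
        new_name = f"{column_name}{separator}{value}"
        out[new_name] = [value in row for row in df[column_name]]
    return out


df = pd.DataFrame({"name": ["a", "b"], "colors": [["red", "blue"], ["red"]]})
result = booleanize_sketch(df, "colors")
print(result)
```

Here the `colors` column is replaced by `colors.blue` and `colors.red`, while `df` itself still holds the original list column.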

broadcast_booleanization(df1: DataFrame, df2: DataFrame, booleanized_columns1: Iterable[str] = (), booleanized_columns2: Iterable[str] = (), ignore_index: bool = False, separator: str = '.') -> tuple[DataFrame, DataFrame, set[str]]

Broadcast two dataframes so that they have the same booleanized columns.

Booleanized columns from df1 that are not present in df2 are created in df2 and set to False, and vice versa.

Note: if ignore_index is set to False, rows whose index is present in both dataframes take the value from the other dataframe instead of just False.

Parameters:
  • df1 – first dataframe to broadcast

  • df2 – second dataframe to broadcast

  • booleanized_columns1 – Columns in df1 that are booleanized. Defaults to ().

  • booleanized_columns2 – Columns in df2 that are booleanized. Defaults to ().

  • ignore_index – if set to True, will create boolean columns full of False regardless of index overlap between the two dataframes. If set to False, tries to retrieve boolean value in one dataframe from the other when creating the column. Defaults to False.

  • separator – Character used to separate column prefix and value. Defaults to “.”.

Returns:

tuple containing updated dataframes df1 and df2 with the same booleanized columns
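As a rough illustration of the `ignore_index` behaviour, the following sketch mimics the broadcasting with plain pandas. `add_missing` and `broadcast_sketch` are hypothetical helpers, the sketch assumes unique index labels, and it returns only the two dataframes (the third element of the documented return tuple is not modelled).

```python
import pandas as pd


def add_missing(target, source, missing, ignore_index):
    """Add the boolean columns listed in `missing` to a copy of `target`."""
    out = target.copy()
    for col in missing:
        if ignore_index:
            # Fill with False regardless of index overlap.
            out[col] = False
        else:
            # Reuse the other frame's value on shared index labels,
            # and fall back to False everywhere else.
            out[col] = source[col].reindex(out.index, fill_value=False).astype(bool)
    return out


def broadcast_sketch(df1, df2, cols1, cols2, ignore_index=False):
    """Illustrative stand-in for broadcast_booleanization() (hypothetical)."""
    missing_in_1 = [c for c in cols2 if c not in df1.columns]
    missing_in_2 = [c for c in cols1 if c not in df2.columns]
    return (
        add_missing(df1, df2, missing_in_1, ignore_index),
        add_missing(df2, df1, missing_in_2, ignore_index),
    )


left = pd.DataFrame({"tag.a": [True, False]}, index=[0, 1])
right = pd.DataFrame({"tag.b": [True, True]}, index=[1, 2])

new_left, new_right = broadcast_sketch(left, right, ["tag.a"], ["tag.b"])
flat_left, flat_right = broadcast_sketch(
    left, right, ["tag.a"], ["tag.b"], ignore_index=True
)
```

Index label 1 exists in both frames, so with `ignore_index=False` the new `tag.b` column in `new_left` picks up the value from `right` for that row, whereas `flat_left` gets False in every row.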

debooleanize(input_df: DataFrame, column_prefixes: str | Iterable[str], separator: str = '.') -> DataFrame

Inverse operation of booleanize(). Take all columns that start with {column_prefix}{separator} and, assuming they are all boolean columns, convert them into a single column of list values.

Note

The column order is preserved: the debooleanized column is inserted at the position previously occupied by the booleanized columns.

Parameters:
  • input_df – Input DataFrame we will take the columns from.

  • column_prefixes – Prefix (or prefixes) used to find the boolean columns. Each prefix is also used as the name of the resulting column.

  • separator – Character used to separate column prefix and value. Defaults to “.”.

Raises:

TypeError – all columns with given prefix must be of boolean dtype

Returns:

Resulting DataFrame, in which all boolean columns whose names match the prefix are dropped and a single column of lists is added.

Return type:

pd.DataFrame
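The inverse operation can be sketched in the same spirit. `debooleanize_sketch` below is a hypothetical helper for a single prefix, not the library's code. It assumes at least one matching column exists and that the matching columns sit next to each other.

```python
import pandas as pd


def debooleanize_sketch(df, prefix, separator="."):
    """Illustrative stand-in for debooleanize() on one prefix (hypothetical)."""
    start = f"{prefix}{separator}"
    bool_cols = [c for c in df.columns if str(c).startswith(start)]
    for col in bool_cols:
        if not pd.api.types.is_bool_dtype(df[col]):
            raise TypeError(f"column {col!r} is not of boolean dtype")
    # Remember where the first booleanized column was, to keep column order.
    position = list(df.columns).index(bool_cols[0])
    value_names = [c[len(start):] for c in bool_cols]
    # For each row, keep the value names whose flag is True.
    lists = [
        [name for name, flag in zip(value_names, row) if flag]
        for row in df[bool_cols].to_numpy()
    ]
    out = df.drop(columns=bool_cols)
    # Wrap in a Series so pandas stores one list per row.
    out.insert(position, prefix, pd.Series(lists, index=df.index))
    return out


df = pd.DataFrame(
    {
        "name": ["a", "b"],
        "colors.blue": [True, False],
        "colors.red": [True, True],
        "size": [1, 2],
    }
)
restored = debooleanize_sketch(df, "colors")
print(restored)
```

The `colors` column ends up between `name` and `size`, where the two booleanized columns used to be.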

get_bool_columns(input_df: DataFrame, column_prefix: str, separator: str = '.') -> list[str]

Given a prefix and a separator, get all columns that start with {column_prefix}{separator}

This is used, for example, by debooleanize().

Parameters:
  • input_df – DataFrame to get the columns from

  • column_prefix – Name of column prefix to retrieve boolean columns.

  • separator – Character used to separate column prefix and value. Defaults to “.”.

Raises:

ValueError – Raised when a column matching the pattern is not boolean

Returns:

List of columns that follow the pattern and will be used to construct the list.
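The prefix matching can be pictured as follows. `get_bool_columns_sketch` is a hypothetical re-implementation for illustration only. It shows why the separator matters: a column that merely shares the prefix text is not selected.

```python
import pandas as pd


def get_bool_columns_sketch(df, column_prefix, separator="."):
    """Illustrative stand-in for get_bool_columns() (hypothetical)."""
    start = f"{column_prefix}{separator}"
    matches = [c for c in df.columns if str(c).startswith(start)]
    non_bool = [c for c in matches if not pd.api.types.is_bool_dtype(df[c])]
    if non_bool:
        raise ValueError(f"non-boolean columns match the pattern: {non_bool}")
    return matches


df = pd.DataFrame(
    {
        "colors.blue": [True, False],
        "colors.red": [False, True],
        "colorspace": ["rgb", "cmyk"],  # shares the prefix text, no separator
    }
)
selected = get_bool_columns_sketch(df, "colors")
print(selected)
```

Only `colors.blue` and `colors.red` are returned, because `colorspace` does not start with `colors.`.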