grouper

Set of functions to construct groups in a dataset and compute analytics per group, e.g. during evaluation

Module Attributes

group

Type alias to define a group

group_list

Group list is either a group or an iterable of groups

Functions

cut_group

Cut a dataframe according to one of its column values and criteria. See pandas.cut(), pandas.qcut()

get_group_names

From a list of groups, get the list of associated names.

group_relational_data

Create groups that will be applied on input_data with the pandas.DataFrame.groupby() method.

groups_to_list

Convert a single group or Sequence of groups to a list of groups (possibly with one element)

make_pandas_compatible

Construct, from a group, the object that will be passed to pandas' groupby method.

Classes

ContinuousGroup(name[, bins, qcut, log, ...])

Data class encapsulating the parameters given to the cutting function of pandas, typically used to group continuous data into a limited number of groups, similarly to a histogram.

class ContinuousGroup(name: str, bins: float | list[float] = 10, qcut: bool = False, log: bool = False, label_type: str = 'intervals')

Data class encapsulating the parameters given to the cutting function of pandas, typically used to group continuous data into a limited number of groups, similarly to a histogram.

Depending on the attributes, it will use either pandas.cut() or pandas.qcut() to assign a label to each row of your dataframe.

bins: float | list[float] = 10

Value given to the bins parameter of the pandas function. Can be either a float (for the number of bins), or a list of values that will be used as actual bins. Note that in the case of pandas.qcut(), only a float value makes sense for this attribute.

label_type: str = 'intervals'

What type of label to give to each group given by the cutting function.

Can be either:
  • “intervals” (default): pandas.Interval object usually given as Series values by pandas.cut() and pandas.qcut()

  • “mid”: mid point between the two bins of each interval

  • “mean”: mean value of data points comprised in a given interval

  • “median”: median value of data points comprised in a given interval
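The four label types can be illustrated directly with pandas. The snippet below is only a sketch of what each option means, not the library's implementation; the toy column and the bin edges are made up for the illustration.

```python
import pandas as pd

# Toy data, made up for this illustration.
df = pd.DataFrame({"size": [1.0, 2.0, 6.0, 11.0, 12.0, 19.0]})

# "intervals": the pandas.Interval labels returned by pandas.cut()
intervals = pd.cut(df["size"], bins=[0, 10, 20])

# "mid": mid point between the two edges of each interval
mid = pd.Series([interval.mid for interval in intervals], index=df.index)

# "mean" / "median": statistic of the data points falling in each interval
per_interval = df["size"].groupby(intervals, observed=True)
mean = per_interval.transform("mean")
median = per_interval.transform("median")

print(pd.DataFrame({"intervals": intervals, "mid": mid, "mean": mean, "median": median}))
```

Note that "mid" depends only on the bin edges, while "mean" and "median" depend on the data inside each bin.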

log: bool = False

When using cut (and not qcut), whether to space the bins equally in linear space or in log space. In log space, bins for lower values are closer to each other.
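As a rough illustration of log-spaced bins, the sketch below builds the edges with numpy.geomspace() and passes them to pandas.cut(). Using geomspace, and rejecting non-positive values, are assumptions of this sketch; the library may compute its edges differently.

```python
import numpy as np
import pandas as pd

# Toy data spanning several orders of magnitude.
values = pd.Series([1.0, 3.0, 30.0, 300.0, 1000.0], name="size")
n_bins = 3

# Assumption of this sketch: a logarithm needs strictly positive values.
if (values <= 0).any():
    raise ValueError("log-spaced bins require strictly positive values")

# Edges equally spaced in log space, approximately [1, 10, 100, 1000].
log_edges = np.geomspace(values.min(), values.max(), n_bins + 1)
log_groups = pd.cut(values, bins=log_edges, include_lowest=True)

# Same number of bins, equally spaced in linear space, for comparison.
linear_groups = pd.cut(values, bins=n_bins)

print(log_groups.cat.codes.tolist())     # bin index of each value, log spacing
print(linear_groups.cat.codes.tolist())  # bin index of each value, linear spacing
```

With linear spacing, almost all values land in the first bin; log spacing spreads them across the bins.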

name: str

Name of the column to use the cutting function on

qcut: bool = False

Whether to use pandas.qcut() or pandas.cut(). qcut designs the bins so that each interval contains the same number of samples, while cut designs the bins so that the first and last bin edges are the minimum and maximum values of the considered column, with all bins equally spaced (similar to numpy.linspace()).
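The difference between the two pandas functions shows up clearly on skewed data. This is a minimal sketch calling pandas directly, on a made-up column:

```python
import pandas as pd

# Skewed toy data: most values are small, one is large.
sizes = pd.Series([1, 2, 3, 4, 5, 6, 7, 100], name="size")

# pandas.cut(): 4 bins of equal width between the minimum and the maximum.
equal_width = pd.cut(sizes, bins=4)

# pandas.qcut(): 4 bins holding the same number of samples.
equal_count = pd.qcut(sizes, q=4)

print(equal_width.cat.codes.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_count.cat.codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With cut, seven of the eight samples share the first bin and two bins stay empty; with qcut, each bin receives two samples but the bins have very different widths.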

to_dict() → dict[str, str | float | list[float] | bool]

Serialize the ContinuousGroup object into a dictionary that can then be used as kwargs for cut_group()

Returns:

Dictionary containing parameters to be read by cut_group()

cut_group(data: Series | DataFrame, group_name: str | None = None, bins: int | Iterable[float] = 10, label_type: str = 'intervals', log: bool = False, qcut: bool = False) → Series

Cut a dataframe according to one of its column values and criteria. See pandas.cut(), pandas.qcut()

Parameters:
  • data – Series or DataFrame to extract the column from

  • group_name – name of the column to extract

  • bins – parameter used by both pandas.cut() and pandas.qcut(). Namely, it can be an int to describe the number of bins, or a list of floats, to describe either the actual bin edges for pandas.cut() or the quantile edges for pandas.qcut()

  • label_type

    what type of label to give to each group given by the cutting function. Can be either:

    • “intervals” (default): pandas.Interval object usually given as Series values by pandas.cut() and pandas.qcut()

    • “mid”: mid-point between the two bins of each interval

    • “mean”: mean value of data points comprised in a given interval

    • “median”: median value of data points comprised in a given interval

  • log – Whether to use logarithmic scale or not, when bins is an integer. Useful when the values are not uniformly distributed. Defaults to False.

  • qcut – Whether to use pandas.qcut() instead of pandas.cut(). See corresponding documentation for the differences. TL;DR, pandas.qcut() is based on quantiles (same number of occurrences in each bin) while pandas.cut() is based on values (same interval length for each bin). Defaults to False.

Raises:

ValueError – Raises an error when log option is selected but the extracted column has negative values

Returns:

Series with the same length as data, describing a mapping from id to bin. Bin labels are Interval Indices describing the upper and lower bound. See pandas.IntervalIndex

get_group_names(groups: str | ContinuousGroup | Sequence[str | ContinuousGroup]) → list[str]

From a list of groups, get the list of associated names.

Parameters:

groups – single group or Sequence of groups to extract the names from.

Returns:

Names of given groups.

group = str | lours.utils.grouper.ContinuousGroup

Type alias to define a group

Group is either

  • the name of a column (for discrete groups, such as category_id)

  • a ContinuousGroup object to divide continuous data into a given number of groups, similar to histograms.

These parameters will be used by the function lours.utils.grouper.cut_group()

Examples

Discrete group:

"size"

Continuous group:

ContinuousGroup(name="size", bins=10, log=False, qcut=True)

Continuous group with bins:

ContinuousGroup(name="size", bins=[0, 10, 20, 30], log=False, qcut=False)

group_list = str | lours.utils.grouper.ContinuousGroup | collections.abc.Sequence[str | lours.utils.grouper.ContinuousGroup]

Group list is either a group or an iterable of groups

group_relational_data(input_data: DataFrame, groups: str | ContinuousGroup | Sequence[str | ContinuousGroup], root_data: DataFrame | None = None, key_to_root: str = 'image_id') → tuple[dict[str, str | Series], list[str], list[str]]

Create groups that will be applied on input_data with the pandas.DataFrame.groupby() method. It can be used with a root_data relational DataFrame containing values that we might want to group by, provided input_data contains a column referencing rows in root_data.

Parameters:
  • input_data – DataFrame to group

  • groups – groups to apply to input_data or root_data. Can be a simple string in the case of categorical data, or a ContinuousGroup. See group.

  • root_data – DataFrame containing information input_data may refer to. Defaults to None.

  • key_to_root – column name in input_data for the key to root_data. Defaults to “image_id”.

Returns:

  1. A dictionary with the created groups and their name as a key. The groups can be directly used in an input_data.groupby call

  2. A list of all category groups, where different values are independent from each other

  3. A list of all continuous groups, on which different values represent ranges of a continuous value, constructing a discretized histogram

Note that the two lists together should be as long as the group dictionary, and their elements must refer to all the actual keys of the dictionary.

Return type:

3 different objects are returned
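To make the relational case concrete, here is a sketch in plain pandas of the kind of grouping described above. It does not call the library: the annotations and images frames, their columns, and the assumption that root_data is indexed by the ids stored in the key column are all made up for the illustration.

```python
import pandas as pd

# Hypothetical root data, indexed by image id.
images = pd.DataFrame(
    {"width": [640, 1920]},
    index=pd.Index([10, 11], name="image_id"),
)
# Hypothetical input data, each row pointing to an image through "image_id".
annotations = pd.DataFrame(
    {"image_id": [10, 10, 11], "category_id": [1, 2, 1], "area": [50.0, 80.0, 400.0]}
)

# Continuous group on a root column: look the value up through the key column,
# then cut it into bins. The result is aligned with the rows of annotations.
width_per_row = annotations["image_id"].map(images["width"]).rename("width")
width_group = pd.cut(width_per_row, bins=[0, 1000, 2000])

# Dictionary of groups (name -> column name or Series), plus the two name lists.
groups = {"category_id": "category_id", "width": width_group}
category_groups = ["category_id"]
continuous_groups = ["width"]

# Every value of the dictionary can be given to groupby as is.
counts = annotations.groupby(list(groups.values()), observed=True).size()
print(counts)
```

The two name lists together cover every key of the dictionary, as required by the note above.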

groups_to_list(groups: str | ContinuousGroup | Sequence[str | ContinuousGroup]) → list[str | ContinuousGroup]

Convert a single group or Sequence of groups to a list of groups (possibly with one element)

Parameters:

groups – Sequence of groups or single group to convert

Returns:

Actual list of groups, more easily handled by other functions.
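The normalization performed by groups_to_list(), and the name extraction of get_group_names(), can be sketched in a few lines. The code below is an illustration only: ContinuousGroupSketch is a hypothetical stand-in for ContinuousGroup, and both functions are simplified guesses at the behavior described here, not the library's code.

```python
from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass
from typing import Union


@dataclass
class ContinuousGroupSketch:
    """Hypothetical stand-in for ContinuousGroup, reduced to two fields."""

    name: str
    bins: Union[float, list] = 10


Group = Union[str, ContinuousGroupSketch]


def groups_to_list_sketch(groups: Union[Group, Sequence[Group]]) -> list[Group]:
    # A str is itself a Sequence, so single groups must be tested first.
    if isinstance(groups, (str, ContinuousGroupSketch)):
        return [groups]
    return list(groups)


def get_group_names_sketch(groups: Union[Group, Sequence[Group]]) -> list[str]:
    # A discrete group is its own name; a continuous group carries a name field.
    return [
        g if isinstance(g, str) else g.name
        for g in groups_to_list_sketch(groups)
    ]


print(groups_to_list_sketch("size"))  # ['size']
print(get_group_names_sketch(["category_id", ContinuousGroupSketch(name="size")]))
# ['category_id', 'size']
```

Testing for str before the generic Sequence case matters: iterating over a string would otherwise split a column name into single characters.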

make_pandas_compatible(data: DataFrame, g: str) → tuple[str, str, Literal[True]]
make_pandas_compatible(data: DataFrame, g: ContinuousGroup, root_data: DataFrame | None = None, key_to_root: str = 'image_id') → tuple[str, Series, Literal[False]]
make_pandas_compatible(data: DataFrame, g: str | ContinuousGroup, root_data: DataFrame | None = None, key_to_root: str = 'image_id') → tuple[str, str | Series, bool]

Construct, from a group, the object that will be passed to pandas’ groupby method.

  • In the case it’s only a name, keep it like that

  • Otherwise, we need to construct an index of data cut according to the given bins. This will create a pandas.Series with categorical data

Parameters:
  • data – input DataFrame, must contain the column considered in group g

  • g – group depicting a column from data with potential bins. See group

  • root_data – Potential root data, where some ids in data refer to particular columns in root_data. Defaults to None.

  • key_to_root – column containing root_data row ids. Defaults to “image_id”.

Returns:

  1. group name

  2. group that can be understood by pandas’ groupby method. Can be a simple string referring to a column, or a pandas.Series with categorical data

  3. boolean indicating whether the group is categorical (on which different values are independent of each other) or continuous (on which different values represent ranges of a continuous value, constructing a discretized histogram)

Return type:

Tuple with the 3 following values