grouper
Set of functions to construct groups in a dataset and compute analytics per group, e.g. during evaluation
Module Attributes
group | Type alias to define a group
group_list | Group list is either a group or an iterable of groups
Functions
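cut_group | Cut a dataframe according to one of its column values and criteria. See pandas.cut(), pandas.qcut()
get_group_names | From a list of groups, get the list of associated names.
group_relational_data | Create groups that will be applied on input_data with the pandas.DataFrame.groupby() method.
groups_to_list | Convert a single group or Sequence of groups to a list of groups (possibly with one element)
make_pandas_compatible | Construct group from group that will be used for pandas' groupby method.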
Classes
ContinuousGroup | Data Class to encapsulate information to give to the cutting function of pandas as parameters, typically used to group continuous data by a limited number of groups, similarly to a histogram.
- class ContinuousGroup(name: str, bins: float | list[float] = 10, qcut: bool = False, log: bool = False, label_type: str = 'intervals')
Data Class to encapsulate information to give to the cutting function of pandas as parameters, typically used to group continuous data by a limited number of groups, similarly to a histogram.
Depending on the attributes, it will use either pandas.cut() or pandas.qcut() to give a particular label to each row of your dataframe.
- bins: float | list[float] = 10
Value given to the bins parameter of the pandas function. Can be either a float (the number of bins) or a list of values that will be used as the actual bin edges. Note that in the case of pandas.qcut(), only a float makes sense for this attribute.
- label_type: str = 'intervals'
What type of label to give to each group given by the cutting function.
Can be either:
- "intervals" (default): pandas.Interval object, usually given as Series values by pandas.cut() and pandas.qcut()
- "mid": midpoint between the two edges of each interval
- "mean": mean value of the data points contained in a given interval
- "median": median value of the data points contained in a given interval
- log: bool = False
When using cut (and not qcut), whether to space the bins equally in linear space or in log space. In log space, bins for lower values are closer to each other.
- qcut: bool = False
Whether to use pandas.qcut() or pandas.cut(). Qcut will design the bins so that each interval contains the same number of samples, while cut will design the bins so that the first and last bin edges are the minimum and maximum values of the considered column, and all the bins are equally spaced (similar to numpy.linspace()).
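To make the difference between these binning strategies concrete, here is a minimal sketch that calls pandas and NumPy directly rather than going through ContinuousGroup. The sample values and the counts_per_bin helper are made up for illustration. Using numpy.geomspace to build log-spaced edges is an assumption about how log=True could be reproduced, not a description of the library's internals.

```python
import numpy as np
import pandas as pd

# Made-up, heavily skewed sample spanning three orders of magnitude.
values = pd.Series([1, 2, 3, 4, 10, 20, 30, 40, 100, 1000], name="size")

# Equal-width bins between min and max (what qcut=False, log=False describes).
linear = pd.cut(values, bins=2)

# Equal-population bins based on quantiles (what qcut=True describes).
# Note that pandas.qcut() names this parameter `q`, not `bins`.
quantile = pd.qcut(values, q=2)

# Assumed equivalent of log=True: edges evenly spaced in log space,
# so they sit closer together for low values.
log_edges = np.geomspace(values.min(), values.max(), num=3)
logarithmic = pd.cut(values, bins=log_edges, include_lowest=True)


def counts_per_bin(binned: pd.Series) -> list[int]:
    """Hypothetical helper: count rows per bin, in bin order."""
    codes = binned.cat.codes.to_numpy()
    return np.bincount(codes, minlength=len(binned.cat.categories)).tolist()


print(counts_per_bin(linear))       # [9, 1]
print(counts_per_bin(quantile))     # [5, 5]
print(counts_per_bin(logarithmic))  # [7, 3]
```

On skewed data, equal-width bins put almost every row in the first bin, quantile bins balance the counts, and log-spaced bins fall in between.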
- cut_group(data: Series | DataFrame, group_name: str | None = None, bins: int | Iterable[float] = 10, label_type: str = 'intervals', log: bool = False, qcut: bool = False) → Series
Cut a dataframe according to one of its column values and criteria. See pandas.cut(), pandas.qcut()
- Parameters:
data – DataFrame to extract the column from
group_name – name of the column to extract
bins – parameter used by both pandas.cut() and pandas.qcut(). Namely, it can be an int to describe the number of bins, or a list of floats to describe either the actual bin edges for pandas.cut() or the quantile edges for pandas.qcut()
label_type – what type of label to give to each group given by the cutting function. Can be either:
- "intervals" (default): pandas.Interval object, usually given as Series values by pandas.cut() and pandas.qcut()
- "mid": midpoint between the two edges of each interval
- "mean": mean value of the data points contained in a given interval
- "median": median value of the data points contained in a given interval
log – Whether to use logarithmic scale or not, when bins is an integer. Useful when the values are not uniformly distributed. Defaults to False.
qcut – Whether to use pandas.qcut() instead of pandas.cut(). See corresponding documentation for the differences. TL;DR, pandas.qcut() is based on quantiles (same number of occurrences in each bin) while pandas.cut() is based on values (same interval length for each bin). Defaults to False.
- Raises:
ValueError – Raises an error when log option is selected but the extracted column has negative values
- Returns:
Series with the same length as data, describing a mapping from id to bin. With the default label_type, bin labels are pandas.Interval objects describing the lower and upper bound of each bin. See pandas.IntervalIndex
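The four label_type options can be illustrated with plain pandas. The sketch below is not the code of cut_group; it only shows, on made-up data with hand-picked bin edges, what each kind of label could look like for the same binning.

```python
import pandas as pd

# Made-up data: two rows fall in (0, 5], three rows fall in (5, 10].
values = pd.Series([1.0, 3.0, 6.0, 6.0, 9.0], name="size")

# "intervals": the raw output of pandas.cut(), one pandas.Interval per row.
intervals = pd.cut(values, bins=[0, 5, 10])

# "mid": midpoint of the interval each row falls into.
mid = intervals.map(lambda interval: interval.mid).astype(float)

# "mean" / "median": statistic of the data points that share an interval,
# broadcast back to every row of that interval.
mean = values.groupby(intervals, observed=True).transform("mean")
median = values.groupby(intervals, observed=True).transform("median")

print(mid.tolist())     # [2.5, 2.5, 7.5, 7.5, 7.5]
print(mean.tolist())    # [2.0, 2.0, 7.0, 7.0, 7.0]
print(median.tolist())  # [2.0, 2.0, 6.0, 6.0, 6.0]
```

Unlike "intervals" and "mid", the "mean" and "median" labels depend on the data inside each bin, not only on the bin edges.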
- get_group_names(groups: str | ContinuousGroup | Sequence[str | ContinuousGroup]) → list[str]
From a list of groups, get the list of associated names.
- Parameters:
groups – single group or Sequence of groups to extract the names from.
- Returns:
Names of given groups.
- group = str | lours.utils.grouper.ContinuousGroup
Type alias to define a group
Group is either
- the name of a column (for discrete groups, such as category_id)
- a ContinuousGroup object to divide continuous data into a given number of groups, similar to histograms. These parameters will be used for the function lours.utils.grouper.cut_group()
Examples
Discrete group:
"size"
Continuous group:
ContinuousGroup(name="size", bins=10, log=False, qcut=True)
Continuous group with bins:
ContinuousGroup(name="size", bins=[0, 10, 20, 30], log=False, qcut=False)
- group_list = str | lours.utils.grouper.ContinuousGroup | collections.abc.Sequence[str | lours.utils.grouper.ContinuousGroup]
Group list is either a group or an iterable of groups
- group_relational_data(input_data: DataFrame, groups: str | ContinuousGroup | Sequence[str | ContinuousGroup], root_data: DataFrame | None = None, key_to_root: str = 'image_id') → tuple[dict[str, str | Series], list[str], list[str]]
Create groups that will be applied on input_data with the pandas.DataFrame.groupby() method. Can be used with a root_data relational DataFrame containing values that we might want to group by, provided input_data contains a column referencing rows in root_data.
- Parameters:
input_data – DataFrame to group
groups – groups to apply to input_data or root_data. Can be a simple string in the case of categorical data, or a ContinuousGroup. See group.
root_data – DataFrame containing information input_data may refer to. Defaults to None.
key_to_root – column name in input_data for the key to root_data. Defaults to "image_id".
- Returns:
Three objects are returned:
- A dictionary with the created groups, keyed by their names. The groups can be used directly in an input_data.groupby call
- A list of all categorical groups, whose values are independent from each other
- A list of all continuous groups, whose values represent ranges of a continuous value, constructing a discretized histogram
Note that the two lists together should be as long as the group dictionary, and their elements must refer to all the actual keys of the dictionary.
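The relational case can be sketched with plain pandas. The tables, column names and bin edges below are hypothetical. The code only illustrates the idea of grouping rows of one table by a value stored in another table, and does not reproduce group_relational_data itself.

```python
import pandas as pd

# Hypothetical root table (one row per image) and child table (annotations).
images = pd.DataFrame(
    {"width": [100, 200, 800]},
    index=pd.Index([10, 11, 12], name="id"),
)
annotations = pd.DataFrame(
    {"image_id": [10, 10, 11, 12], "category_id": [1, 2, 1, 1]}
)

# Bring the root-level value down to annotation level through the key column.
width_per_annotation = annotations["image_id"].map(images["width"])

# Bin the continuous value; the result stays aligned with `annotations`.
width_bins = pd.cut(width_per_annotation, bins=[0, 500, 1000]).rename("width")

# Dictionary of groups keyed by name: a column name for the categorical group,
# an aligned Series for the continuous one.
group_dict = {"category_id": "category_id", "width": width_bins}

# Both kinds of keys can be mixed in a single groupby call.
counts = annotations.groupby(list(group_dict.values()), observed=True).size()
print(counts.tolist())  # [2, 1, 1]
```

Here "category_id" would belong to the list of categorical groups and "width" to the list of continuous groups.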
- groups_to_list(groups: str | ContinuousGroup | Sequence[str | ContinuousGroup]) → list[str | ContinuousGroup]
Convert a single group or Sequence of groups to a list of groups (possibly with one element)
- Parameters:
groups – Sequence of groups or single group to convert
- Returns:
Actual list of groups, more easily handled by other functions.
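One detail matters when normalizing groups to a list: a str is itself a sequence of characters, so a naive list() call would split a column name into letters. The sketch below uses a hypothetical BinnedGroup stand-in for ContinuousGroup and a hypothetical as_group_list function. It shows one way to handle this, without claiming to match the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class BinnedGroup:
    """Hypothetical stand-in for ContinuousGroup, reduced to two fields."""

    name: str
    bins: int = 10


def as_group_list(groups):
    """Wrap a single group in a list, or copy a sequence of groups to a list."""
    # Check for single groups first: list("size") would give ['s', 'i', 'z', 'e'].
    if isinstance(groups, (str, BinnedGroup)):
        return [groups]
    return list(groups)


print(as_group_list("size"))                       # ['size']
print(as_group_list(("size", BinnedGroup("area"))))
```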
- make_pandas_compatible(data: DataFrame, g: str) → tuple[str, str, Literal[True]]
- make_pandas_compatible(data: DataFrame, g: ContinuousGroup, root_data: DataFrame | None = None, key_to_root: str = 'image_id') → tuple[str, Series, Literal[False]]
- make_pandas_compatible(data: DataFrame, g: str | ContinuousGroup, root_data: DataFrame | None = None, key_to_root: str = 'image_id') → tuple[str, str | Series, bool]
Construct group from group that will be used for pandas' groupby method.
- In the case it's only a name, keep it like that
- Otherwise, we need to construct an index of data cut according to the given bins. This will create a pandas.Series with categorical data
- Parameters:
data – input DataFrame, must contain the column considered in group
g – group depicting a column from data with potential bins. See group
root_data – Potential root data, where some ids in data refer to a particular row in root_data. Defaults to None.
key_to_root – column containing root_data row ids. Defaults to "image_id".
- Returns:
Tuple with the 3 following values:
- group name
- group that can be understood by pandas' groupby method. Can be a simple string referring to a column, or a pandas.Series with categorical data
- boolean indicating whether the group is categorical (True: different values are independent of each other) or continuous (False: different values represent ranges of a continuous value, constructing a discretized histogram)