balanced_groups

Functions

check_groups

Check that histogram and groups are well-formed.

dataset_share_distance

Compute the distance between two dataset share histograms (where bins are splits) by using Intersection over Union (IoU).

df_to_hist

Convert dataframe to histograms by using pandas' GroupBy feature

earth_mover_distance

Compute earth mover distance between two columns of a dataframe.

hist_distance

Compute the distance between two distributions described in pandas Series representing histograms.

check_groups(histogram: DataFrame | Series, category_groups: Series, continuous_groups: Series) → None

Check that histogram and groups are well-formed.

Namely:
  • There should be no overlap between the two groups

  • histogram must have as many index dimensions as the total number of groups

  • histogram multi-index names must be unique

  • there should be a bijection between histogram index names and given category and continuous groups

Parameters:
  • histogram – Series or DataFrame with one or two columns, and a multi index whose names must match the next two groups

  • category_groups – Series whose index are names of category groups, which should be contained in the histogram index

  • continuous_groups – Series whose index are names of continuous groups, which should be contained in the histogram index

Raises:

AssertionError – raised when the histogram and groups do not satisfy the criteria above
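The criteria above can be illustrated with plain pandas. The sketch below is not the library's implementation: `sketch_check_groups` is a hypothetical helper that restates the documented rules as assertions, and the level names and values are invented for the example.

```python
import pandas as pd


def sketch_check_groups(histogram, category_groups, continuous_groups):
    """Hypothetical restatement of the documented criteria as assertions."""
    cat = set(category_groups.index)
    cont = set(continuous_groups.index)
    names = list(histogram.index.names)

    # No overlap between the two groups.
    assert not cat & cont, "category and continuous groups overlap"
    # As many index dimensions as the total number of groups.
    assert histogram.index.nlevels == len(cat) + len(cont), "wrong number of levels"
    # Index names must be unique.
    assert len(set(names)) == len(names), "index names are not unique"
    # Bijection between index names and the union of both groups.
    assert set(names) == cat | cont, "index names do not match the groups"


index = pd.MultiIndex.from_product(
    [["cat", "dog"], [0.0, 1.0]], names=["label", "brightness"]
)
histogram = pd.Series([3, 1, 2, 4], index=index)
# Only the index of each groups Series matters here; the values are arbitrary.
category_groups = pd.Series({"label": 1.0})
continuous_groups = pd.Series({"brightness": 1.0})

sketch_check_groups(histogram, category_groups, continuous_groups)  # passes silently
```

Passing the same name in both groups, or a name that is not an index level, would trip one of the assertions.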

dataset_share_distance(left_share: Series, right_share: Series) → float

Compute the distance between two dataset share histograms (where bins are splits) using Intersection over Union (IoU). This distance is used instead of the KL divergence because it stays finite when one of the splits is empty.

\[D = \frac{\sum_{i=0}^{n_{splits}} \min(left(i), right(i))} {\sum_{i=0}^{n_{splits}} \max(left(i), right(i))}\]
Parameters:
  • left_share – Series representing target histogram of split sizes. It has to be normalized.

  • right_share – candidate histogram of split sizes

Returns:

distance computed
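As a rough illustration of the formula above, the following sketch evaluates the min/max ratio with pandas and numpy. `sketch_share_iou` is a hypothetical name and the split shares are made up; it shows the documented ratio only and is not the library's code.

```python
import numpy as np
import pandas as pd


def sketch_share_iou(left_share: pd.Series, right_share: pd.Series) -> float:
    """Evaluate the documented min/max ratio on two share histograms."""
    # Align on the union of split names; a missing split counts as empty.
    left, right = left_share.align(right_share, fill_value=0.0)
    return float(np.minimum(left, right).sum() / np.maximum(left, right).sum())


target = pd.Series({"train": 0.8, "val": 0.1, "test": 0.1})
candidate = pd.Series({"train": 0.7, "val": 0.3, "test": 0.0})  # empty test split

ratio = sketch_share_iou(target, candidate)  # (0.7 + 0.1 + 0.0) / (0.8 + 0.3 + 0.1)
```

The value stays finite even though the candidate's `test` split is empty. Note that the ratio as written equals 1 for identical histograms; whether the function returns this ratio directly or a value derived from it is not stated here.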

df_to_hist(data: DataFrame, groupby: Any, full_index: Index | MultiIndex | None = None) → Series

Convert dataframe to histograms by using pandas’ GroupBy feature

Parameters:
  • data – DataFrame from which the histogram will be computed. Must have the columns specified in the groupby option.

  • groupby – Same as the by option of pandas.DataFrame.groupby(); it is passed directly to data.groupby(). Can be a mapping, a function, a label, or a list of labels.

  • full_index – Optional index used to reindex the resulting histogram. Useful when some values have an occurrence count of 0 and thus don’t appear in the induced index. Defaults to None.

Returns:

pandas Series with multiindex corresponding to the count of occurrences for each specified group.
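The role of full_index is easiest to see with plain pandas. The snippet below is a sketch of the described behavior (group, count, then reindex), not the function's source; the column names and values are invented.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "label": ["cat", "cat", "dog", "cat"],
        "color": ["black", "white", "black", "black"],
    }
)

# Count the rows in each (label, color) group.
hist = data.groupby(["label", "color"]).size()

# The ("dog", "white") combination never occurs, so it is absent from `hist`.
full_index = pd.MultiIndex.from_product(
    [["cat", "dog"], ["black", "white"]], names=["label", "color"]
)

# Reindexing on the full index makes the empty bin explicit with a count of 0.
full_hist = hist.reindex(full_index, fill_value=0)
```

Here `hist` has three entries while `full_hist` has four, with the missing combination filled in as 0.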

earth_mover_distance(left: Series, right: Series, continuous_weights: Series, sinkhorn_lambda: float = 0) → float

Compute earth mover distance between two columns of a dataframe.

Note

In the case of sinkhorn_lambda > 0 this uses the sinkhorn algorithm for a faster approximate value.

See ot.sinkhorn2()

Parameters:
  • left – input Series that represents a histogram (not necessarily normalized); its index represents the histogram bins

  • right – input Series that represents a histogram (not necessarily normalized); its index represents the histogram bins. Note that left and right don’t necessarily share the same bins.

  • continuous_weights – Series whose index holds the names of the index levels of left and right to use as continuous dimensions when computing the distance.

  • sinkhorn_lambda – regularization weight for sinkhorn algorithm. If 0, will use literal earth mover distance without regularization (slower but more accurate). Defaults to 0.

Returns:

distance between the two histograms
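For intuition, the unregularized case can be sketched in one dimension, where the earth mover distance between two normalized histograms equals the area between their cumulative distributions. `sketch_emd_1d` is a hypothetical helper limited to a single numeric index level; it does not call the optimal transport library referenced above and ignores continuous_weights and sinkhorn_lambda.

```python
import numpy as np
import pandas as pd


def sketch_emd_1d(left: pd.Series, right: pd.Series) -> float:
    """1-D earth mover distance between two histograms indexed by numeric bins."""
    # Put both histograms on the union of their bins; missing bins hold no mass.
    left, right = left.align(right, fill_value=0.0)
    left, right = left.sort_index(), right.sort_index()
    # Normalize so that both histograms carry a total mass of 1.
    left, right = left / left.sum(), right / right.sum()
    bins = left.index.to_numpy(dtype=float)
    # In 1-D, the EMD is the area between the two cumulative distributions.
    cdf_gap = np.abs(np.cumsum(left.to_numpy() - right.to_numpy()))
    return float(np.sum(cdf_gap[:-1] * np.diff(bins)))


left = pd.Series([2, 0, 0], index=[0.0, 1.0, 2.0])   # all mass in bin 0
right = pd.Series([0, 0, 5], index=[0.0, 1.0, 2.0])  # all mass in bin 2

distance = sketch_emd_1d(left, right)  # one unit of mass moved over a distance of 2
```

Because both inputs are normalized inside the helper, the unnormalized counts in the example are accepted, matching the "not necessarily normalized" wording of the parameters.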

hist_distance(left: Series, right: Series, category_weights: Series, continuous_weights: Series, sinkhorn_lambda: float = 0) → float

Compute the distance between two distributions described by pandas Series representing histograms. Both indexes must match and may hold categorical or continuous data. The distance between categorical data is computed with the Kullback–Leibler divergence, and the distance between continuous data is computed with the Earth mover distance.

The distance formula is then

\[D = \sum_{0 \le i < p} \alpha_i KL\left( P_{cat, C_i}, Q_{cat, C_i} \right) + || \beta || \sum_{i \in \Omega_{cat}} \left( P_{cat}(i) \times EMD(P^\beta(i), Q^\beta(i)) \right) \tag{1}\]
where
  • \(p \in \mathbb{N}\) and \(q \in \mathbb{N}\) are respectively the number of categorical dimensions and continuous dimensions

  • \(\Omega_{cat} \subset \mathbb{N}^p\) is the set of all possible categories, subdivided into \(p\) dimensions

    \[\begin{split}\Omega_{cat} &= \{ c_{0,0}, c_{1,0} \cdots, c_{n_0, 0} \} \times \cdots \times \{ c_{0, p}, \cdots, c_{n_p, p} \} \\ \Omega_{cat} &= C_0 \times \cdots \times C_p\end{split}\]
  • \(P\) is the probability function of the histogram

    \[\begin{split}P : \begin{array}{lll} \Omega_{cat} \times \mathbb{R}^q & \rightarrow & [ 0, 1 ] \\ (x,y) = (x_0, \cdots, x_p, y_0, \cdots, y_q) & \mapsto & P(x,y) \end{array}\end{split}\]
  • \(P_{cat}\) is the agglomeration of \(P\) over continuous dimensions.

    \[\begin{split}P_{cat} : \begin{array}{lll} \Omega_{cat} & \rightarrow & [0, 1] \\ x & \mapsto & \iint_{y \in \mathbb{R}^q} P(x, y) dy \end{array}\end{split}\]
  • \(P_{cat, C_i}\) is the agglomeration of \(P\) over continuous dimensions and category dimensions except \(C_i\)

    \[ \begin{align}\begin{aligned}P_{cat, C_i} &: C_i \rightarrow [0, 1]\\P_{cat, C_i}(x) &= \sum_{ x' \in C_0 \times \cdots \times C_{i-1} \times C_{i+1} \times \cdots \times C_p } \iint_{y \in \mathbb{R}^q} P(x'_0, \cdots x'_{i-1}, x, x'_{i+1} \cdots x'_p, y) dy\end{aligned}\end{align} \]
  • \(P(x)\) is the probability distribution over continuous dimensions for a particular category \(x \in \Omega_{cat}\).

    \[\begin{split}P(x) : \begin{array}{lll} \mathbb{R}^q & \rightarrow & [0, 1] \\ y & \mapsto & P(x, y) \end{array}\end{split}\]
  • \(P^\beta(x)\) is the weighted probability distribution over continuous dimensions for a particular class \(x\) and a weight vector \(\beta\)

    \[ \begin{align}\begin{aligned}P^\beta(x) &: \mathbb{R}^q \rightarrow [0, 1]\\P^\beta(x,y) &= P \left(x, \frac{\beta}{|| \beta ||} \odot y\right)\end{aligned}\end{align} \]
  • \(\alpha \in \mathbb{R}^p\) and \(\beta \in \mathbb{R}^q\) are weight vectors giving the importance of each dimension of \(\Omega_{cat} \times \mathbb{R}^q\)

  • \(\odot\) is the Hadamard product

    \[\beta \odot y = (\beta_j y_j)_{0 \le j < q}\]
  • \(KL\) is the Kullback–Leibler divergence

  • \(EMD\) is the Earth Mover distance

Note

This formula is not symmetric: it is better suited to comparing a reference distribution (the left one) with a candidate distribution (the right one).

Parameters:
  • left – pandas Series representing left distribution of probability (i.e. the reference)

  • right – pandas Series representing right distribution of probability (i.e. the candidate)

  • category_weights – weights Series vector associated with \(\alpha\) which is applied to the KL divergence (see formula (1)). Its index must be the names of category groups, that represent left and right indexes dimensions on which to apply KL divergence.

  • continuous_weights – weight Series vector associated with \(\beta\) which is applied to the Earth mover’s distance (see formula (1)). Its index must be the names of continuous groups, that represent left and right indexes dimensions on which to apply EMD.

  • sinkhorn_lambda – regularization term applied to EMD (see earth_mover_distance()). Defaults to 0

Returns:

distance between the two multimodal distributions.
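The categorical part of formula (1) can be sketched as follows: each histogram is marginalized onto one categorical level at a time, and the resulting KL divergences are summed with the weights \(\alpha_i\). The helper names, level names, and values are hypothetical, the continuous (EMD) term is left out, and this is an illustration of the formula rather than the library's implementation.

```python
import numpy as np
import pandas as pd


def sketch_kl(p: pd.Series, q: pd.Series) -> float:
    """Kullback-Leibler divergence KL(p || q) for aligned, normalized histograms."""
    mask = p > 0  # terms with p(i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def sketch_categorical_term(left, right, category_weights) -> float:
    """Weighted sum of KL divergences between per-level marginals."""
    total = 0.0
    for level, alpha in category_weights.items():
        # Marginalize over every index level except `level`.
        p = left.groupby(level=level).sum()
        q = right.groupby(level=level).sum()
        total += alpha * sketch_kl(p / p.sum(), q / q.sum())
    return total


index = pd.MultiIndex.from_product(
    [["cat", "dog"], ["day", "night"]], names=["label", "time"]
)
reference = pd.Series([0.4, 0.1, 0.4, 0.1], index=index)
candidate = pd.Series([0.25, 0.25, 0.25, 0.25], index=index)
weights = pd.Series({"label": 1.0, "time": 2.0})

forward = sketch_categorical_term(reference, candidate, weights)
backward = sketch_categorical_term(candidate, reference, weights)
```

Swapping the arguments changes the result (`forward` differs from `backward`), which is the asymmetry mentioned in the note: the left argument plays the role of the reference.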