balanced_groups
Functions
Check that histogram and groups are well-formed.
Compute the distance between two dataset share histograms (where bins are splits) by using Intersection over Union (IoU).
Convert dataframe to histograms by using pandas' GroupBy feature.
Compute earth mover distance between two columns of a dataframe.
Compute the distance between two distributions described in pandas Series representing histograms.
- check_groups(histogram: DataFrame | Series, category_groups: Series, continuous_groups: Series) → None
Check that histogram and groups are well-formed.
- Namely:
There should be no overlap between the two groups
histogram must have as many index dimensions as the total number of groups
histogram multi-index names must be unique
there should be a bijection between histogram index names and given category and continuous groups
- Parameters:
histogram – Series or DataFrame with one or two columns, and a multi-index whose names must match the two groups given below
category_groups – Series whose index are names of category groups, which should be contained in the histogram index
continuous_groups – Series whose index are names of continuous groups, which should be contained in the histogram index
- Raises:
AssertionError – raised when histogram and groups do not satisfy the criteria listed above
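The four criteria above can be restated as plain assertions. The sketch below is a hypothetical re-implementation for illustration only, not the library's code: the name `check_groups_sketch` is invented, and the group Series are assumed to carry weights as values, which the checks never read.

```python
import pandas as pd


def check_groups_sketch(histogram, category_groups, continuous_groups):
    """Hypothetical restatement of the four criteria; not the library's code."""
    cat = set(category_groups.index)
    cont = set(continuous_groups.index)
    names = list(histogram.index.names)
    # 1. No group name may be both categorical and continuous.
    assert not (cat & cont), "groups overlap"
    # 2. One index level per group.
    assert histogram.index.nlevels == len(cat) + len(cont), "wrong number of levels"
    # 3. Index level names must be unique.
    assert len(set(names)) == len(names), "duplicate index names"
    # 4. Index names and group names must be in one-to-one correspondence.
    assert set(names) == cat | cont, "index names do not match groups"


index = pd.MultiIndex.from_tuples(
    [("cat", 0.5), ("dog", 1.5)], names=["species", "weight"]
)
histogram = pd.Series([3, 1], index=index)
category_groups = pd.Series({"species": 1.0})    # values assumed to be weights
continuous_groups = pd.Series({"weight": 1.0})   # values assumed to be weights

# A well-formed histogram passes silently.
check_groups_sketch(histogram, category_groups, continuous_groups)
```

Passing the same name in both groups (for example `"species"` twice) trips the first assertion.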
Compute the distance between two dataset share histograms (where bins are splits) by using Intersection over Union (IoU). We use this distance instead of the Kullback–Leibler (KL) divergence because we don’t want an infinite distance when one of the splits is empty.
\[D = \frac{\sum_{i=0}^{n_{splits}} \min(left(i), right(i))} {\sum_{i=0}^{n_{splits}} \max(left(i), right(i))}\]
- Parameters:
left_share – Series representing target histogram of split sizes. It has to be normalized.
right_share – candidate histogram of split sizes
- Returns:
distance computed
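As a worked example, the ratio in the formula above can be evaluated directly with pandas. This is a minimal sketch, not the library's implementation: the name `iou_ratio` and the split names are invented, and whether the documented function returns this ratio or some transformation of it is not stated here. The ratio equals 1 for identical histograms and stays finite when a split is empty.

```python
import pandas as pd


def iou_ratio(left_share, right_share):
    """Sum of bin-wise minima over sum of bin-wise maxima, as in the formula above."""
    # Align on the union of split names; a split missing on one side counts as empty.
    left, right = left_share.align(right_share, fill_value=0.0)
    stacked = pd.concat([left, right], axis=1)
    return stacked.min(axis=1).sum() / stacked.max(axis=1).sum()


target = pd.Series({"train": 0.8, "val": 0.1, "test": 0.1})
candidate = pd.Series({"train": 0.7, "val": 0.3, "test": 0.0})  # empty test split

# minima sum to 0.7 + 0.1 + 0.0 = 0.8, maxima sum to 0.8 + 0.3 + 0.1 = 1.2
ratio = iou_ratio(target, candidate)
print(ratio)
```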
- df_to_hist(data: DataFrame, groupby: Any, full_index: Index | MultiIndex | None = None) → Series
Convert dataframe to histograms by using pandas’ GroupBy feature
- Parameters:
data – DataFrame from which the histogram will be computed. Must have the columns specified in the groupby option.
groupby – Same as the by option of pandas.DataFrame.groupby(); it is passed directly to the data.groupby method. Can be a mapping, a function, a label, or a list of labels.
full_index – Optional index used to reindex the resulting histogram. Useful when some values have an occurrence count of 0 and thus don’t appear in the induced index. Defaults to None.
- Returns:
pandas Series with multiindex corresponding to the count of occurrences for each specified group.
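The role of full_index is easiest to see on a small frame. The sketch below uses only plain pandas calls (`groupby(...).size()` and `reindex`) to mimic the described behavior. It is an assumption about how such a histogram can be built, not the function's actual body, and the column names are invented.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "species": ["cat", "cat", "dog"],
        "split": ["train", "val", "train"],
    }
)

# Count occurrences per (species, split) pair.
hist = data.groupby(["species", "split"]).size()

# The (dog, val) pair never occurs, so it is absent from the induced index.
full_index = pd.MultiIndex.from_product(
    [["cat", "dog"], ["train", "val"]], names=["species", "split"]
)
# Reindexing against the full index restores it with a count of 0.
hist_full = hist.reindex(full_index, fill_value=0)
print(hist_full)
```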
- earth_mover_distance(left: Series, right: Series, continuous_weights: Series, sinkhorn_lambda: float = 0) → float
Compute earth mover distance between two columns of a dataframe.
Note
In the case of sinkhorn_lambda > 0 this uses the Sinkhorn algorithm for a faster approximate value. See ot.sinkhorn2().
- Parameters:
left – input Series that represents a histogram (not necessarily normalized); its index represents the histogram bins
right – input Series that represents a histogram (not necessarily normalized); its index represents the histogram bins. Note that left and right don’t necessarily share the same bins.
continuous_weights – Series of index level names to consider in the left_right_df dataframe for the Sinkhorn algorithm.
sinkhorn_lambda – regularization weight for the Sinkhorn algorithm. If 0, the exact earth mover distance is used without regularization (slower but more accurate). Defaults to 0.
- Returns:
distance between the two histograms
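For intuition, the unregularized distance over a single continuous dimension can be computed without an optimal transport solver, because in one dimension the transport cost equals the area between the two cumulative distributions. The sketch below rests on several assumptions: `emd_1d` is an invented name, continuous_weights is ignored, both histograms are normalized to unit mass, and nothing here reflects how the library handles several dimensions or the Sinkhorn path.

```python
import numpy as np
import pandas as pd


def emd_1d(left, right):
    """Exact 1-D earth mover distance between two histograms indexed by bin position."""
    # Put both histograms on the sorted union of their bins; missing bins hold no mass.
    bins = left.index.union(right.index).sort_values()
    p = left.reindex(bins, fill_value=0.0).to_numpy(dtype=float)
    q = right.reindex(bins, fill_value=0.0).to_numpy(dtype=float)
    # Normalize so that both sides carry a total mass of 1 (an assumption of this sketch).
    p, q = p / p.sum(), q / q.sum()
    # In one dimension the optimal transport cost is the area between the two CDFs.
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(bins.to_numpy(dtype=float))))


# Same shape, shifted by one unit, and with different (unnormalized) totals.
left = pd.Series([2, 2], index=[0.0, 1.0])
right = pd.Series([1, 1], index=[1.0, 2.0])
distance = emd_1d(left, right)
print(distance)
```

Here the two histograms do not share the same bins, which is the situation the parameter description allows for. All the mass has to move by one unit, so the distance is 1.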
- hist_distance(left: Series, right: Series, category_weights: Series, continuous_weights: Series, sinkhorn_lambda: float = 0) → float
Compute the distance between two distributions described in pandas Series representing histograms. Both indexes must match and may have categorical data or continuous data. The distance between categorical data is computed with the Kullback–Leibler divergence, and the distance between continuous data is computed with the Earth mover distance.
The distance formula is then
(1) \[D = \sum_{0 \le i < p} \alpha_i KL\left( P_{cat, C_i}, Q_{cat, C_i} \right) + || \beta || \sum_{i \in \Omega_{cat}} \left( P_{cat}(i) \times EMD(P^\beta(i), Q^\beta(i)) \right)\]
- where
\(p \in \mathbb{N}\) and \(q \in \mathbb{N}\) are respectively the number of categorical dimensions and continuous dimensions
\(\Omega_{cat} \subset \mathbb{N}^p\) is the set of all possible categories, subdivided into \(p\) dimensions
\[\begin{split}\Omega_{cat} &= \{ c_{0,0}, c_{1,0}, \cdots, c_{n_0, 0} \} \times \cdots \times \{ c_{0, p-1}, \cdots, c_{n_{p-1}, p-1} \} \\ \Omega_{cat} &= C_0 \times \cdots \times C_{p-1}\end{split}\]
\(P\) is the probability function of the histogram
\[\begin{split}P : \begin{array}{lll} \Omega_{cat} \times \mathbb{R}^q & \rightarrow & [ 0, 1 ] \\ (x,y) = (x_0, \cdots, x_{p-1}, y_0, \cdots, y_{q-1}) & \mapsto & P(x,y) \end{array}\end{split}\]
\(P_{cat}\) is the agglomeration of \(P\) over continuous dimensions
\[\begin{split}P_{cat} : \begin{array}{lll} \Omega_{cat} & \rightarrow & [0, 1] \\ x & \mapsto & \iint_{y \in \mathbb{R}^q} P(x, y) dy \end{array}\end{split}\]
\(P_{cat, C_i}\) is the agglomeration of \(P\) over continuous dimensions and category dimensions except \(C_i\)
\[ \begin{align}\begin{aligned}P_{cat, C_i} &: C_i \rightarrow [0, 1]\\P_{cat, C_i}(x) &= \sum_{ x' \in C_0 \times \cdots \times C_{i-1} \times C_{i+1} \times \cdots \times C_{p-1} } \iint_{y \in \mathbb{R}^q} P(x'_0, \cdots, x'_{i-1}, x, x'_{i+1}, \cdots, x'_{p-1}, y) dy\end{aligned}\end{align} \]
\(P(x)\) is the probability distribution over continuous dimensions for a particular category \(x \in \Omega_{cat}\)
\[\begin{split}P(x) : \begin{array}{lll} \mathbb{R}^q & \rightarrow & [0, 1] \\ y & \mapsto & P(x, y) \end{array}\end{split}\]
\(P^\beta(x)\) is the weighted probability distribution over continuous dimensions for a particular category \(x\) and a weight vector \(\beta\)
\[ \begin{align}\begin{aligned}P^\beta(x) &: \mathbb{R}^q \rightarrow [0, 1]\\P^\beta(x,y) &= P \left(x, \frac{\beta}{|| \beta ||} \odot y\right)\end{aligned}\end{align} \]
\(\alpha \in \mathbb{R}^p\) and \(\beta \in \mathbb{R}^q\) are weight vectors associated with the importance of each dimension of \(\Omega_{cat} \times \mathbb{R}^q\)
\(\odot\) is the Hadamard product
\[\beta \odot y = (\beta_j y_j)_{0 \le j < q}\]
\(KL\) is the Kullback–Leibler divergence
\(EMD\) is the Earth Mover distance
Note
This formula is not symmetric; it is better suited to comparing a reference distribution (the left one) with a candidate distribution (the right one).
- Parameters:
left – pandas Series representing the left probability distribution (i.e. the reference)
right – pandas Series representing the right probability distribution (i.e. the candidate)
category_weights – weight Series associated with \(\alpha\), which is applied to the KL divergence (see formula (1)). Its index must be the names of the category groups, i.e. the index dimensions of left and right on which to apply the KL divergence.
continuous_weights – weight Series associated with \(\beta\), which is applied to the Earth mover’s distance (see formula (1)). Its index must be the names of the continuous groups, i.e. the index dimensions of left and right on which to apply the EMD.
sinkhorn_lambda – regularization term applied to the EMD (see earth_mover_distance()). Defaults to 0.
- Returns:
distance between the two multimodal distributions.
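The first sum of formula (1) can be illustrated on its own: for each categorical dimension, both histograms are marginalized onto that dimension and compared with a weighted KL divergence. The sketch below is a hypothetical illustration, not the library's code. The name `categorical_kl_term`, the level names, and the weights are all invented, and the Earth mover term is left out entirely.

```python
import numpy as np
import pandas as pd


def categorical_kl_term(left, right, category_weights):
    """First sum of formula (1): weighted KL divergence of each categorical marginal."""
    total = 0.0
    for name, alpha in category_weights.items():
        # Marginalize over every other index level, then normalize to probabilities.
        p = left.groupby(level=name).sum()
        q = right.groupby(level=name).sum()
        p, q = p / p.sum(), q / q.sum()
        q = q.reindex(p.index)
        # Bins with zero reference mass contribute nothing to KL(P || Q).
        mask = p > 0
        total += alpha * float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return total


index = pd.MultiIndex.from_product(
    [["cat", "dog"], ["train", "val"]], names=["species", "split"]
)
reference = pd.Series([2, 2, 2, 2], index=index)
candidate = pd.Series([3, 3, 1, 1], index=index)
weights = pd.Series({"species": 1.0, "split": 2.0})  # hypothetical alpha values

forward = categorical_kl_term(reference, candidate, weights)
backward = categorical_kl_term(candidate, reference, weights)
print(forward, backward)
```

The two calls return different values, which is the asymmetry described in the note: swapping the reference and the candidate changes the result. The split marginals are identical in both histograms, so only the species dimension contributes.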