disjoint_groups

Functions

factorize_sets

From an indexable sequence of sets, partition all possible values into factor sets so that any two elements of a given factor set can be linked by a chain of input sets in which consecutive sets have a non-empty intersection.

give_already_assigned

Divide a DataFrame with a split column into chunks with an assigned split and unassigned chunks.

make_atomic_chunks

Subdivide the input DataFrame into disjoint chunks based on the given columns.

Classes

IndexedSet(index, merged_set)

Class representing a set together with the indices of the sets in the initial list that were used to construct it.

class IndexedSet(index: set[int], merged_set: set)

Class representing a set together with the indices of the sets in the initial list that were used to construct it. In other words, there is an original list of sets, and the union of all the indexed sets makes up the current set.

index: set[int]

Indices $i$ of the sets $S_i$ that were used when constructing this set.

is_disjoint(other: IndexedSet) → bool

Tell whether the intersection between the current indexed set and another one is empty.

Parameters:

other – other indexed set that we want the intersection with

Returns:

True if intersection is empty, False otherwise

merged_set: set

Resulting set.

\[S = \bigcup_{i \in \text{index}} S_i\]
union(*others: IndexedSet) → IndexedSet

Perform the union operation. The union is applied to both the index sets and the sets themselves.

Parameters:

*others – Iterable of $n$ other indexed sets \((S_i, \text{index}_i)\) to perform the union operation

Returns:

new indexed set with

\[\begin{split}\text{index} &= \text{index}_1 \cup \text{index}_2 \cup \cdots \cup \text{index}_n \\ S &= S_1 \cup S_2 \cup \cdots \cup S_n\end{split}\]
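The behaviour described for `IndexedSet` can be illustrated with a minimal sketch. The class name `IndexedSetSketch` is hypothetical and the code is written only from the description above, not taken from the library source. In particular, it assumes that `is_disjoint` compares the merged values rather than the index sets, which the documentation does not state explicitly.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IndexedSetSketch:
    """Hypothetical stand-in for IndexedSet, written from its description only."""

    index: set[int]  # positions of the original sets merged into this one
    merged_set: set  # union of the original sets listed in `index`

    def is_disjoint(self, other: IndexedSetSketch) -> bool:
        # Assumption: disjointness is tested on the merged values.
        return self.merged_set.isdisjoint(other.merged_set)

    def union(self, *others: IndexedSetSketch) -> IndexedSetSketch:
        # The union is applied to both the index sets and the value sets.
        return IndexedSetSketch(
            index=self.index.union(*(o.index for o in others)),
            merged_set=self.merged_set.union(*(o.merged_set for o in others)),
        )


a = IndexedSetSketch({0}, {"x", "y"})
b = IndexedSetSketch({1}, {"y", "z"})
c = IndexedSetSketch({2}, {"w"})

merged = a.union(b)
print(merged.index, merged.merged_set)
print(a.is_disjoint(b), merged.is_disjoint(c))
```

Returning a new object from `union` instead of mutating in place keeps the original indexed sets reusable, which matches the "new indexed set" wording of the return description.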

factorize_sets(input_sets: Sequence[set]) → list[list[int]]

From an indexable sequence of sets, partition all possible values into factor sets so that any two elements of a given factor set can be linked by a chain of input sets in which consecutive sets have a non-empty intersection.

\[ \begin{align}\begin{aligned}\widehat{S} = \bigcup_i S_i, \quad S_i \in \text{input sets}\\\forall x,y \in \widehat{S}, \exists i_0, i_1, \cdots, i_n, \; x \in S_{i_0}, \; y \in S_{i_n}, \; \forall j \in \{0, \dots, n-1\}, \; S_{i_j} \cap S_{i_{j+1}} \neq \emptyset\end{aligned}\end{align} \]
Parameters:

input_sets – sequence of sets with possible overlapping values that need to be factorized.

Returns:

List of set indices for each factor, that is, the indices into the input sequence whose union recreates each factor set.
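One way to obtain such a factorization is a union-find (disjoint-set) pass over the input sets. The sketch below is an assumption about how the result could be computed, not the library's implementation, and the name `factorize_sets_sketch` is hypothetical.

```python
from __future__ import annotations

from collections.abc import Sequence


def factorize_sets_sketch(input_sets: Sequence[set]) -> list[list[int]]:
    """Group indices of sets that are linked through chains of overlaps."""
    parent = list(range(len(input_sets)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner: dict = {}  # value -> index of the first set that contained it
    for i, current in enumerate(input_sets):
        for value in current:
            if value in first_owner:
                # Sharing a value puts both sets in the same factor.
                parent[find(i)] = find(first_owner[value])
            else:
                first_owner[value] = i

    factors: dict[int, list[int]] = {}
    for i in range(len(input_sets)):
        factors.setdefault(find(i), []).append(i)
    return list(factors.values())


sets = [{"A", "B"}, {"C", "D"}, {"A", "D"}, {"E"}]
print(factorize_sets_sketch(sets))  # [[0, 1, 2], [3]]
```

Sets 0 and 1 share no value directly, but set 2 overlaps both, so all three indices end up in the same factor while set 3 stays alone.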

give_already_assigned(data: DataFrame, split_column: str = 'split', split_names: Iterable[str] = ()) → tuple[list[DataFrame], dict[str, DataFrame]]

Divide a DataFrame with a split column into chunks with an assigned split and unassigned chunks. Unassigned chunks are chunks with an invalid split value (such as NaN or None) or a split value that is not in split_names.

Parameters:
  • data – input DataFrame to divide

  • split_column – name of the split column. Defaults to “split”.

  • split_names – list of allowed split names. If the split value is not in it, the group is considered unassigned

Returns:

tuple with 2 elements
  • list of unassigned DataFrame groups

  • dictionary of assigned DataFrame groups where key is the split name
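The assigned/unassigned distinction can be sketched with plain pandas operations. The function name `split_assigned_sketch` is hypothetical, and the sketch assumes that all unassigned rows are returned as one single group, since the documentation does not say how unassigned rows are grouped.

```python
from __future__ import annotations

from collections.abc import Iterable

import pandas as pd


def split_assigned_sketch(
    data: pd.DataFrame,
    split_column: str = "split",
    split_names: Iterable[str] = (),
) -> tuple[list[pd.DataFrame], dict[str, pd.DataFrame]]:
    """Separate rows with a valid split name from rows without one."""
    allowed = list(split_names)
    # NaN, None and names outside `allowed` all evaluate to False here.
    is_assigned = data[split_column].isin(allowed)

    # Assumption: every unassigned row goes into one single group.
    unassigned = [data[~is_assigned]] if (~is_assigned).any() else []

    # One DataFrame per split name actually present in the data.
    assigned = {
        str(name): group
        for name, group in data[is_assigned].groupby(split_column)
    }
    return unassigned, assigned


frame = pd.DataFrame(
    {"id": [1, 2, 3, 4], "split": ["train", None, "test", "other"]}
)
unassigned, assigned = split_assigned_sketch(frame, split_names=["train", "test"])
print(sorted(assigned))
print(unassigned[0]["id"].tolist())
```

Here the row with a missing split and the row labelled `"other"` are both treated as unassigned, because neither value appears in `split_names`.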

make_atomic_chunks(data: DataFrame, groups: Iterable[str | Series], split_column: str = 'split', split_names: Iterable[str] = ()) → tuple[list[DataFrame], dict[str, DataFrame]]

Subdivide the input DataFrame into disjoint chunks based on the given columns. In other words, two rows in distinct chunks never share a value in any of the involved columns, and two rows in the same chunk can always be linked by a chain of rows within that chunk. For example, $(A, B)$ and $(C, D)$ have different values in each column, but if there exists a row $(A, D)$, then we can build the chain \((A, B) \rightarrow (A, D) \rightarrow (C, D)\), which means the three rows will be in the same chunk.
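The chaining rule can be sketched on plain tuples instead of a DataFrame. This is a simplified illustration under stated assumptions, not the library's implementation: the name `atomic_chunks_sketch` is hypothetical, rows are tuples, and each tuple position plays the role of one group column.

```python
from __future__ import annotations


def atomic_chunks_sketch(rows: list[tuple]) -> list[list[tuple]]:
    """Group rows so that rows in different groups share no column value."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen: dict = {}  # (column position, value) -> first row index holding it
    for i, row in enumerate(rows):
        for key in enumerate(row):
            if key in seen:
                # Same value in the same column: both rows share a chunk.
                parent[find(i)] = find(seen[key])
            else:
                seen[key] = i

    chunks: dict[int, list[tuple]] = {}
    for i, row in enumerate(rows):
        chunks.setdefault(find(i), []).append(row)
    return list(chunks.values())


rows = [("A", "B"), ("C", "D"), ("A", "D"), ("E", "F")]
print(atomic_chunks_sketch(rows))
# [[('A', 'B'), ('C', 'D'), ('A', 'D')], [('E', 'F')]]
```

The row `("A", "D")` links the first two rows into one chunk, while `("E", "F")` shares no value with any other row and forms its own chunk.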

Note

If the data has a split column with non-NaN values, the corresponding rows and the chunk they are linked to are assigned entirely to that split. However, an error is raised if a theoretically indivisible chunk contains rows with different split values.

Parameters:
  • data – DataFrame to be split into disjoint chunks.

  • groups – groups to consider when separating the chunks. If a group is a string, the DataFrame given in data must include a column with this name. If a group is a pandas categorical Series, it must have the same index as the DataFrame given in data.

  • split_column – Name of the column in data from which the split value is read. Rows with values within split_names are considered assigned.

  • split_names – Names of the wanted splits. Rows with split values outside of it are considered unassigned.

Returns:

  1. List of DataFrames corresponding to the disjoint chunks. Concatenating the returned DataFrames would yield the input DataFrame.

  2. Dictionary of already assigned atomic chunks, assigned because the “split” value was already filled in for at least one of their rows.