disjoint_groups
Functions
| give_already_assigned | Divide a DataFrame with a split column into chunks with an assigned split and unassigned chunks. |
| make_atomic_chunks | Subdivide the input DataFrame into dissociated chunks from given columns. |
- give_already_assigned(data: DataFrame, split_column: str = 'split', split_names: Iterable[str] = ()) → tuple[list[DataFrame], dict[str, DataFrame]]
Divide a DataFrame with a split column into chunks with an assigned split and unassigned chunks. Unassigned chunks are chunks with invalid split values (such as NaN or None) or with split values that are not in the list split_names.
- Parameters:
data – input DataFrame to divide
split_column – name of the split column. Defaults to “split”.
split_names – list of allowed split names. If the split value is not in it, the group is considered unassigned
- Returns:
Tuple with 2 elements:
list of unassigned DataFrame groups
dictionary of assigned DataFrame groups, keyed by split name
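The behaviour described above can be approximated with plain pandas. The following is only a rough sketch, not the library's implementation: the function name `sketch_give_already_assigned` and the `sample` column are hypothetical, and gathering all unassigned rows into a single group is an assumption, since the text does not say how the unassigned list is partitioned.

```python
import pandas as pd


def sketch_give_already_assigned(data, split_column="split", split_names=()):
    """Illustrative stand-in only, not the library's implementation."""
    names = list(split_names)
    values = data[split_column]
    # One DataFrame per allowed split name that actually occurs in the data.
    assigned = {
        name: data[values == name] for name in names if (values == name).any()
    }
    # NaN/None and names outside split_names are all treated as unassigned.
    # Assumption: they are returned together as a single group.
    rest = data[~values.isin(names)]
    unassigned = [rest] if not rest.empty else []
    return unassigned, assigned


df = pd.DataFrame(
    {
        "sample": ["s1", "s2", "s3", "s4", "s5"],
        "split": ["train", "test", None, "holdout", "train"],
    }
)
unassigned, assigned = sketch_give_already_assigned(
    df, split_names=("train", "test")
)
```

Here "holdout" is not in `split_names`, so its row ends up with the row whose split is missing, in the unassigned list.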
- make_atomic_chunks(data: DataFrame, groups: Iterable[str | Series], split_column: str = 'split', split_names: Iterable[str] = ()) → tuple[list[DataFrame], dict[str, DataFrame]]
Subdivide the input DataFrame into dissociated chunks from the given columns. In other words, two rows in distinct chunks never share a value in any of the involved columns, while any two rows in the same chunk can be linked by a chain of rows, all in that chunk, in which each row shares a value with the next. For example, \((A, B)\) and \((C, D)\) have different values in each column, but if there exists a row \((A, D)\), then we can make the chain \((A, B) \rightarrow (A, D) \rightarrow (C, D)\), which means the three rows will be in the same chunk.
Notes
If the data has a split column with non-NaN values, the corresponding rows and the chunk they are linked to are assigned entirely to that split. However, if a theoretically indivisible chunk has rows with different split values, the whole chunk is left unassigned.
NaN, None or NA values are each considered a unique group value, different from all other values, NaN or not. This is thus equivalent to having e.g. a UUID.
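The assignment rule in these notes can be illustrated on chunks that have already been computed. This is a hedged sketch rather than the library's code: `sketch_assign_chunks` and the `id` column are hypothetical names. Two things are assumptions: concatenating several chunks that share a split name into one DataFrame, and treating split values outside `split_names` like missing values.

```python
import pandas as pd


def sketch_assign_chunks(chunks, split_column="split", split_names=()):
    """Classify indivisible chunks as assigned or unassigned (illustration)."""
    allowed = set(split_names)
    unassigned, parts = [], {}
    for chunk in chunks:
        # Valid split values present in this chunk.
        found = {v for v in chunk[split_column] if not pd.isna(v) and v in allowed}
        if len(found) == 1:
            # Exactly one split name: the whole chunk goes to that split.
            parts.setdefault(found.pop(), []).append(chunk)
        else:
            # No valid split value, or conflicting ones: left unassigned.
            unassigned.append(chunk)
    # Assumption: chunks sharing a split name are concatenated.
    assigned = {name: pd.concat(group) for name, group in parts.items()}
    return unassigned, assigned


chunk_a = pd.DataFrame({"id": [1, 2], "split": ["train", None]})
chunk_b = pd.DataFrame({"id": [3, 4], "split": ["train", "test"]})
chunk_c = pd.DataFrame({"id": [5], "split": [None]})
unassigned, assigned = sketch_assign_chunks(
    [chunk_a, chunk_b, chunk_c], split_names=("train", "test")
)
```

In this example `chunk_a` is assigned to "train" as a whole, `chunk_b` stays unassigned because it mixes "train" and "test", and `chunk_c` stays unassigned because it has no split value.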
- Parameters:
data – DataFrame to be split into dissociated chunks.
groups – Groups to consider for the dissociation. If a group is a string, the DataFrame given in data must include a column with this name. If a group is a pandas categorical Series, the DataFrame given in data must have the same index.
split_column – Name of the column in data from which the split value is read. Rows with values within split_names are considered assigned.
split_names – Names of the wanted splits. Rows with split values outside of it are considered unassigned.
- Returns:
List of DataFrames corresponding to the dissociated chunks.
Dictionary of already assigned atomic chunks, i.e. chunks for which the “split” value was already filled in for at least one of the rows.
Concatenating the DataFrames in the returned list and the dictionary values yields the input DataFrame.
- Return type:
Tuple with 2 elements
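The chunking itself amounts to finding connected components, where rows are linked through shared column values. The following minimal sketch uses a union-find structure. It is not the library's implementation: `sketch_atomic_chunks` and the column names `left` and `right` are hypothetical, only column-name groups are handled (not Series), and the split column is ignored.

```python
import numpy as np
import pandas as pd


def sketch_atomic_chunks(data, groups):
    """Group rows linked by shared values in `groups` (illustration only)."""
    groups = list(groups)
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    rows = data[groups].itertuples(index=False, name=None)
    for pos, values in enumerate(rows):
        for column, value in zip(groups, values):
            if pd.isna(value):
                # A missing value never links rows: unique node per cell.
                key = ("missing", pos, column)
            else:
                key = ("value", column, value)
            union(("row", pos), key)

    # Number the components in order of first appearance.
    labels = {}
    chunk_id = [
        labels.setdefault(find(("row", pos)), len(labels))
        for pos in range(len(data))
    ]
    return [chunk for _, chunk in data.groupby(np.asarray(chunk_id), sort=False)]


df = pd.DataFrame(
    {"left": ["A", "A", "C", "E"], "right": ["B", "D", "D", "F"]}
)
chunks = sketch_atomic_chunks(df, ["left", "right"])
```

With this input, the rows \((A, B)\), \((A, D)\) and \((C, D)\) land in one chunk through the chain described above, and \((E, F)\) forms its own chunk.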