dataset_splitter

Functions

get_winner

Get the best split, i.e. the one with the lowest cost, from series of precomputed costs.

split_dataframe

Perform the split operation on input_data and root_data.

get_winner(split_hists: DataFrame | None, split_hists_distances: Series | None, candidate_hist: Series | None, split_sizes: Series, candidate_size: int, hist_cost_function: Callable[[Series], float], share_cost_function: Callable[[Series], float], hist_cost_weight: float = 1, share_cost_weight: float = 1) -> tuple[str, DataFrame | None, Series | None, Series]

Get the best split, i.e. the one with the lowest cost, from series of precomputed costs. The series are the histogram costs, i.e. the distribution distances for values the user wishes to be evenly distributed between splits, and the share costs, i.e. the IOU distance between

The result is then the key of the dictionary with the lowest consolidated cost. A special case is when all distribution costs are infinite. In that case, only consider the share cost.

Parameters:
  • split_hists – DataFrame containing the current histograms of splits. Columns are splits, and rows are histogram bins

  • split_hists_distances – Series containing the cached distances between each split histogram and the target histogram. If set to None, they will be recomputed.

  • candidate_hist – Series containing the histogram of the candidate atom. Rows are the same as in split_hists.

  • split_sizes – Series containing the size of each split. Each row is a split.

  • candidate_size – size of the current atom. Depending on how the split is done, it is not necessarily the same as the sum of the candidate histogram.

  • hist_cost_function – function that computes a score for a dataframe of histograms. This will be used to compute the histogram cost for each split if the atom was to be assigned to it.

  • share_cost_function – function that computes a score for dataset repartition against a target split share. This is used to compute the cost of assigning the candidate atom to each split.

  • hist_cost_weight – weight applied to histogram cost to choose the winner split. The higher, the more important the histogram cost will be for the decision. Defaults to 1.

  • share_cost_weight – weight applied to share cost to choose the winner split. The higher, the more important the share cost will be for the decision. Defaults to 1.

Returns:

A tuple with 4 elements
  • name of the winning split

  • updated split histograms, as a DataFrame similar to split_hists (None if given split_hists was None)

  • updated split hist costs, as a Series, similar to split_hists_distances (None if given split_hists was None)

  • updated share of splits, as a Series, similar to split_shares
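
The selection rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name pick_winner is hypothetical, and combining the two costs as a weighted sum is an assumption, since the text only states that a consolidated cost is minimized and that the share cost alone is used when every histogram cost is infinite.

```python
import numpy as np
import pandas as pd


def pick_winner(
    hist_costs: pd.Series,
    share_costs: pd.Series,
    hist_cost_weight: float = 1.0,
    share_cost_weight: float = 1.0,
) -> str:
    """Return the split name with the lowest consolidated cost (illustrative)."""
    if np.isinf(hist_costs).all():
        # Special case from the text: every histogram cost is infinite,
        # so only the share cost can discriminate between splits.
        total = share_cost_weight * share_costs
    else:
        # Assumption: the consolidated cost is a weighted sum of both costs.
        total = hist_cost_weight * hist_costs + share_cost_weight * share_costs
    return total.idxmin()


# Both series are indexed by split name, one precomputed cost per split.
hist_costs = pd.Series({"train": 0.30, "valid": 0.10})
share_costs = pd.Series({"train": 0.05, "valid": 0.40})

default_winner = pick_winner(hist_costs, share_costs)
hist_heavy_winner = pick_winner(hist_costs, share_costs, hist_cost_weight=5)
```

With equal weights the share cost dominates in this toy data, while raising hist_cost_weight shifts the decision toward the split with the better histogram match.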

split_dataframe(input_data: DataFrame, root_data: DataFrame, key_to_root: str = 'image_id', input_seed: int = 0, split_names: Iterable[str] = ('train', 'valid'), target_split_shares: Iterable[float] = (0.8, 0.2), split_column_name: str = 'split', keep_separate_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('image_id',), keep_balanced_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('category_id',), keep_balanced_groups_weights: Iterable[float] | None = None, inplace: bool = False, split_at_root_level: bool = False, hist_cost_weight: float = 1, share_cost_weight: float = 1, earth_mover_regularization: float = 0) -> tuple[DataFrame, DataFrame]
split_dataframe(input_data: DataFrame, root_data: None = None, key_to_root: str = 'image_id', input_seed: int = 0, split_names: Iterable[str] = ('train', 'valid'), target_split_shares: Iterable[float] = (0.8, 0.2), split_column_name: str = 'split', keep_separate_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('image_id',), keep_balanced_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('category_id',), keep_balanced_groups_weights: Iterable[float] | None = None, inplace: bool = False, split_at_root_level: bool = False, hist_cost_weight: float = 1, share_cost_weight: float = 1, earth_mover_regularization: float = 0) -> DataFrame
split_dataframe(input_data: DataFrame, root_data: DataFrame | None = None, key_to_root: str = 'image_id', input_seed: int = 0, split_names: Iterable[str] = ('train', 'valid'), target_split_shares: Iterable[float] = (0.8, 0.2), split_column_name: str = 'split', keep_separate_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('image_id',), keep_balanced_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('category_id',), keep_balanced_groups_weights: Iterable[float] | None = None, inplace: bool = False, split_at_root_level: bool = False, hist_cost_weight: float = 1, share_cost_weight: float = 1, earth_mover_regularization: float = 0) -> DataFrame | tuple[DataFrame, DataFrame]

Perform the split operation on input_data and root_data.

This algorithm works in 2 steps:

  1. Divide the dataframe into atomic sub frames. Given the image and annotation attributes that need to be kept separate, we can construct sub frames of elements that cannot be in different splits.

  2. Construct the split dataframes iteratively by trying to keep the given column values balanced between splits, while keeping split sizes as close to the target shares as possible. Each atomic sub frame is routed to the split that minimizes a cost function designed to meet both repartition targets.
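
The two steps above can be sketched in a few lines of pandas. This is a simplified illustration under stated assumptions, not the library's algorithm: the function greedy_split is hypothetical, atoms are built from a single column, and the cost is a stand-in share cost only (L1 distance to the target shares), with no histogram term.

```python
import numpy as np
import pandas as pd


def greedy_split(df, separate_col, names=("train", "valid"),
                 shares=(0.8, 0.2), seed=0):
    """Toy two-step split: build atoms, then route each one greedily."""
    target = pd.Series(list(shares), index=list(names), dtype=float)
    target = target / target.sum()  # normalize shares so they sum to 1

    # Step 1: an atom is the set of rows sharing one value of separate_col.
    atom_sizes = df.groupby(separate_col).size()

    # Shuffle the atoms with a fixed seed before starting the assignment.
    rng = np.random.default_rng(seed)
    order = atom_sizes.index[rng.permutation(len(atom_sizes))]

    # Step 2: route each atom to the split that minimizes the stand-in cost.
    sizes = pd.Series(0.0, index=target.index)
    assignment = {}
    for atom in order:
        costs = {}
        for name in target.index:
            trial = sizes.copy()
            trial.loc[name] += atom_sizes.loc[atom]
            # Stand-in share cost: L1 distance between trial shares and target.
            costs[name] = float((trial / trial.sum() - target).abs().sum())
        winner = min(costs, key=costs.get)
        sizes.loc[winner] += atom_sizes.loc[atom]
        assignment[atom] = winner

    # Every row inherits the split of the atom it belongs to.
    return df[separate_col].map(assignment)


# Toy data: 50 images with 4 annotations each.
df = pd.DataFrame({
    "image_id": np.repeat(np.arange(50), 4),
    "category_id": np.tile([0, 1], 100),
})
df["split"] = greedy_split(df, "image_id")
```

Because whole atoms are assigned, no image_id can end up in two splits, and the greedy share cost keeps the train share close to the 0.8 target.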

Parameters:
  • input_data – DataFrame containing input_data information, must contain at least the column given in key_to_root.

  • root_data – DataFrame containing image information. Its index must contain all values contained in the image_id column of the input_data DataFrame.

  • key_to_root – Name of the column in input_data that refers to the id in the root_data DataFrame. Defaults to “image_id”.

  • input_seed – Seed used for shuffling sub frames before beginning step 2 of splitting algorithm. Defaults to 0.

  • split_names – Names of splits. Must be the same length as target_split_shares. Defaults to (“train”, “valid”).

  • target_split_shares – List of relative size of each split. Must be the same length as split_names, and will be normalized so that its sum is 1. Defaults to (0.8, 0.2).

  • split_column_name – Name of the column where the split value of dataset will be read and written. Defaults to “split”.

  • keep_separate_groups – Columns in the input_data or root_data DataFrame to keep separate. That is, for a particular column, two rows with the same value cannot be in different splits. Defaults to (“image_id”,).

  • keep_balanced_groups – Columns or groups (as defined in the input_data or root_data DataFrames) to keep balanced. That is, for a particular column, the distribution of values is kept as similar as possible between the original DataFrame and each of its splits. Defaults to (“category_id”,).

  • keep_balanced_groups_weights – Importance of each group to keep balanced when computing histogram cost. If not None, must be of the same size as keep_balanced_groups. Defaults to None.

  • inplace – If set, will modify dataframes inplace. This can silently modify some objects (like Datasets) that use them. Defaults to False.

  • split_at_root_level – If set, will compute split sizes (and thus share distances) at root level, i.e. regarding sizes in the root_data dataframe. As a consequence, the split column name will be added to keep_separate_input_groups if it’s not already in it, and the number of rows in the input data per row in root data will not have any influence on the share cost.

  • hist_cost_weight – importance of histogram cost for balanced groups. The higher, the more important the histogram cost will be for the decision of where to put each split. Defaults to 1.

  • share_cost_weight – importance of share cost for balanced groups. The higher, the more important the share cost will be for the decision of where to put each split. Defaults to 1.

  • earth_mover_regularization – Regularization parameter applied to Sinkhorn’s algorithm during earth mover distance computation. See earth_mover_distance(). Defaults to 0.

Returns:

new annotation and root_data with the split column populated with the corresponding split name.
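
Once the split column is populated, the two guarantees described above (separate groups stay together, split sizes approach the target shares) can be verified directly. The helper below is a hypothetical sanity check written with plain pandas; it is not part of this module, and the small annotations frame is made-up data containing one deliberate leak.

```python
import pandas as pd


def check_split(data, split_column_name="split",
                keep_separate_groups=("image_id",)):
    """Return observed split shares and, per column, values spanning several splits."""
    # Fraction of rows assigned to each split.
    shares = data[split_column_name].value_counts(normalize=True)
    leaks = {}
    for column in keep_separate_groups:
        # Number of distinct splits each value of the column appears in.
        n_splits = data.groupby(column)[split_column_name].nunique()
        leaks[column] = n_splits[n_splits > 1].index.tolist()
    return shares, leaks


# Made-up frame: image 1 is wrongly spread over "valid" and "train".
annotations = pd.DataFrame({
    "image_id": [0, 0, 1, 1, 2],
    "category_id": [1, 2, 1, 1, 2],
    "split": ["train", "train", "valid", "train", "train"],
})
shares, leaks = check_split(annotations)
```

An empty list for every column in leaks means no group was divided between splits, and shares can be compared against target_split_shares after normalization.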