dataset_splitter
Functions
get_winner | Get the best split, i.e. the one with the lowest cost, from series of precomputed costs. |
simple_split_dataframe | Simple version of splitting method, splitting unassigned rows randomly. |
split_dataframe | Perform the split operation on input_data and root_data. |
- check_split_target(split_names: Sequence[str], target_split_shares: Sequence[float]) -> Series
- get_winner(split_hists: DataFrame | None, split_hists_distances: Series | None, candidate_hist: Series | None, split_sizes: Series, candidate_size: int, hist_cost_function: Callable[[Series], float], share_cost_function: Callable[[Series], float], hist_cost_weight: float = 1, share_cost_weight: float = 1) -> tuple[str, DataFrame | None, Series | None, Series]
Get the best split, i.e. the one with the lowest cost, from series of precomputed costs. The series are the histogram costs, i.e. the distribution distances for the values the user wishes to be evenly distributed between splits, and the share costs, i.e. the IOU distance between …
The result is then the key of the dictionary with the lowest consolidated cost. A special case is when all distribution costs are infinite. In that case, only consider the share cost.
- Parameters:
split_hists – DataFrame containing the current histograms of splits. Columns are splits, and rows are histogram bins
split_hists_distances – Series containing the cached distance values of distance between the split hist and the target histogram. If set to None, will recompute them
candidate_hist – Series containing the histogram of the candidate atom. Rows are the same as split_hists.
split_sizes – Series containing the sizes of each split. Each row is a split.
candidate_size – size of the current atom. Depending on how the split is done, it is not necessarily the same as the sum of the candidate histogram.
hist_cost_function – function that computes a score for a dataframe of histograms. This will be used to compute the histogram cost for each split if the atom was to be assigned to it.
share_cost_function – function that computes a score for dataset repartition against a target split share. This is used to compute the cost of assigning the candidate atom to each split.
hist_cost_weight – weight applied to histogram cost to choose the winner split. The higher, the more important the histogram cost will be for the decision. Defaults to 1.
share_cost_weight – weight applied to share cost to choose the winner split. The higher, the more important the share cost will be for the decision. Defaults to 1.
- Returns:
- A tuple with 4 elements
name of the winning split
updated split histograms, as a DataFrame similar to split_hists (None if the given split_hists was None)
updated split hist costs, as a Series similar to split_hists_distances (None if the given split_hists was None)
updated share of splits, as a Series similar to split_shares
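The consolidation described for get_winner can be illustrated with a minimal sketch. It assumes the consolidated cost is a weighted sum of the two costs, which this page does not state explicitly. The name pick_winner and the use of plain dictionaries instead of pandas Series are hypothetical, chosen to keep the example self-contained.

```python
import math


def pick_winner(hist_costs, share_costs, hist_cost_weight=1.0, share_cost_weight=1.0):
    """Return the split name with the lowest consolidated cost.

    Hypothetical helper: both arguments map split names to precomputed costs.
    The weighted sum is an assumption, not the library's confirmed formula.
    """
    # Special case from the description above: when every histogram cost
    # is infinite, only the share cost is considered.
    if all(math.isinf(cost) for cost in hist_costs.values()):
        consolidated = dict(share_costs)
    else:
        consolidated = {
            name: hist_cost_weight * hist_costs[name]
            + share_cost_weight * share_costs[name]
            for name in share_costs
        }
    # The winner is the key with the lowest consolidated cost.
    return min(consolidated, key=consolidated.get)


hist = {"train": 0.1, "valid": 0.5}
share = {"train": 0.3, "valid": 0.2}
print(pick_winner(hist, share))                        # train
print(pick_winner(hist, share, share_cost_weight=10))  # valid
```

Raising share_cost_weight makes the share cost dominate the decision, which is the trade-off the two weight parameters expose.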
- simple_split_dataframe(input_data: DataFrame, input_seed: int = 0, split_names: Sequence[str] = ('train', 'valid'), target_split_shares: Sequence[float] = (0.8, 0.2), inplace: bool = False) -> DataFrame
Simple version of splitting method, splitting unassigned rows randomly.
Note
If target split shares and already assigned rows are incompatible, a warning will be issued, and the splitting process will continue using relative target shares for remaining splits instead.
- Parameters:
input_data – DataFrame to assign split values.
input_seed – Random seed for splitting images. Defaults to 0.
split_names – Names of splits. Must be more than 1 element long and the same size as target_split_shares. Defaults to ("train", "valid").
target_split_shares – Share values of each split. Must be the same size as split_names. Must add up to 1. Defaults to (0.8, 0.2).
inplace – If set to True, will perform the splitting inplace without creating a new dataset. Defaults to False.
- Returns:
DataFrame with new splits applied to its split column.
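The idea behind simple_split_dataframe can be sketched without pandas. The example below is only an illustration under stated assumptions: the helper name simple_random_split is hypothetical, the split column is modelled as a plain list where None marks an unassigned row, and the way remaining rows are distributed (proportionally to what each split still needs) is a guess at the behaviour, not the library's implementation.

```python
import random


def simple_random_split(splits, split_names=("train", "valid"),
                        target_split_shares=(0.8, 0.2), seed=0):
    """Randomly assign the None entries of `splits`, keeping existing values."""
    rng = random.Random(seed)
    result = list(splits)
    unassigned = [i for i, value in enumerate(result) if value is None]
    rng.shuffle(unassigned)

    total = len(result)
    # Rows each split still needs to reach its target share of the whole set.
    needs = [max(share * total - result.count(name), 0.0)
             for name, share in zip(split_names, target_split_shares)]
    need_sum = sum(needs)
    # Relative shares applied to the remaining (unassigned) rows only.
    relative = ([need / need_sum for need in needs] if need_sum > 0
                else list(target_split_shares))

    start, cumulative = 0, 0.0
    for position, (name, share) in enumerate(zip(split_names, relative)):
        cumulative += share
        is_last = position == len(split_names) - 1
        end = len(unassigned) if is_last else round(cumulative * len(unassigned))
        for index in unassigned[start:end]:
            result[index] = name
        start = end
    return result


labels = ["train", "train"] + [None] * 8
print(simple_random_split(labels))
```

With two rows already in "train" and eight unassigned rows, six more go to "train" and two to "valid", so the final shares match (0.8, 0.2).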
- split_dataframe(input_data: DataFrame, root_data: DataFrame, key_to_root: str = 'image_id', input_seed: int = 0, split_names: Sequence[str] = ('train', 'valid'), target_split_shares: Sequence[float] = (0.8, 0.2), split_column_name: str = 'split', keep_separate_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('image_id',), keep_balanced_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('category_id',), keep_balanced_groups_weights: Sequence[float] | None = None, inplace: bool = False, split_at_root_level: bool = False, hist_cost_weight: float = 1, share_cost_weight: float = 1, earth_mover_regularization: float = 0) -> tuple[DataFrame, DataFrame]
- split_dataframe(input_data: DataFrame, root_data: None = None, key_to_root: str = 'image_id', input_seed: int = 0, split_names: Sequence[str] = ('train', 'valid'), target_split_shares: Sequence[float] = (0.8, 0.2), split_column_name: str = 'split', keep_separate_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('image_id',), keep_balanced_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('category_id',), keep_balanced_groups_weights: Sequence[float] | None = None, inplace: bool = False, split_at_root_level: bool = False, hist_cost_weight: float = 1, share_cost_weight: float = 1, earth_mover_regularization: float = 0) -> DataFrame
- split_dataframe(input_data: DataFrame, root_data: DataFrame | None = None, key_to_root: str = 'image_id', input_seed: int = 0, split_names: Sequence[str] = ('train', 'valid'), target_split_shares: Sequence[float] = (0.8, 0.2), split_column_name: str = 'split', keep_separate_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('image_id',), keep_balanced_groups: str | ContinuousGroup | Sequence[str | ContinuousGroup] = ('category_id',), keep_balanced_groups_weights: Sequence[float] | None = None, inplace: bool = False, split_at_root_level: bool = False, hist_cost_weight: float = 1, share_cost_weight: float = 1, earth_mover_regularization: float = 0) -> DataFrame | tuple[DataFrame, DataFrame]
Perform the split operation on input_data and root_data.
This algorithm works in 2 steps:
- Divide the dataframe into atomic sub frames. Given the image and annotation attributes that need to be kept separate, we can construct sub frames of elements that cannot be in different splits.
- Construct the split dataframes iteratively by trying to keep the given column values balanced between splits, while keeping split sizes as close to the target shares as possible. Each atomic sub frame is routed to the split that minimizes a cost function which tries to optimize these repartition targets.
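Step 1 above amounts to finding groups of rows linked by a shared value in any of the columns to keep separate. A minimal sketch of that grouping, using a union-find structure, is given below. The helper name atomic_groups and the list-of-dicts input are hypothetical, and the library may build its atomic sub frames differently.

```python
def atomic_groups(rows, keep_separate):
    """Group row indices that share a value in any `keep_separate` column.

    Rows in the same group cannot be placed in different splits.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (column, value) -> first row index holding that value
    for i, row in enumerate(rows):
        for column in keep_separate:
            key = (column, row[column])
            if key in first_seen:
                root_a, root_b = find(i), find(first_seen[key])
                # Union: attach the larger root to the smaller one.
                parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_seen[key] = i

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


rows = [
    {"image_id": 1, "video_id": "a"},
    {"image_id": 1, "video_id": "a"},
    {"image_id": 2, "video_id": "a"},
    {"image_id": 3, "video_id": "b"},
    {"image_id": 4, "video_id": "c"},
    {"image_id": 4, "video_id": "c"},
]
print(atomic_groups(rows, ("image_id",)))             # [[0, 1], [2], [3], [4, 5]]
print(atomic_groups(rows, ("image_id", "video_id")))  # [[0, 1, 2], [3], [4, 5]]
```

Adding a second column to keep separate (here the made-up video_id) merges groups that were independent under image_id alone, which leaves fewer and larger atoms to route in step 2.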
- Parameters:
input_data – DataFrame containing input_data information. Must contain at least the column given in key_to_root.
root_data – DataFrame containing image information. Its index must contain all values contained in the image_id column of the input_data DataFrame.
key_to_root – name of the column in input that refers to id in root data dataframe. Defaults to “image_id”.
input_seed – Seed used for shuffling sub frames before beginning step 2 of splitting algorithm. Defaults to 0.
split_names – Names of splits. Must be the same length as target_split_shares. Defaults to (“train”, “valid”).
target_split_shares – List of relative size of each split. Must be the same length as split_names, and will be normalized so that its sum is 1. Defaults to (0.8, 0.2).
split_column_name – Name of the column where the split value of dataset will be read and written. Defaults to “split”.
keep_separate_groups – columns in input_data or root_data DataFrame to keep separate. That is, for a particular column, two rows with the same value cannot be in different splits. Defaults to (“image_id”,).
keep_balanced_groups – columns or groups, as defined in the input_data or root_data DataFrames, to keep balanced. That is, for a particular column, the distribution of values is kept as close as possible between the original DataFrame and its splits. Defaults to (“category_id”,).
keep_balanced_groups_weights – Importance of each group to keep balanced when computing histogram cost. If not None, must be of the same size as keep_separate_groups. Defaults to None.
inplace – If set, will modify dataframes inplace. This can silently modify some objects (like Datasets) that use them. Defaults to False.
split_at_root_level – If set, will compute split sizes (and thus share distances) at root level, i.e. regarding sizes in the root_data dataframe. As a consequence, the split column name will be added to keep_separate_input_groups if it’s not already in it, and the number of rows in the input data per row in root data will not have any influence on the share cost.
hist_cost_weight – importance of histogram cost for balanced groups. The higher, the more important the histogram cost will be for the decision of where to put each split. Defaults to 1.
share_cost_weight – importance of share cost for balanced groups. The higher, the more important the share cost will be for the decision of where to put each split. Defaults to 1.
earth_mover_regularization – Regularization parameter applied to Sinkhorn’s algorithm during earth mover distance computation. See earth_mover_distance(). Defaults to 0.
- Returns:
new annotation and root_data with the split column populated with the corresponding split name.
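The iterative routing of step 2 can be sketched as a greedy loop. To stay short, the example below uses a share cost only and leaves out the histogram term. The cost used here (sum of absolute differences between the resulting shares and the target shares) is an assumption made for illustration and is not the IOU-based cost of the library. The name greedy_route is hypothetical.

```python
def greedy_route(atom_sizes, split_names=("train", "valid"),
                 target_split_shares=(0.8, 0.2)):
    """Assign each atom (given by its positive size) to the cheapest split."""
    sizes = {name: 0 for name in split_names}
    assignment = []
    for atom_size in atom_sizes:
        best_name, best_cost = None, float("inf")
        for name in split_names:
            # Split sizes if the atom were routed to `name`.
            trial = dict(sizes)
            trial[name] += atom_size
            total = sum(trial.values())
            # Assumed share cost: distance between resulting and target shares.
            cost = sum(abs(trial[other] / total - share)
                       for other, share in zip(split_names, target_split_shares))
            if cost < best_cost:
                best_name, best_cost = name, cost
        sizes[best_name] += atom_size
        assignment.append(best_name)
    return assignment, sizes


assignment, sizes = greedy_route([1] * 10)
print(sizes)  # {'train': 8, 'valid': 2}
```

Whether an atom's size counts rows of input_data or rows of root_data is what split_at_root_level controls. In this sketch it would only change the numbers passed as atom_sizes.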