testing

Set of functions for testing assertions on datasets. Particularly useful in unit tests.

Functions

assert_bounding_boxes_well_formed

Assert bounding boxes are well-formed in dataset's annotations.

assert_column

Given an input dataframe and a boolean series of the same length, raise an error if the series contains at least one False value. The error message shows the row of the input dataframe corresponding to the first False value in the assertion series.

assert_columns_properly_normalized

Checks that the columns of the input dataframe are properly normalized, i.e. that if column 'A' exists, column 'A.B' does not exist.

assert_dataset_equal

Compare two datasets and raise an assertion error if datasets are not equal.

assert_frame_intersections_equal

Construct inner dataframes from overlapping ids and columns and check they are equal

assert_ids_well_formed

Assert ids follow the right convention.

assert_images_valid

Checks that the image paths in the dataset are valid.

assert_label_map_well_formed

Assert that the label map has no duplicate category names.

assert_required_columns_present

Simple function to check that required columns are present and raise a custom error if it's not the case

full_check_dataset_detection

Perform a full check of the dataset.

get_invalid_images

Checks the dataset's images and returns an indexed error report to retrieve the faulty ones.

get_malformed_bounding_boxes

Get malformed bounding boxes in the dataset's annotations, as a boolean dataframe whose index is the id of the bounding box in the dataset's annotations dataframe, and whose columns are the known reasons for a bounding box to be invalid.

Exceptions

exception InvalidImage

exception MissingImages

assert_bounding_boxes_well_formed(dataset: Dataset, allow_keypoints: bool = False) → None

Assert bounding boxes are well-formed in dataset’s annotations.

  • Box x and y coordinates must lie within the size of their respective image

  • Box width and height must be positive, and such that xmax and ymax also lie within the size of their respective image

  • In the case of keypoints, boxes with size 0 are tolerated

Parameters:
  • dataset – Dataset to test

  • allow_keypoints – If set to True, no error is raised when a bounding box's width or height is 0. Defaults to False.

assert_column(input_df: DataFrame, assertion: Series | ndarray, message: str = '', n_first_occurrences: int | None = 1) → None

Given an input dataframe and a boolean series of the same length, raise an error if the series contains at least one False value. The error message shows the row of the input dataframe corresponding to the first False value in the assertion series.

Parameters:
  • input_df – Dataframe to show the row from, to better understand what went wrong

  • assertion – Boolean Series of the same length as input_df, expected to contain only True values

  • message – Message to display when raising the error. It is followed by information about the faulty rows

  • n_first_occurrences – Number of occurrences to show in case of a failure. Useful when showing duplicate values. If set to None, will show all occurrences.

Raises:

AssertionError – If the assertion Series contains at least one False value. The error message includes the row of input_df corresponding to the first occurrence.
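
The behaviour described above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation: sketch_assert_column is a hypothetical name, and the exact layout of the error message is an assumption.

```python
import numpy as np
import pandas as pd


def sketch_assert_column(input_df, assertion, message="", n_first_occurrences=1):
    """Hypothetical re-implementation of the behaviour described above."""
    # Accept either a boolean Series or a boolean ndarray.
    mask = np.asarray(assertion, dtype=bool)
    if mask.all():
        return
    # Rows of input_df where the assertion is False, in their original order.
    faulty = input_df.loc[~mask]
    if n_first_occurrences is not None:
        faulty = faulty.head(n_first_occurrences)
    # Assumed message layout: the user message followed by the faulty rows.
    raise AssertionError(f"{message}\n{faulty}")


df = pd.DataFrame({"width": [10, -3, 5, -1]}, index=pd.Index([0, 1, 2, 3], name="id"))

try:
    sketch_assert_column(df, df["width"] > 0, "Widths must be positive")
except AssertionError as error:
    # Only the first faulty row (id 1) is reported by default.
    report = str(error)
```

Passing n_first_occurrences=None in this sketch would list both faulty rows, which mirrors the documented use case of showing every duplicate value.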

assert_columns_properly_normalized(input_df: DataFrame, separator: str = '.') → None

Checks that the columns of the input dataframe are properly normalized, i.e. that if column ‘A’ exists, column ‘A.B’ does not exist.

This is useful when loading JSON files, to check that a key is not both a sub-dictionary and a value.

Parameters:
  • input_df – Input DataFrame to test

  • separator – Character used to separate name in flattened key. Defaults to “.”.

Raises:

AssertionError – If a column name exists both on its own and as a prefix followed by the separator (name + separator).
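
To illustrate how such a conflict can arise, the sketch below flattens two JSON-like records with pandas.json_normalize and then looks for column names that are also used as a prefix. The helper find_normalization_conflicts is hypothetical and only mirrors the rule stated above; it is not the library's code.

```python
import pandas as pd


def find_normalization_conflicts(columns, separator="."):
    """Hypothetical helper: names used both as a value and as a prefix."""
    names = set(columns)
    conflicts = set()
    for name in names:
        parts = name.split(separator)
        # No strict prefix of a flattened key may itself be a column.
        for depth in range(1, len(parts)):
            prefix = separator.join(parts[:depth])
            if prefix in names:
                conflicts.add(prefix)
    return sorted(conflicts)


# "A" is a plain value in the first record and a sub-dictionary in the second.
records = [{"A": 1}, {"A": {"B": 2}}]
flat = pd.json_normalize(records)  # columns: "A" and "A.B"

conflicts = find_normalization_conflicts(flat.columns)
```

A check in the spirit of assert_columns_properly_normalized would then fail whenever the returned list is not empty.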

assert_dataset_equal(dataset1: Dataset, dataset2: Dataset, ignore_index: bool = False, optional_columns: Iterable[str] = ('area', 'confidence'), remove_na_columns: bool = False) → None

Compare two datasets and raise an assertion error if datasets are not equal. This function is mainly intended to be used in the context of unit tests.

Rules:
  • Index order is not relevant. This is similar to the check_like option in pandas.testing.assert_frame_equal()

  • Row and column indexes must still be identical once reordered

  • Some columns in annotations are optional and are thus ignored if present in one but not the other dataset. If both are present, the columns’ values are still compared.

  • Label maps must be the same. Again, order is ignored (as it normally is for dictionaries)

  • If the ignore_index option is set to True, row indexes are not checked, but we still check that the image_id key in annotations points to the same rows in the images dataframe

Parameters:
  • dataset1 – First dataset to test

  • dataset2 – Second dataset to test, must be the same according to mentioned rules or the function will raise an error

  • ignore_index – If set, will ignore both annotations and images dataframe index, but will still check that link between annotations and image row with image_id is the same. Defaults to False.

  • optional_columns – Iterable of column names that will be considered optional, i.e. they are only checked if present in both datasets. Defaults to the column names “area” and “confidence”.

  • remove_na_columns – If set to True, will remove from the dataframes the columns where all values are equivalent to pandas’ <NA>. This more lenient comparison is useful for columns whose absence is treated the same as all of their values being <NA>, like the split column.

Raises:

AssertionError – raised when datasets are detected to be different
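
Two of the rules above (order-insensitive comparison and optional columns) can be sketched on bare dataframes. This is an illustration under assumptions rather than the function's actual code: sketch_compare_frames is a hypothetical helper, and only the documented check_like option of pandas.testing.assert_frame_equal is relied upon.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def sketch_compare_frames(df1, df2, optional_columns=("area", "confidence")):
    """Hypothetical helper mirroring two of the rules listed above."""
    # Optional columns are ignored when only one of the frames carries them.
    only_in_one = set(df1.columns) ^ set(df2.columns)
    droppable = only_in_one & set(optional_columns)
    df1 = df1.drop(columns=[c for c in df1.columns if c in droppable])
    df2 = df2.drop(columns=[c for c in df2.columns if c in droppable])
    # check_like=True ignores the order of index and columns, not their content.
    assert_frame_equal(df1, df2, check_like=True)


annotations1 = pd.DataFrame(
    {"image_id": [0, 0, 1], "category_id": [1, 2, 1], "area": [4.0, 9.0, 1.0]},
    index=pd.Index([10, 11, 12], name="id"),
)
# Same rows in a different order, without the optional "area" column.
annotations2 = annotations1.drop(columns="area").loc[[12, 10, 11]]

sketch_compare_frames(annotations1, annotations2)  # passes silently
```

Changing any shared value, for example a category_id, makes the same call raise an AssertionError.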

assert_frame_intersections_equal(df1: DataFrame, df2: DataFrame) → None

Construct inner dataframes from overlapping ids and columns and check that they are equal.

These are the rows and columns present in both images dataframes. The two dataframes must have the same values there for the merge to be valid.

Parameters:
  • df1 – First dataframe to test

  • df2 – Second dataframe to test

Raises:

AssertionError – If the two sub-dataframes built from the intersection of indexes and columns are not equal.
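
A rough pandas equivalent of this check is sketched below. The helper name sketch_assert_intersections_equal is hypothetical, and the real function may differ in how it reports differences.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def sketch_assert_intersections_equal(df1, df2):
    """Hypothetical equivalent of the check described above."""
    shared_rows = df1.index.intersection(df2.index)
    shared_columns = df1.columns.intersection(df2.columns)
    # Compare only the part of the data that both frames describe.
    assert_frame_equal(
        df1.loc[shared_rows, shared_columns],
        df2.loc[shared_rows, shared_columns],
    )


images1 = pd.DataFrame(
    {"relative_path": ["a.jpg", "b.jpg"], "width": [640, 800]},
    index=pd.Index([0, 1], name="id"),
)
images2 = pd.DataFrame(
    {"relative_path": ["b.jpg", "c.jpg"], "height": [600, 480]},
    index=pd.Index([1, 2], name="id"),
)

# Only row 1 and column "relative_path" are shared, and they agree.
sketch_assert_intersections_equal(images1, images2)
```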

assert_ids_well_formed(dataset: Dataset) → None

Assert ids follow the right convention.

  • DataFrames indexes must be named “id”

  • indexes must have no duplicates

  • images relative_path column must have no duplicates

  • annotation image_id values must all be in images index

  • annotation category_id values must be in dataset’s label map

Note

Todo: Better error messages

Parameters:

dataset – Dataset object to test.
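
The conventions listed above can be expressed as a handful of pandas checks. The sketch below works on bare dataframes and a plain dictionary instead of a Dataset object; sketch_id_problems is a hypothetical helper, and representing the label map as a dictionary from category id to name is an assumption.

```python
import pandas as pd


def sketch_id_problems(images, annotations, label_map):
    """Hypothetical helper listing which of the conventions above are broken."""
    problems = []
    if images.index.name != "id" or annotations.index.name != "id":
        problems.append("index not named 'id'")
    if images.index.has_duplicates or annotations.index.has_duplicates:
        problems.append("duplicated index")
    if images["relative_path"].duplicated().any():
        problems.append("duplicated relative_path")
    if not annotations["image_id"].isin(images.index).all():
        problems.append("image_id not in images index")
    if not annotations["category_id"].isin(list(label_map)).all():
        problems.append("category_id not in label map")
    return problems


images = pd.DataFrame(
    {"relative_path": ["a.jpg", "b.jpg"]}, index=pd.Index([0, 1], name="id")
)
annotations = pd.DataFrame(
    {"image_id": [0, 1, 5], "category_id": [1, 2, 1]},
    index=pd.Index([0, 1, 2], name="id"),
)
label_map = {1: "cat", 2: "dog"}  # assumed mapping: category id to name

# The third annotation points to image 5, which is not in the images index.
problems = sketch_id_problems(images, annotations, label_map)
```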

assert_images_valid(dataset: Dataset, assert_is_symlink: bool = False, load_images: bool = True, check_exhaustive: bool = False) → None

Checks that the image paths in the dataset are valid. Namely, checks that every path points to a file, and that each file has a valid format that can be loaded with imageio.

Note

Todo: better error messages

Parameters:
  • dataset – Dataset to check

  • assert_is_symlink – If set, will check that paths are symlinks rather than files. Defaults to False.

  • load_images – If set to True, will check not only that images are valid files, but also that they can be loaded (i.e. are not corrupted) and that their sizes match the ones recorded in the dataset.images dataframe. Note that this makes the function significantly slower. Defaults to True.

  • check_exhaustive – If set to True, will check that all images in the images_root folder are in the image dataframe, and that the dataset is indeed exhaustive

assert_label_map_well_formed(dataset: Dataset) → None

Assert that the label map has no duplicate category names.

Parameters:

dataset – dataset to test.
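
As a small illustration of what a duplicate category name looks like, the sketch below assumes the label map is a dictionary from category id to category name. The helper duplicated_category_names is hypothetical.

```python
from collections import Counter


def duplicated_category_names(label_map):
    """Hypothetical helper: category names carried by more than one id."""
    counts = Counter(label_map.values())
    return sorted(name for name, count in counts.items() if count > 1)


# Ids 0 and 2 share the same name, which makes the label map ambiguous.
label_map = {0: "person", 1: "car", 2: "person"}
duplicates = duplicated_category_names(label_map)
```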

assert_required_columns_present(input_df: DataFrame, required_columns: set[str], df_name: str) → None

Simple function to check that required columns are present and raise a custom error if it’s not the case

Parameters:
  • input_df – dataframe object to check.

  • required_columns – set of column names to find in the columns of input_df.

  • df_name – name of the dataframe, used to add context to the error message.

Raises:

ValueError – Raised when not all required columns are present in the columns of input_df.
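
A minimal sketch of such a check is shown below. The function name and the wording of the error message are assumptions; only the ValueError and the role of df_name come from the description above.

```python
import pandas as pd


def sketch_assert_required_columns(input_df, required_columns, df_name):
    """Hypothetical version of the check described above."""
    missing = set(required_columns) - set(input_df.columns)
    if missing:
        # df_name gives context on which dataframe is incomplete.
        raise ValueError(
            f"Dataframe '{df_name}' is missing required columns: {sorted(missing)}"
        )


images = pd.DataFrame({"relative_path": ["a.jpg"], "width": [640]})

try:
    sketch_assert_required_columns(images, {"relative_path", "width", "height"}, "images")
except ValueError as error:
    error_message = str(error)
```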

full_check_dataset_detection(dataset: Dataset, check_symlink: bool = False, allow_keypoints: bool = False, check_exhaustive: bool = False) → None

Perform a full check of the dataset. Images must be reachable for the check to run.

Parameters:
  • dataset – dataset to test

  • check_symlink – If set to True, will check that image relative paths are indeed relative links and not actual files. Defaults to False.

  • allow_keypoints – If set to True, will not raise an error for bounding boxes with size 0 (width or height). Defaults to False.

  • check_exhaustive – If set to True, will check that all images in the images_root folder are in the image dataframe, and that the dataset is indeed exhaustive

get_invalid_images(dataset: Dataset, check_symlink: bool = False, load_images: bool = True, check_exhaustive: bool = False, raise_if_error: bool = True) → DataFrame

Checks the dataset’s images and returns an indexed error report to retrieve the faulty ones.

Namely, checks that every path points to a file, and that each file has a valid format that can be loaded with imageio. For each image that fails, a row is added to the output dataframe with the same index as the faulty image, with information about the error in the corresponding columns.

Parameters:
  • dataset – Dataset to check

  • check_symlink – If set, will check that paths are symlinks rather than files. Defaults to False.

  • load_images – If set to True, will check not only that images are valid files, but also that they can be loaded (i.e. are not corrupted) and that their sizes match the ones recorded in the dataset.images dataframe. Note that this makes the function significantly slower. Defaults to True.

  • check_exhaustive – If set to True, will check that all images in the images_root folder are in the image dataframe, and that the dataset is indeed exhaustive

  • raise_if_error – If set to True, will raise an InvalidImage error as soon as one image does not meet the requirements.

Raises:
  • InvalidImage – Raised if raise_if_error is set and one image is not valid. This can be because the path is wrong, the image loading failed, or the metadata does not match the actual image data.

  • MissingImages – Raised if raise_if_error is set and some images were found in the images_root folder but not in the dataset’s images dataframe.

Returns:

Error report in the form of a DataFrame with “reason” and “additional_info” columns. Index values are the same as those of the corresponding images in the original dataset, so that you can retrieve the full data of the faulty images.
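
The shape of this report, and how its index leads back to the faulty images, can be sketched without the library. The example below only checks that files exist (no image loading, no symlink or exhaustiveness check); sketch_missing_files_report and the "not a file" reason string are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def sketch_missing_files_report(images_root, images):
    """Hypothetical, existence-only version of the report described above."""
    ids, records = [], []
    for image_id, relative_path in images["relative_path"].items():
        full_path = Path(images_root) / relative_path
        if not full_path.is_file():
            ids.append(image_id)
            records.append({"reason": "not a file", "additional_info": str(full_path)})
    # The report reuses the ids of the faulty images as its index.
    return pd.DataFrame(
        records,
        index=pd.Index(ids, name=images.index.name),
        columns=["reason", "additional_info"],
    )


with tempfile.TemporaryDirectory() as images_root:
    (Path(images_root) / "a.jpg").touch()  # only the first image exists on disk
    images = pd.DataFrame(
        {"relative_path": ["a.jpg", "b.jpg"]}, index=pd.Index([7, 8], name="id")
    )
    report = sketch_missing_files_report(images_root, images)
    # The shared index gives direct access to the full rows of the faulty images.
    faulty_images = images.loc[report.index]
```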

get_malformed_bounding_boxes(dataset: Dataset, allow_keypoints: bool = False, raise_if_error: bool = False) → DataFrame

Get malformed bounding boxes in the dataset’s annotations, as a boolean dataframe whose index is the id of the bounding box in the dataset’s annotations dataframe, and whose columns are the known reasons for a bounding box to be invalid.

  • Box x and y coordinates must lie within the size of their respective image

  • Box width and height must be positive, and such that xmax and ymax also lie within the size of their respective image

  • In the case of keypoints, boxes with size 0 are tolerated

An invalid bounding box then corresponds to a row of the result dataframe in which at least one value is True. Note that valid bounding boxes are NOT in the result dataframe: if the dataset has no invalid bounding box, the result dataframe is empty, and every row it does contain has at least one True value.

Parameters:
  • dataset – Dataset to test

  • allow_keypoints – If set to True, no error is raised when a bounding box's width or height is 0. Defaults to False.

  • raise_if_error – If set to True, will raise an error as soon as one bounding box is detected to be invalid. Defaults to False.

Raises:

AssertionError – When raise_if_error is set, raise an error as soon as one bounding box is invalid.

Returns:

Error report as a dataframe with boolean columns.

  • Each column is a reason why the bounding box can be faulty.

  • Each row is a faulty bounding box, with its corresponding index in the dataset’s annotations dataframe. Its values explain how the bounding box is invalid.

  • Only the faulty bounding boxes are kept in the error report, so all rows have at least one value set to True.
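
The structure of this report can be sketched with pandas alone. Everything named in the example is an assumption made for illustration: the box columns (box_x_min, box_y_min, box_width, box_height), the reason columns, and the helper sketch_malformed_boxes itself are hypothetical and may not match the library.

```python
import pandas as pd


def sketch_malformed_boxes(images, annotations, allow_keypoints=False):
    """Hypothetical report builder; all column names here are assumptions."""
    # Look up the size of the image each box belongs to.
    image_width = annotations["image_id"].map(images["width"])
    image_height = annotations["image_id"].map(images["height"])
    x_max = annotations["box_x_min"] + annotations["box_width"]
    y_max = annotations["box_y_min"] + annotations["box_height"]
    if allow_keypoints:
        # Size 0 is tolerated, only negative sizes are faulty.
        bad_size = (annotations["box_width"] < 0) | (annotations["box_height"] < 0)
    else:
        bad_size = (annotations["box_width"] <= 0) | (annotations["box_height"] <= 0)
    report = pd.DataFrame(
        {
            "x_out_of_image": (annotations["box_x_min"] < 0) | (x_max > image_width),
            "y_out_of_image": (annotations["box_y_min"] < 0) | (y_max > image_height),
            "non_positive_size": bad_size,
        }
    )
    # Keep only faulty boxes, so every remaining row has at least one True.
    return report[report.any(axis=1)]


images = pd.DataFrame({"width": [100], "height": [50]}, index=pd.Index([0], name="id"))
annotations = pd.DataFrame(
    {
        "image_id": [0, 0, 0],
        "box_x_min": [10, 90, 5],
        "box_y_min": [10, 10, 5],
        "box_width": [20, 20, 0],
        "box_height": [20, 20, 0],
    },
    index=pd.Index([0, 1, 2], name="id"),
)

# Box 1 exceeds the image width; box 2 has size 0; box 0 is valid and is left out.
report = sketch_malformed_boxes(images, annotations)
```

With allow_keypoints=True, the zero-size box is no longer reported in this sketch, which leaves only the box that exceeds the image width.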