testing
Set of functions used to check assertions on datasets. Useful in unit tests.
Functions
assert_bounding_boxes_well_formed | Assert bounding boxes are well-formed in dataset's annotations.
assert_column | From a given input dataframe and a boolean series of the same length, construct an error message if the series has at least one False value, showing the row of the input dataframe that corresponds to the first False value in the assertion series.
assert_columns_properly_normalized | Checks that columns in the input dataframe are well normalized, i.e. checks that if column 'A' exists, column 'A.B' does not exist.
assert_dataset_equal | Compare two datasets and raise an assertion error if datasets are not equal.
assert_frame_intersections_equal | Construct inner dataframes from overlapping ids and columns and check they are equal.
assert_ids_well_formed | Assert ids follow the right convention.
assert_images_valid | Checks the image paths in the dataset.
assert_label_map_well_formed | Assert label map has no duplicate category names.
assert_required_columns_present | Simple function to check that required columns are present, and raise a custom error if that is not the case.
full_check_dataset_detection | Perform a full check of the dataset.
get_invalid_images | Checks dataset's images and returns an indexed error report to retrieve them.
get_malformed_bounding_boxes | Get malformed bounding boxes in dataset's annotations, as a boolean dataframe where the index is the id of the bounding box in the dataset's annotations dataframe, and the columns are known reasons for bounding boxes to be invalid.
Exceptions
- assert_bounding_boxes_well_formed(dataset: Dataset, allow_keypoints: bool = False) → None
Assert bounding boxes are well-formed in dataset’s annotations.
Boxes x and y coordinates must be within their respective image size.
Boxes width and height must be positive, and such that xmax and ymax are within their respective image size.
In the case of keypoints, boxes with size 0 are tolerated.
- Parameters:
dataset – Dataset to test
allow_keypoints – If set to True, will not raise error if bounding box size (width or height) is 0. Defaults to False.
- assert_column(input_df: DataFrame, assertion: Series | ndarray, message: str = '', n_first_occurrences: int | None = 1) → None
From a given input dataframe and a boolean series of the same length, construct an error message if the series has at least one False value. The message shows the row of the input dataframe that corresponds to the first occurrence of a False value in the assertion series.
- Parameters:
input_df – Dataframe to show the row from, to better understand what went wrong
assertion – Boolean Series of the same length as input_df, expected to be full of True values
message – Message to display when raising the error. Will be followed by information about the faulty rows
n_first_occurrences – Number of occurrences to show in case of a failure. Useful when showing duplicate values. If set to None, will show all occurrences.
- Raises:
AssertionError – If there is at least one occurrence of False in the assertion Series, raise an assertion error and print the corresponding row of the first occurrence in input_df
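The behavior described for assert_column can be illustrated with a minimal pandas sketch. The helper name sketch_assert_column and the exact message layout are assumptions made for this example; it is not the library's implementation.

```python
import numpy as np
import pandas as pd


def sketch_assert_column(input_df, assertion, message="", n_first_occurrences=1):
    """Hypothetical re-implementation, for illustration only."""
    # Accept both a boolean Series and a boolean ndarray (compared by position)
    mask = np.asarray(assertion, dtype=bool)
    if mask.all():
        return
    # Rows of input_df where the assertion is False
    faulty = input_df[~mask]
    if n_first_occurrences is not None:
        faulty = faulty.head(n_first_occurrences)
    raise AssertionError(f"{message}\n{faulty}")


df = pd.DataFrame({"width": [10, -3, 5, -1]}, index=[100, 101, 102, 103])
try:
    sketch_assert_column(df, df["width"] > 0, "Widths must be positive")
except AssertionError as error:
    # Only the first faulty row (index 101) is reported by default
    error_text = str(error)
```

Passing n_first_occurrences=None in this sketch would report every faulty row instead of only the first one.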
- assert_columns_properly_normalized(input_df: DataFrame, separator: str = '.') → None
Checks that columns in the input dataframe are well normalized, i.e. checks that if column ‘A’ exists, column ‘A.B’ does not exist.
This is useful when loading json files, to check that a key cannot be both a sub-dictionary and a value.
- Parameters:
input_df – Input DataFrame to test
separator – Character used to separate names in a flattened key. Defaults to “.”.
- Raises:
AssertionError – if there exists a column name for which both the name and a variation of name + separator exist
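To make the normalization rule concrete, here is a small sketch that works on a plain iterable of column names (for a DataFrame, these would come from input_df.columns). The function name and the error message are hypothetical.

```python
def sketch_assert_columns_normalized(columns, separator="."):
    """Hypothetical check on an iterable of column names."""
    names = set(columns)
    # A clash is a name that is also used as a prefix: 'A' together with 'A.B'
    clashes = sorted(
        name
        for name in names
        if any(other.startswith(name + separator) for other in names)
    )
    if clashes:
        raise AssertionError(f"Columns used both as value and as prefix: {clashes}")


# Flattened json keys without clash: passes silently
sketch_assert_columns_normalized(["id", "bbox.x", "bbox.y"])

try:
    # 'attributes' is both a value and a sub-dictionary here
    sketch_assert_columns_normalized(["attributes", "attributes.color"])
except AssertionError as error:
    clash_message = str(error)
```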
- assert_dataset_equal(dataset1: Dataset, dataset2: Dataset, ignore_index: bool = False, optional_columns: Iterable[str] = ('area', 'confidence'), remove_na_columns: bool = False) → None
Compare two datasets and raise an assertion error if datasets are not equal. This function is mainly intended to be used in the context of unit tests.
- Rules:
Index order is not relevant. This is similar to the check_like option in pandas.testing.assert_frame_equal(). Indexes for rows and columns still must be the same when reordered.
Some columns in annotations are optional and are thus ignored if present in one but not the other dataset. If both are present, the columns’ values are still compared.
Label maps must be the same. Again, order is ignored (as it normally is for dictionaries).
If the ignore_index option is set to True, indexes for rows are not checked, but we still check that the key in annotations’ image_id points to the same rows in the images dataframe.
- Parameters:
dataset1 – First dataset to test
dataset2 – Second dataset to test, must be the same according to the rules above or the function will raise an error
ignore_index – If set, will ignore both annotations and images dataframe index, but will still check that the link between annotations and image rows with image_id is the same. Defaults to False.
optional_columns – Iterable of column names that will be considered as optional, i.e. only checked if present in both datasets. Defaults to the column names “area” and “confidence”.
remove_na_columns – If set to True, will remove from dataframes the columns where all values are equivalent to pandas’ <NA>. This more lenient comparison is useful for columns whose absence and whose values being all <NA> are treated the same, like the split column. Defaults to False.
- Raises:
AssertionError – raised when datasets are detected to be different
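Two of the rules above (order-insensitive comparison and optional columns) can be sketched on plain annotation dataframes. The helper sketch_compare_annotations is hypothetical and only covers these two rules; it does not handle label maps, ignore_index or remove_na_columns.

```python
import pandas as pd


def sketch_compare_annotations(df1, df2, optional_columns=("area", "confidence")):
    """Hypothetical helper illustrating two of the comparison rules."""
    # Drop optional columns that are present in only one of the two frames
    for column in optional_columns:
        if (column in df1.columns) != (column in df2.columns):
            df1 = df1.drop(columns=column, errors="ignore")
            df2 = df2.drop(columns=column, errors="ignore")
    # check_like=True ignores the order of index and columns
    pd.testing.assert_frame_equal(df1, df2, check_like=True)


first = pd.DataFrame(
    {"image_id": [1, 2], "category_id": [0, 1], "area": [4.0, 9.0]},
    index=pd.Index([10, 11], name="id"),
)
# Same rows in another order, without the optional "area" column
second = pd.DataFrame(
    {"category_id": [1, 0], "image_id": [2, 1]},
    index=pd.Index([11, 10], name="id"),
)
sketch_compare_annotations(first, second)  # no error raised
```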
- assert_frame_intersections_equal(df1: DataFrame, df2: DataFrame) → None
Construct inner dataframes from overlapping ids and columns and check they are equal.
These are the rows and columns present in both images dataframes. The two dataframes must have the same values for the merge to be valid.
- Parameters:
df1 – First dataframe to test
df2 – Second dataframe to test
- Raises:
AssertionError – Raised if the two sub-dataframes constructed from the intersections of indexes and columns are not the same.
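A minimal sketch of the intersection check, assuming the comparison is done with pandas.testing.assert_frame_equal (the source does not state which comparison is used). The helper name is hypothetical.

```python
import pandas as pd


def sketch_assert_intersections_equal(df1, df2):
    """Hypothetical re-implementation, for illustration only."""
    # Rows and columns present in both dataframes
    common_rows = df1.index.intersection(df2.index)
    common_columns = df1.columns.intersection(df2.columns)
    pd.testing.assert_frame_equal(
        df1.loc[common_rows, common_columns],
        df2.loc[common_rows, common_columns],
    )


images1 = pd.DataFrame({"width": [640, 800], "height": [480, 600]}, index=[1, 2])
images2 = pd.DataFrame({"width": [800, 1024], "split": ["train", "val"]}, index=[2, 3])

# Only row 2 and column "width" are shared, and their values agree
sketch_assert_intersections_equal(images1, images2)
```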
- assert_ids_well_formed(dataset: Dataset) → None
Assert ids follow the right convention.
DataFrames indexes must be named “id”
Indexes must have no duplicates
images relative_path column must have no duplicates
annotation image_id values must all be in images index
annotation category_id values must be in dataset’s label map
Note
Todo: Better error messages
- Parameters:
dataset – Dataset object to test.
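The conventions listed for assert_ids_well_formed can be sketched on two plain dataframes and a dictionary. This is a hypothetical helper that returns the violated conventions instead of raising; it assumes the label map is a dictionary mapping category ids to names.

```python
import pandas as pd


def sketch_id_problems(images, annotations, label_map):
    """Hypothetical helper returning the list of violated conventions."""
    problems = []
    for name, frame in (("images", images), ("annotations", annotations)):
        if frame.index.name != "id":
            problems.append(f"{name} index is not named 'id'")
        if frame.index.has_duplicates:
            problems.append(f"{name} index has duplicates")
    if images["relative_path"].duplicated().any():
        problems.append("images relative_path has duplicates")
    if not annotations["image_id"].isin(images.index).all():
        problems.append("some image_id values are not in images index")
    if not annotations["category_id"].isin(list(label_map)).all():
        problems.append("some category_id values are not in label map")
    return problems


images = pd.DataFrame(
    {"relative_path": ["a.jpg", "b.jpg"]}, index=pd.Index([0, 1], name="id")
)
annotations = pd.DataFrame(
    {"image_id": [0, 5], "category_id": [1, 1]}, index=pd.Index([0, 1], name="id")
)
# image_id 5 does not exist in the images index
problems = sketch_id_problems(images, annotations, label_map={1: "cat"})
```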
- assert_images_valid(dataset: Dataset, assert_is_symlink: bool = False, load_images: bool = True, check_exhaustive: bool = False) → None
Checks the image paths in the dataset. Namely, checks that all paths point to a file, and are in a valid file format that can be loaded with imageio.
Note
Todo: better error messages
- Parameters:
dataset – Dataset to check
assert_is_symlink – If set, will check that paths are symlinks rather than files. Defaults to False.
load_images – If set to True, will not only check that images are valid files, but also that images can be loaded (i.e. are not corrupted files) and that their sizes match the ones included in the dataset.images dataframe. Note that this makes the function significantly slower. Defaults to True.
check_exhaustive – If set to True, will check that all images in the images_root folder are in the images dataframe, and that the dataset is indeed exhaustive. Defaults to False.
- assert_label_map_well_formed(dataset: Dataset) → None
Assert label map has no duplicate category names.
- Parameters:
dataset – dataset to test.
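As a small illustration, duplicate category names can be detected by counting the values of the label map. This sketch assumes the label map is a dictionary mapping category ids to names; the helper name is hypothetical.

```python
from collections import Counter


def sketch_duplicate_category_names(label_map):
    """Hypothetical helper listing category names used by more than one id."""
    counts = Counter(label_map.values())
    return sorted(name for name, count in counts.items() if count > 1)


# "cat" is assigned to two different category ids
duplicates = sketch_duplicate_category_names({0: "cat", 1: "dog", 2: "cat"})
```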
- assert_required_columns_present(input_df: DataFrame, required_columns: set[str], df_name: str) → None
Simple function to check that required columns are present, and raise a custom error if that is not the case.
- Parameters:
input_df – dataframe object to check.
required_columns – set of column names to find in the columns of input_df.
df_name – name of the dataframe, used to add context to the error message.
- Raises:
ValueError – Raised when not all required columns are present in the columns of input_df.
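The check amounts to a set difference between the required names and the actual column names. The sketch below works on a plain iterable of column names (for a DataFrame, input_df.columns); the helper name and the message wording are assumptions.

```python
def sketch_assert_required_columns(columns, required_columns, df_name):
    """Hypothetical re-implementation working on an iterable of column names."""
    missing = set(required_columns) - set(columns)
    if missing:
        raise ValueError(f"{df_name} is missing required columns: {sorted(missing)}")


try:
    sketch_assert_required_columns(
        columns=["relative_path", "width"],
        required_columns={"relative_path", "width", "height"},
        df_name="images",
    )
except ValueError as error:
    missing_message = str(error)
```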
- full_check_dataset_detection(dataset: Dataset, check_symlink: bool = False, allow_keypoints: bool = False, check_exhaustive: bool = False) → None
Perform a full check of the dataset. Images must be reachable for the check to run.
- Parameters:
dataset – dataset to test
check_symlink – If set to True, will check that image paths are indeed symbolic links and not actual files. Defaults to False.
allow_keypoints – If set to True, will not raise an error for bounding boxes with size 0 (width or height). Defaults to False.
check_exhaustive – If set to True, will check that all images in the images_root folder are in the image dataframe, and that the dataset is indeed exhaustive
- get_invalid_images(dataset: Dataset, check_symlink: bool = False, load_images: bool = True, check_exhaustive: bool = False, raise_if_error: bool = True) → DataFrame
Checks dataset’s images and returns an indexed error report to retrieve them.
Namely, checks that all paths point to a file, and are in a valid file format that can be loaded with imageio. If a check fails, a row is added to the output dataframe with the same index as the faulty image, and info about the error in the corresponding columns.
- Parameters:
dataset – Dataset to check
check_symlink – If set, will check that paths are symlinks rather than files. Defaults to False.
load_images – If set to True, will not only check that images are valid files, but also that images can be loaded (i.e. are not corrupted files) and that their sizes match the ones included in the dataset.images dataframe. Note that this makes the function significantly slower. Defaults to True.
check_exhaustive – If set to True, will check that all images in the images_root folder are in the images dataframe, and that the dataset is indeed exhaustive. Defaults to False.
raise_if_error – If set to True, will raise an InvalidImage error as soon as one image does not meet the requirements. Defaults to True.
- Raises:
InvalidImage – Raised if raise_if_error is set and one image is not valid. This can be because the path is wrong, the image loading failed, or the metadata does not match the actual image data.
MissingImages – Raised if raise_if_error is set and some images were found in the images_root folder but not in the dataset’s images dataframe.
- Returns:
Error report in the form of a DataFrame with “reason” and “additional_info” columns. Index values are the same as the corresponding images in the original dataset, so that you can retrieve the full data of the faulty images.
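The idea of an error report indexed like the images can be sketched with the standard library alone. This hypothetical helper only checks that each path points to a file and stores the report as a dictionary keyed by image id; the reason string is an assumption, and image loading, symlink and exhaustiveness checks are left out.

```python
import tempfile
from pathlib import Path


def sketch_invalid_image_report(images_root, relative_paths):
    """Hypothetical helper; relative_paths maps image id -> relative path."""
    report = {}
    for image_id, relative_path in relative_paths.items():
        path = Path(images_root) / relative_path
        if not path.is_file():
            # Same key as the faulty image, so it can be looked up afterwards
            report[image_id] = {
                "reason": "file not found",  # assumed wording
                "additional_info": str(path),
            }
    return report


with tempfile.TemporaryDirectory() as root:
    (Path(root) / "a.jpg").write_bytes(b"")  # placeholder file, not a real image
    report = sketch_invalid_image_report(root, {10: "a.jpg", 11: "missing.jpg"})
```

Because the report keys match the image ids, the faulty entries can be looked up directly in the original images table.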
- get_malformed_bounding_boxes(dataset: Dataset, allow_keypoints: bool = False, raise_if_error: bool = False) → DataFrame
Get malformed bounding boxes in dataset’s annotations, as a boolean dataframe where the index is the id of the bounding box in the dataset’s annotations dataframe, and the columns are known reasons for bounding boxes to be invalid.
Boxes x and y coordinates must be within their respective image size.
Boxes width and height must be positive, and such that xmax and ymax are within their respective image size.
In the case of keypoints, boxes with size 0 are tolerated.
An invalid bounding box then corresponds to a row in the result dataframe where at least one of the values is True. Note that valid bounding boxes are NOT in the result dataframe. This means that if the dataset has no invalid bounding box, the result dataframe will be empty, and each row in the result dataframe has at least one True value.
- Parameters:
dataset – Dataset to test
allow_keypoints – If set to True, will not raise error if bounding box size (width or height) is 0. Defaults to False.
raise_if_error – If set to True, will raise an error as soon as one bounding box is detected to be invalid. Defaults to False.
- Raises:
AssertionError – When raise_if_error is set, raise an error as soon as one bounding box is invalid.
- Returns:
Error report as a dataframe with boolean columns.
Each column is a reason why the bounding box can be faulty.
Each row is a faulty bounding box, with its corresponding index in dataset’s annotations dataframe. Its values explain how the bounding box is invalid.
Only the faulty bounding boxes are kept in the error report, so all rows have at least one value set to True.
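The shape of such a report can be sketched with pandas. Everything specific here is an assumption made for illustration: the helper name, the box column names (box_x_min, box_y_min, box_width, box_height), the image size columns (width, height) and the reason column names. Only image_id comes from the documentation above.

```python
import pandas as pd


def sketch_malformed_boxes(images, annotations, allow_keypoints=False):
    """Hypothetical re-implementation; column names are assumptions."""
    # Attach the size of its image to each box through image_id
    sizes = images.loc[annotations["image_id"].to_list(), ["width", "height"]]
    sizes.index = annotations.index
    x_min, y_min = annotations["box_x_min"], annotations["box_y_min"]
    box_w, box_h = annotations["box_width"], annotations["box_height"]
    if allow_keypoints:
        bad_size = (box_w < 0) | (box_h < 0)  # size 0 is tolerated
    else:
        bad_size = (box_w <= 0) | (box_h <= 0)
    report = pd.DataFrame(
        {
            "origin_outside_image": (x_min < 0)
            | (y_min < 0)
            | (x_min > sizes["width"])
            | (y_min > sizes["height"]),
            "invalid_size": bad_size,
            "end_outside_image": (x_min + box_w > sizes["width"])
            | (y_min + box_h > sizes["height"]),
        }
    )
    # Keep only faulty boxes: rows with at least one True value
    return report[report.any(axis=1)]


images = pd.DataFrame({"width": [100], "height": [50]}, index=pd.Index([0], name="id"))
annotations = pd.DataFrame(
    {
        "image_id": [0, 0, 0],
        "box_x_min": [10, 90, 20],
        "box_y_min": [10, 10, 20],
        "box_width": [20, 20, 0],
        "box_height": [20, 20, 0],
    },
    index=pd.Index([0, 1, 2], name="id"),
)
# Box 1 exceeds the image width, box 2 has a size of 0
report = sketch_malformed_boxes(images, annotations)
```

In this sketch, the valid box 0 is absent from the report, and calling the helper with allow_keypoints=True also drops box 2, whose only fault is its zero size.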