Dataset

class Dataset(images_root: Path | None = None, images: DataFrame | None = None, annotations: DataFrame | None = None, label_map: dict[int, str] | None = None, dataset_name: str | None = None)

Bases: object

Base class for dataset manipulation.

The behaviour of the Dataset is inspired by NumPy arrays and pandas DataFrames.

See also

Main Constructor

Parameters:
  • images_root – root path that the relative_path values in images are relative to

  • images – DataFrame comprising image data. Annotations refer to the rows of this DataFrame through their image_id column

  • annotations – DataFrame comprising annotation data. Must have at least an image_id column

  • label_map – Mapping from category_id to category_str, for the case where the annotations have a category_id column. Useful for detection and classification

  • dataset_name – Optional name for the dataset. Used by functions that need a name when it cannot easily be deduced from images_root

See also

from_template()

Example

>>> Dataset()
Dataset object containing 0 image and 0 object
Name :
    None
Images root :
    .
Images :
Empty DataFrame
Columns: [width, height, relative_path, type]
Index: []
Annotations :
Empty DataFrame
Columns: [image_id, category_str, category_id, box_x_min, box_y_min, box_width, box_height]
Index: []
Label map :
{}
>>> images = pd.DataFrame(
...     data={
...         "width": [1920, 1280],
...         "height": [1080, 720],
...         "relative_path": [Path("0.jpg"), Path("1.jpg")],
...         "split": ["train", "valid"],
...     },
...     index=[0, 1],
... )
>>> annotations = pd.DataFrame(
...     data={
...         "image_id": [0, 1],
...         "category_id": [1, 0],
...         "box_x_min": [10, 20],
...         "box_y_min": [30, 40],
...         "box_width": [100, 200],
...         "box_height": [200, 300],
...     },
...     index=[2, 3],
... )
>>> label_map = {0: "this", 1: "that"}
>>> Dataset(
...     images=images,
...     annotations=annotations,
...     label_map=label_map,
...     dataset_name="my_dataset",
... )
Dataset object containing 2 images and 2 objects
Name :
    my_dataset
Images root :
    .
Images :
    width  height relative_path  type  split
id
0    1920    1080         0.jpg  .jpg  train
1    1280     720         1.jpg  .jpg  valid
Annotations :
    image_id category_str  category_id  ... box_y_min  box_width  box_height
id                                      ...
2          0         that            1  ...      30.0      100.0       200.0
3          1         this            0  ...      40.0      200.0       300.0

[2 rows x 8 columns]
Label map :
{0: 'this', 1: 'that'}
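The example above relies on every image_id in annotations matching an id in the index of images. That link can be checked with plain pandas before building a Dataset. The sketch below is only an illustration: find_orphan_annotations is a hypothetical helper, not part of the Dataset API, and it makes no claim about how the library validates its inputs.

```python
import pandas as pd


def find_orphan_annotations(images: pd.DataFrame, annotations: pd.DataFrame) -> pd.DataFrame:
    """Return the annotations whose image_id is absent from the images index.

    Hypothetical helper for illustration; not part of the Dataset API.
    """
    is_known = annotations["image_id"].isin(images.index)
    return annotations[~is_known]


images = pd.DataFrame(
    {"width": [1920, 1280], "height": [1080, 720]},
    index=pd.Index([0, 1], name="id"),
)
annotations = pd.DataFrame(
    {"image_id": [0, 1, 7], "category_id": [1, 0, 0]},
    index=pd.Index([2, 3, 4], name="id"),
)

# Annotation 4 points to image id 7, which does not exist in images.
orphans = find_orphan_annotations(images, annotations)
print(orphans.index.tolist())
```

An empty result means every annotation points to an existing image row.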

Attributes

booleanized_columns: dict[str, set[str]] = {'annotations': {}, 'images': {}}
dataset_name: str | None
images_root: Path
images: DataFrame
annotations: DataFrame
label_map: dict[int, str]

Methods

__getitem__(args)

__getitem__ implementation for the Dataset object.

__len__()

Return number of images in dataset.

add_detection_annotation(image_id, ...[, ...])

Add one or multiple detection annotations to the current dataset.

annotation_append([format_string, ...])

Create a context manager to add detection tensors to the current dataset with the AnnotationAppender.append() method, as if the Dataset was a list.

booleanize([column_names, missing_ok])

Convert the given columns of self.images or self.annotations from lists into columns of booleans.
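The idea of turning a column of lists into boolean columns can be sketched with plain pandas. This is an illustration of the concept only: booleanize_column is a hypothetical function, and the "<column>.<value>" naming of the resulting columns is an assumption that may differ from what booleanize() actually produces.

```python
import pandas as pd


def booleanize_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace a column of lists with one boolean column per distinct value.

    Hypothetical helper; the output column names are an assumed convention.
    """
    distinct = sorted({value for items in df[column] for value in items})
    result = df.drop(columns=[column])
    for value in distinct:
        # True where the original list contained this value.
        result[f"{column}.{value}"] = [value in items for items in df[column]]
    return result


images = pd.DataFrame({"tags": [["night", "rain"], ["rain"], []]})
booleanized = booleanize_column(images, "tags")
print(booleanized)
```

Boolean columns are easier to filter and aggregate than lists, which is presumably why the conversion is offered in both directions.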

cap_bounding_box_coordinates()

Method to ensure the bounding box coordinates are inside the picture frame.
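Capping can be pictured as clipping both corners of each box to the frame of its image, then recomputing width and height. The sketch below uses numpy and pandas with the column names shown in the constructor example; cap_boxes is a hypothetical function, and it assumes pixel coordinates. Whether the library method handles degenerate boxes differently is not covered here.

```python
import numpy as np
import pandas as pd


def cap_boxes(annotations: pd.DataFrame, images: pd.DataFrame) -> pd.DataFrame:
    """Clip (x_min, y_min, width, height) boxes to their image frame.

    Hypothetical helper; assumes pixel coordinates.
    """
    # Frame size of the image each annotation belongs to.
    frame_w = images.loc[annotations["image_id"], "width"].to_numpy()
    frame_h = images.loc[annotations["image_id"], "height"].to_numpy()

    x_max = (annotations["box_x_min"] + annotations["box_width"]).to_numpy()
    y_max = (annotations["box_y_min"] + annotations["box_height"]).to_numpy()

    # Clip both corners, then rebuild width and height from them.
    x_min = np.clip(annotations["box_x_min"].to_numpy(), 0, frame_w)
    y_min = np.clip(annotations["box_y_min"].to_numpy(), 0, frame_h)
    x_max = np.clip(x_max, 0, frame_w)
    y_max = np.clip(y_max, 0, frame_h)

    return annotations.assign(
        box_x_min=x_min,
        box_y_min=y_min,
        box_width=x_max - x_min,
        box_height=y_max - y_min,
    )


images = pd.DataFrame({"width": [100], "height": [50]}, index=pd.Index([0], name="id"))
annotations = pd.DataFrame(
    {
        "image_id": [0, 0],
        "box_x_min": [-10, 80],
        "box_y_min": [5, 40],
        "box_width": [30, 40],
        "box_height": [10, 20],
    }
)
capped = cap_boxes(annotations, images)
print(capped)
```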

check([check_symlink, allow_keypoints, ...])

Run a full check of the dataset: ids, bounding boxes, label map and images.

debooleanize([dataframe])

Convert booleanized columns back to list form, for export purposes.

empty_annotations()

Create a dataset object with an empty annotation dataframe, but with the same columns, and the same images dataframe.

filter_annotations(index[, mode, ...])

Method equivalent of loc_annot and iloc_annot, with the option to also remove images left without annotations.

filter_images(index[, mode])

Method equivalent of Dataset.loc and Dataset.iloc.

from_template([reset_booleanized])

Create a new Dataset object from an existing Dataset.

get_annotations_attributes()

Get the name of columns related to annotations attributes.

get_image_attributes()

Get the name of columns related to image attributes.

get_one_frame(n)

Sample a single image from the dataset.

get_split(split)

Get a particular split from the dataset

init_annotations()

Initialize annotations by adding info and checking index

init_images()

Initialize images by checking required fields are present and converting fields to the right dtype.

iter_images()

Iterate through images, by yielding

iter_splits()

Iterate through the split values of the dataset, yielding for each split its name and the corresponding sub-dataset.

keep_classes(to_keep[, remove_emptied_images])

Perform a simple remapping, where the given classes are kept and the others are removed.

len_annot()

Return the total number of annotations.

match_index(other_images[, on, remove_unmatched])

Reindex a dataset from another images DataFrame.

merge(other[, allow_overlapping_image_ids, ...])

Merge two datasets and return a single dataset object containing samples from both.

remap_classes(class_mapping[, new_names, ...])

Remap class ids and names according to a dictionary.
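A dictionary-based remapping of category ids can be sketched with pandas as follows. This is not the library's implementation: remap_category_ids is a hypothetical function, and dropping the annotations whose class is missing from the mapping is an assumption made for this sketch. The actual method may expose that choice through its own parameters.

```python
import pandas as pd


def remap_category_ids(annotations: pd.DataFrame, class_mapping: dict) -> pd.DataFrame:
    """Map category_id through class_mapping (old id -> new id).

    Hypothetical helper; annotations of unmapped classes are dropped (assumption).
    """
    mapped = annotations["category_id"].map(class_mapping)
    keep = mapped.notna()
    return annotations[keep].assign(category_id=mapped[keep].astype(int))


annotations = pd.DataFrame(
    {"image_id": [0, 0, 1], "category_id": [0, 1, 2]},
    index=pd.Index([10, 11, 12], name="id"),
)
# Merge classes 0 and 1 into class 0; class 2 has no target and is dropped.
remapped = remap_category_ids(annotations, {0: 0, 1: 0})
print(remapped)
```

A matching label map for the new ids would be supplied separately, which is presumably the role of the new_names argument.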

remap_from_csv(csv[, remove_not_mapped, ...])

Same as remap_classes(), but takes the path to a CSV file instead of a dictionary.

remap_from_dataframe(df[, ...])

Same as remap_classes(), but takes a dataframe instead of a dictionary.

remap_from_other(other[, remove_not_mapped, ...])

Try to remap classes of dataset to match the ones in another dataset by retrieving categories with the same name.

remap_from_preset(input_dataset_map, ...[, ...])

Same as remap_classes(), but takes the name of a preset instead of a dictionary.

remove_classes(to_remove[, ...])

Perform a simple remapping, where the given classes are removed.

remove_empty_images()

Remove images without annotations from dataset.

remove_invalid_annotations([...])

Remove invalid annotations from dataset.

remove_invalid_images([load_images])

Remove invalid images from dataset.

rename(dataset_name)

Simple function to change the name of the dataset.

reset_images_root(new_path)

Replace the images_root with a new path.

reset_index([start_image_id, ...])

Reset the index of the self.images dataframe and of self.annotations, while keeping the 'image_id' column in self.annotations pointing to the right rows in the self.images dataframe.
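Keeping image_id consistent while renumbering images amounts to building an old-id to new-id map and applying it to the annotations. The pandas sketch below illustrates this under stated assumptions: reset_image_ids is a hypothetical function, start_image_id is borrowed from the method signature above, and the new ids are assumed to be consecutive integers.

```python
import pandas as pd


def reset_image_ids(images: pd.DataFrame, annotations: pd.DataFrame, start_image_id: int = 0):
    """Renumber images consecutively and update annotations["image_id"] to match.

    Hypothetical helper; consecutive new ids are an assumption of this sketch.
    """
    new_ids = list(range(start_image_id, start_image_id + len(images)))
    id_map = dict(zip(images.index, new_ids))  # old id -> new id

    new_images = images.copy()
    new_images.index = pd.Index(new_ids, name=images.index.name)

    # Point every annotation at the new id of its image, then renumber annotations.
    new_annotations = annotations.assign(image_id=annotations["image_id"].map(id_map))
    new_annotations.index = pd.RangeIndex(len(annotations), name=annotations.index.name)
    return new_images, new_annotations


images = pd.DataFrame({"relative_path": ["a.jpg", "b.jpg"]}, index=pd.Index([10, 42], name="id"))
annotations = pd.DataFrame({"image_id": [42, 10, 42]}, index=pd.Index([5, 8, 9], name="id"))

new_images, new_annotations = reset_image_ids(images, annotations)
print(new_images.index.tolist(), new_annotations["image_id"].tolist())
```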

reset_index_from_mapping([images_index_map, ...])

Reset the index of the images and annotations dataframes with index maps (index -> new_index), where each value is the new index to apply.

simple_split([input_seed, split_names, ...])

Simple version of splitting method, splitting images randomly.
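A seeded random split over images can be sketched as below. The function random_split and its split_shares parameter are hypothetical and introduced only for illustration; input_seed and split_names are borrowed from the method signature above, but their exact semantics in the library are not confirmed here.

```python
import numpy as np
import pandas as pd


def random_split(
    images: pd.DataFrame,
    split_names=("train", "valid"),
    split_shares=(0.8, 0.2),  # hypothetical parameter: probability of each split
    input_seed: int = 0,
) -> pd.DataFrame:
    """Assign each image to a split at random, reproducibly for a given seed."""
    rng = np.random.default_rng(input_seed)
    assigned = rng.choice(list(split_names), size=len(images), p=list(split_shares))
    return images.assign(split=assigned)


images = pd.DataFrame({"relative_path": [f"{i}.jpg" for i in range(10)]})
split_a = random_split(images, input_seed=1)
split_b = random_split(images, input_seed=1)
print(split_a["split"].value_counts())
```

Because each image is drawn independently, the realized proportions only approximate the requested shares, especially on small datasets.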

split([input_seed, split_names, ...])

Perform the split operation on annotations and images.

to_caipy(output_path[, use_schema, ...])

Convert dataset to cAIpy format.

to_caipy_generic(output_images_folder, ...)

Convert dataset to cAIpy format, but with the possibility to specify images and annotations folders rather than a root folder with Images and Annotations sub-folders.

to_coco(output_path[, copy_images, to_jpg, ...])

Save dataset in COCO format.

to_darknet(output_path[, copy_images, ...])

Save dataset in darknet format, readable by darknet.

to_fiftyone([dataset_name, ...])

Convert the dataset into a FiftyOne dataset that can then be inspected with FiftyOne's web app.

to_parquet(output_dir[, overwrite])

Save dataset object to a folder containing parquet files for dataframes and a metadata.yaml file for other attributes.

to_yolov5(output_path[, copy_images, ...])

Save dataset in format readable by Yolov5.

to_yolov7(output_path[, copy_images, ...])

Save dataset in format readable by Yolov7.

iloc

Filter a dataset by indexing the images you want with their row number.

iloc_annot

Filter a dataset by indexing the annotations you want with their row number.

loc

Filter a dataset by indexing the images you want with their id.

loc_annot

Filter a dataset by indexing the annotations you want with their id.
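These four indexers borrow their names from pandas, where loc selects by index label (here, the id) and iloc selects by row position. The difference is easy to see on a plain DataFrame. The sketch below uses pandas only and makes no claim about the exact arguments the Dataset indexers accept.

```python
import pandas as pd

images = pd.DataFrame(
    {"relative_path": ["a.jpg", "b.jpg", "c.jpg"]},
    index=pd.Index([10, 20, 30], name="id"),
)

# Select by id: the rows whose index label is 10 or 30.
by_id = images.loc[[10, 30]]

# Select by row number: the first two rows, whatever their ids are.
by_position = images.iloc[[0, 1]]

print(by_id["relative_path"].tolist())
print(by_position.index.tolist())
```

The two selections only coincide when ids happen to be consecutive integers starting at zero, for instance right after an index reset.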