Dataset 101 : Mechanics#

This notebook makes use of lours’s Dataset object. It shows how to load a dataset from a known format, how to merge two datasets, how to remap classes, and how to write a dataset to disk in a chosen format.

[1]:

%load_ext autoreload

%autoreload 2
from lours.dataset import Dataset, from_coco
from lours.utils.testing import assert_dataset_equal

Here we load a COCO validation file from the test folders. Note that the cAIpy and darknet formats can be loaded as well.

[2]:

COCO_dataset = from_coco("notebook_data/coco_valid.json")

[3]:

COCO_dataset

Dataset Sampling#

You can use the .loc[] or .iloc[] interface to sample sub-datasets at the image level. To sample at the annotation level, use .loc_annot[] and .iloc_annot[].

Notes:

With iloc, image indices are not considered, only the row number (as in pandas.DataFrame.iloc), so you might want to reorder the images first, or use loc, which relies on indices.
Indexing with a single number, e.g. dataset[0], gives you a dataset containing only one image, but it is still a Dataset object with two dataframes.
Images are never loaded by the dataset object itself; you need to load them yourself in your pipeline.
The [] operator is equivalent to iloc[].
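The difference between positional and label-based selection can be illustrated without lours. The sketch below is plain Python, not the library's implementation: `select_iloc` and `select_loc` are hypothetical helpers, and the ids are only sample values.

```python
# Plain-Python sketch (not lours code): rows are kept in storage order,
# each one carrying an explicit id, like the index of a pandas DataFrame.
image_ids = [352582, 113354, 58393, 147729]


def select_iloc(ids, positions):
    """Positional selection: only the row number matters, ids are ignored."""
    return ids[positions]


def select_loc(ids, wanted):
    """Label-based selection: keep the requested ids, in the requested order."""
    known = set(ids)
    return [image_id for image_id in wanted if image_id in known]


every_other = select_iloc(image_ids, slice(None, None, 2))
by_label = select_loc(image_ids, [58393, 352582])
print(every_other)  # [352582, 58393]
print(by_label)  # [58393, 352582]
```

Reordering the rows changes what `select_iloc` returns but not what `select_loc` returns, which is why sorting the images first matters when using iloc.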

Image based sampling#

[4]:

# Only taking 50% of the images
COCO_dataset.iloc[::2]

[5]:

ids_to_keep = COCO_dataset.images.index[COCO_dataset.images.index > 30_000]
print(ids_to_keep)
COCO_dataset.loc[ids_to_keep]

Index([352582, 113354,  58393, 147729, 310072,  50149, 519208, 356125,  38048,
       567825,
       ...
       166478, 185409, 577976, 189806, 363188, 311180, 302030, 105455, 428280,
       349837],
      dtype='int64', name='id', length=4722)

This is equivalent to using the filter_images method with mode="loc".

[6]:

COCO_dataset.filter_images(ids_to_keep, mode="loc")

Annotation based sampling#

Remove half the annotations

[7]:

COCO_dataset.iloc_annot[::2]

Remove half the annotations, and remove the images that end up with no annotation left (but keep the ones that were already empty).

[8]:

to_keep = COCO_dataset.annotations.index[::2]
filtered = COCO_dataset.filter_annotations(
    to_keep, mode="loc", remove_emptied_images=True
)
display(filtered)

You can also use slice(None, None, 2) with the iloc mode

[9]:

filtered_2 = COCO_dataset.filter_annotations(
    slice(None, None, 2), mode="iloc", remove_emptied_images=True
)

assert_dataset_equal(filtered, filtered_2)
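To make the effect of remove_emptied_images concrete, here is a minimal plain-Python sketch of the behaviour described above. `filter_annotations_sketch` is a hypothetical helper working on simple dictionaries, not the lours method, and it only mirrors the rule stated in the text: images that lose all their annotations are dropped, images that never had any are kept.

```python
def filter_annotations_sketch(image_ids, annotations, keep, remove_emptied_images=False):
    """Hypothetical helper. `annotations` maps annotation id -> image id."""
    kept = {
        annot_id: image_id
        for annot_id, image_id in annotations.items()
        if annot_id in keep
    }
    if not remove_emptied_images:
        return list(image_ids), kept
    # Images that had annotations before filtering but have none left afterwards
    emptied = set(annotations.values()) - set(kept.values())
    return [image_id for image_id in image_ids if image_id not in emptied], kept


image_ids = [1, 2, 3]  # image 3 has no annotation to begin with
annotations = {10: 1, 11: 1, 12: 2}
remaining_images, remaining_annotations = filter_annotations_sketch(
    image_ids, annotations, keep={10, 11}, remove_emptied_images=True
)
print(remaining_images)  # [1, 3]
```

Image 2 is removed because filtering emptied it, while image 3 stays because it was already empty.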

Iterating through the dataset#

You can iterate through the dataset. Each iteration yields a Dataset object containing a single image.

[10]:

for single_image_dataset in COCO_dataset[:2]:
    display(single_image_dataset)

The iter_images method directly gives you the image and annotations dataframes, instead of Dataset objects with a single image.

[11]:

for image, annotations in COCO_dataset[:2].iter_images():
    print(image)
    display(annotations)

width                                      425
height                                     640
relative_path    Images/valid/000000352582.jpg
type                                      .jpg
split                                    valid
Name: 352582, dtype: object

	image_id	category_str	category_id	split	box_x_min	box_y_min	box_width	box_height	area
id
460450	352582	person	1	valid	112.43	195.32	214.78	438.19	48685.6791
535917	352582	person	1	valid	0.00	256.00	80.54	376.81	22650.7380
602093	352582	frisbee	34	valid	171.63	424.03	85.89	40.67	2605.7209

width                                      640
height                                     480
relative_path    Images/valid/000000113354.jpg
type                                      .jpg
split                                    valid
Name: 113354, dtype: object

	image_id	category_str	category_id	split	box_x_min	box_y_min	box_width	box_height	area
id
589077	113354	zebra	24	valid	260.99	158.88	141.52	194.11	9978.94125
589740	113354	zebra	24	valid	366.49	174.59	115.67	142.71	5784.68620
592005	113354	zebra	24	valid	3.24	151.28	265.34	175.82	16206.37480
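Conceptually, this kind of iteration pairs each image row with the annotation rows that reference it through image_id. The following is a small sketch of that grouping in plain Python; `iter_images_sketch` and the miniature tables are hypothetical and are not how lours stores or iterates its data.

```python
from collections import defaultdict

# Hypothetical miniature of the two tables held by a dataset
images = {
    352582: {"width": 425, "height": 640},
    113354: {"width": 640, "height": 480},
}
annotations = [
    {"id": 460450, "image_id": 352582, "category_str": "person"},
    {"id": 602093, "image_id": 352582, "category_str": "frisbee"},
    {"id": 589077, "image_id": 113354, "category_str": "zebra"},
]


def iter_images_sketch(images, annotations):
    """Yield (image id, image row, annotation rows of that image)."""
    by_image = defaultdict(list)
    for annotation in annotations:
        by_image[annotation["image_id"]].append(annotation)
    for image_id, image in images.items():
        yield image_id, image, by_image.get(image_id, [])


pairs = [
    (image_id, [a["category_str"] for a in image_annotations])
    for image_id, _, image_annotations in iter_images_sketch(images, annotations)
]
print(pairs)  # [(352582, ['person', 'frisbee']), (113354, ['zebra'])]
```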

[12]:

image, annotations = COCO_dataset[:2].get_one_frame(0)
print(image)
display(annotations)

width                                      425
height                                     640
relative_path    Images/valid/000000352582.jpg
type                                      .jpg
split                                    valid
Name: 352582, dtype: object

	image_id	category_str	category_id	split	box_x_min	box_y_min	box_width	box_height	area
id
460450	352582	person	1	valid	112.43	195.32	214.78	438.19	48685.6791
535917	352582	person	1	valid	0.00	256.00	80.54	376.81	22650.7380
602093	352582	frisbee	34	valid	171.63	424.03	85.89	40.67	2605.7209

Remap classes#

Here we use the COCO -> Pascal preset to convert the COCO classes into the Pascal VOC classes.

[13]:

COCO_dataset.label_map

[13]:

{1: 'person',
 2: 'bicycle',
 3: 'car',
 4: 'motorcycle',
 5: 'airplane',
 6: 'bus',
 7: 'train',
 8: 'truck',
 9: 'boat',
 10: 'traffic light',
 11: 'fire hydrant',
 13: 'stop sign',
 14: 'parking meter',
 15: 'bench',
 16: 'bird',
 17: 'cat',
 18: 'dog',
 19: 'horse',
 20: 'sheep',
 21: 'cow',
 22: 'elephant',
 23: 'bear',
 24: 'zebra',
 25: 'giraffe',
 27: 'backpack',
 28: 'umbrella',
 31: 'handbag',
 32: 'tie',
 33: 'suitcase',
 34: 'frisbee',
 35: 'skis',
 36: 'snowboard',
 37: 'sports ball',
 38: 'kite',
 39: 'baseball bat',
 40: 'baseball glove',
 41: 'skateboard',
 42: 'surfboard',
 43: 'tennis racket',
 44: 'bottle',
 46: 'wine glass',
 47: 'cup',
 48: 'fork',
 49: 'knife',
 50: 'spoon',
 51: 'bowl',
 52: 'banana',
 53: 'apple',
 54: 'sandwich',
 55: 'orange',
 56: 'broccoli',
 57: 'carrot',
 58: 'hot dog',
 59: 'pizza',
 60: 'donut',
 61: 'cake',
 62: 'chair',
 63: 'couch',
 64: 'potted plant',
 65: 'bed',
 67: 'dining table',
 70: 'toilet',
 72: 'tv',
 73: 'laptop',
 74: 'mouse',
 75: 'remote',
 76: 'keyboard',
 77: 'cell phone',
 78: 'microwave',
 79: 'oven',
 80: 'toaster',
 81: 'sink',
 82: 'refrigerator',
 84: 'book',
 85: 'clock',
 86: 'vase',
 87: 'scissors',
 88: 'teddy bear',
 89: 'hair drier',
 90: 'toothbrush'}

[14]:

COCO_pascal = COCO_dataset.remap_from_preset("coco", "pascalvoc")

See how label map tab has changed

[15]:

COCO_pascal

Remap from dictionaries#

Fictional use case where we only want vehicles, objects and animals. If given, new_names must have one entry per distinct value in class_mapping.

[16]:

COCO_RT = COCO_pascal.remap_classes(
    class_mapping={
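        1: 2,  # aeroplane -> Vehicle
        2: 2,  # bicycle -> Vehicle
        3: 1,  # bird -> Animal
        4: 2,  # boat -> Vehicle
        5: 3,  # bottle -> Object
        6: 2,  # bus -> Vehicle
        7: 2,  # car -> Vehicle
        8: 1,  # cat -> Animal
        9: 3,  # chair -> Object
        10: 1,  # cow -> Animal
        11: 3,  # diningtable -> Object
        12: 1,  # dog -> Animal
        13: 1,  # horse -> Animal
        14: 2,  # motorbike -> Vehicle
        16: 3,  # pottedplant -> Object
        17: 1,  # sheep -> Animal
        18: 3,  # sofa -> Object
        19: 2,  # train -> Vehicle
        20: 3,  # tvmonitor -> Object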
    },
    new_names={1: "Animal", 2: "Vehicle", 3: "Object"},
)

[17]:

COCO_RT
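The constraint on new_names can be read as a simple consistency check between the two dictionaries. Below is a minimal sketch of that check in plain Python. `remap_classes_sketch` is a hypothetical stand-in, and dropping annotations whose class is absent from class_mapping is an assumption made for the sketch, not a statement about what lours does.

```python
def remap_classes_sketch(category_ids, class_mapping, new_names):
    """Hypothetical stand-in for a dictionary-based class remapping."""
    targets = set(class_mapping.values())
    if set(new_names) != targets:
        raise ValueError(f"new_names must name exactly these ids: {sorted(targets)}")
    # ASSUMPTION: annotations whose class is not in class_mapping are dropped
    remapped = [class_mapping[c] for c in category_ids if c in class_mapping]
    return remapped, dict(new_names)


remapped_ids, new_label_map = remap_classes_sketch(
    category_ids=[7, 12, 15, 5],
    class_mapping={7: 2, 12: 1, 5: 3},
    new_names={1: "Animal", 2: "Vehicle", 3: "Object"},
)
print(remapped_ids)  # [2, 1, 3]
```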

Remap from dataframe#

The dataframe used for remapping must have at least 2 columns: input_category_id and output_category_id.

If available, output_category_name will be used to replace the names of the remapped ids.

input_category_name only serves an informative purpose.

[18]:

import pandas as pd

class_table = (
    pd.Series(COCO_pascal.label_map).rename("input_category_name").sort_index()
)
class_table.index.rename("input_category_id", inplace=True)
class_table = class_table.reset_index().drop(15)
class_table["output_category_id"] = [
    2,
    2,
    1,
    2,
    3,
    2,
    2,
    1,
    3,
    1,
    3,
    1,
    1,
    2,
    3,
    1,
    3,
    2,
    3,
]
class_table["output_category_name"] = class_table["output_category_id"].replace(
    {1: "animal", 2: "vehicle", 3: "object"}
)

[19]:

class_table

[19]:

	input_category_id	input_category_name	output_category_id	output_category_name
0	1	aeroplane	2	vehicle
1	2	bicycle	2	vehicle
2	3	bird	1	animal
3	4	boat	2	vehicle
4	5	bottle	3	object
5	6	bus	2	vehicle
6	7	car	2	vehicle
7	8	cat	1	animal
8	9	chair	3	object
9	10	cow	1	animal
10	11	diningtable	3	object
11	12	dog	1	animal
12	13	horse	1	animal
13	14	motorbike	2	vehicle
14	15	person	3	object
16	17	sheep	1	animal
17	18	sofa	3	object
18	19	train	2	vehicle
19	20	tvmonitor	3	object

[20]:

COCO_RT_DF = COCO_pascal.remap_from_dataframe(class_table)

[21]:

COCO_RT_DF

Remap from CSV#

Basically the same as remapping from a dataframe, except that the input is a CSV file containing the same data.

[22]:

csv_file = "remap.csv"
class_table.to_csv(csv_file, index=False)

[23]:

!cat remap.csv

input_category_id,input_category_name,output_category_id,output_category_name
1,aeroplane,2,vehicle
2,bicycle,2,vehicle
3,bird,1,animal
4,boat,2,vehicle
5,bottle,3,object
6,bus,2,vehicle
7,car,2,vehicle
8,cat,1,animal
9,chair,3,object
10,cow,1,animal
11,diningtable,3,object
12,dog,1,animal
13,horse,1,animal
14,motorbike,2,vehicle
15,person,3,object
17,sheep,1,animal
18,sofa,3,object
19,train,2,vehicle
20,tvmonitor,3,object

[24]:

COCO_RT_CSV = COCO_pascal.remap_from_csv(csv_file)

[25]:

COCO_RT_CSV

Remap from other dataset#

This method tries to retrieve the label names in the other dataset and applies a remapping accordingly.

Classes that are not in the other dataset are mapped to an id that is still free in the other dataset’s label map.

[26]:

COCO_RT_other = COCO_pascal.remap_from_other(COCO_RT_CSV)
COCO_RT_other

Using the following class remapping dictionary :
{1: 22,
 2: 21,
 3: 23,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20}
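The matching logic described above can be sketched in a few lines of plain Python. `build_remapping` is a hypothetical helper, not the lours implementation: it matches classes by name, keeps an id when it is still free in the other label map, and otherwise picks a fresh id. The fresh ids it chooses need not be the same as the ones printed by the library.

```python
def build_remapping(label_map, other_label_map):
    """Hypothetical sketch: map ids of label_map onto ids of other_label_map."""
    name_to_other_id = {name: class_id for class_id, name in other_label_map.items()}
    used = set(other_label_map)
    next_free = max(set(label_map) | used) + 1
    remapping = {}
    for class_id, name in sorted(label_map.items()):
        if name in name_to_other_id:
            # Same name in both label maps: adopt the other dataset's id
            remapping[class_id] = name_to_other_id[name]
        elif class_id not in used:
            # Unknown name, but the id is still free on the other side: keep it
            remapping[class_id] = class_id
            used.add(class_id)
        else:
            # Unknown name and id already taken: move to a fresh id
            remapping[class_id] = next_free
            used.add(next_free)
            next_free += 1
    return remapping


remapping = build_remapping(
    {1: "aeroplane", 2: "bicycle", 3: "bird", 4: "boat"},
    {1: "animal", 2: "vehicle", 3: "object"},
)
print(remapping)  # {1: 5, 2: 6, 3: 7, 4: 4}
```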

Dataset Reindexing#

Resetting index#

The reset_index method allows you to reorder the dataset’s dataframes according to some column values

[27]:

COCO_dataset.reset_index()

Sort the annotations by category string first, so that the annotations dataframe starts with airplanes and ends with zebras.

[28]:

reset_COCO_dataset = COCO_dataset.reset_index(
    start_image_id=10,
    start_annotations_id=2,
    sort_annotations_by=("category_str", "image_id"),
)
reset_COCO_dataset

Reindex with mapping#

Akin to class remapping, you can also remap the dataset’s dataframe indexes with dictionaries. Note that unmapped index values will be reset to a range index, but they will not be sorted. Be sure to sort the dataframes the way you want before calling the method reset_index_from_mapping with an incomplete index mapping.

[29]:

COCO_dataset.reset_index_from_mapping(
    images_index_map={58393: 0}, annotations_index_map={331107: 0}
)

Reindex images index from other dataframe#

This feature is similar to pandas’ merge function: by selecting the columns to merge on, the dataset constructs an index mapping for the entries that are present in both the original images dataframe and the other dataframe, and optionally remaps the remaining rows to a simple range index.

[30]:

matched_COCO = COCO_dataset.match_index(reset_COCO_dataset.images, on="relative_path")
display(matched_COCO)

matched_COCO.images.sort_index()

[30]:

	width	height	relative_path	type	split
id
10	640	426	Images/valid/000000000139.jpg	.jpg	valid
11	586	640	Images/valid/000000000285.jpg	.jpg	valid
12	640	483	Images/valid/000000000632.jpg	.jpg	valid
13	375	500	Images/valid/000000000724.jpg	.jpg	valid
14	428	640	Images/valid/000000000776.jpg	.jpg	valid
...	...	...	...	...	...
5005	640	354	Images/valid/000000581317.jpg	.jpg	valid
5006	612	612	Images/valid/000000581357.jpg	.jpg	valid
5007	640	427	Images/valid/000000581482.jpg	.jpg	valid
5008	478	640	Images/valid/000000581615.jpg	.jpg	valid
5009	640	478	Images/valid/000000581781.jpg	.jpg	valid

5000 rows × 5 columns
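The idea of matching indexes on a shared column can be sketched with two small dictionaries. `build_index_mapping` is a hypothetical helper written for this illustration, and the rows below are sample values, not an extract of the real dataset.

```python
def build_index_mapping(images, other_images, on):
    """Hypothetical sketch: map ids of `images` to ids of `other_images`
    for rows that share the same value in column `on`."""
    other_lookup = {row[on]: other_id for other_id, row in other_images.items()}
    return {
        image_id: other_lookup[row[on]]
        for image_id, row in images.items()
        if row[on] in other_lookup
    }


# Illustrative rows only
images = {
    139: {"relative_path": "Images/valid/000000000139.jpg"},
    352582: {"relative_path": "Images/valid/000000352582.jpg"},
}
other_images = {
    10: {"relative_path": "Images/valid/000000000139.jpg"},
    11: {"relative_path": "Images/valid/000000000285.jpg"},
}
index_mapping = build_index_mapping(images, other_images, on="relative_path")
print(index_mapping)  # {139: 10}
```

Rows without a match (here image 352582) are left out of the mapping, which is where an optional reset to a range index would come into play.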

Dataset merge#

Regular merge#

Here, we divide COCO in two and merge them again to show how it works

[31]:

half1 = COCO_dataset[::2]
half2 = COCO_dataset[1::2]

[32]:

from lours.utils.testing import assert_dataset_equal

merged_back = half1 + half2
display(merged_back)
display(COCO_dataset)
assert_dataset_equal(COCO_dataset, merged_back)

Merge with `ignore_index`#

The merge method can be used with ignore_index when image ids overlap.

[33]:

half1 = half1.reset_index()
half2 = half2.reset_index()

[34]:

merged_back = half1.merge(half2, ignore_index=True)
assert_dataset_equal(merged_back, COCO_dataset, ignore_index=True)
merged_back

Merging with overlapping ids#

If your datasets have images with overlapping ids, they can still be merged as long as the overlapping rows are exactly the same in both datasets.

[35]:

half1 = Dataset.from_template(
    COCO_dataset, annotations=COCO_dataset.annotations.iloc[::2]
)
display(half1)
half2 = Dataset.from_template(
    COCO_dataset, annotations=COCO_dataset.annotations.iloc[1::2]
)
display(half2)
merged_back = half1 + half2
assert_dataset_equal(COCO_dataset, merged_back)

Merging datasets with overlapping ids can be turned off by setting allow_overlapping_image_ids to False.

[36]:

half1.merge(half2, allow_overlapping_image_ids=False)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[36], line 1
----> 1 half1.merge(half2, allow_overlapping_image_ids=False)

File ~/workspace/Bamboo/lours/dataset/dataset.py:2803, in Dataset.merge(self, other, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
   2344 """Merge two datasets and return a unique dataset object containing
   2345 Samples from both. Result's images_root will be the common path of both
   2346 datasets, and the image relative paths will be updated accordingly.
   (...)
   2799
   2800 """
   2801 from .merge import merge_datasets
-> 2803 return merge_datasets(
   2804     self,
   2805     other,
   2806     allow_overlapping_image_ids=allow_overlapping_image_ids,
   2807     realign_label_map=realign_label_map,
   2808     ignore_index=ignore_index,
   2809     mark_origin=mark_origin,
   2810     overwrite_origin=overwrite_origin,
   2811 )

File ~/workspace/Bamboo/lours/dataset/merge.py:167, in merge_datasets(dataset1, dataset2, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
    164 mutual_images_columns = dataset1_images_columns & dataset2_images_columns
    166 if mutual_images_ids and not allow_overlapping_image_ids:
--> 167     raise ValueError(
    168         "Overlapping image ids not permitted. Consider using the"
    169         " allow_overlapping_image_ids or ignore_index options"
    170     )
    172 assert_frame_intersections_equal(
    173     dataset1_images.drop(["origin", "origin_id"], axis=1, errors="ignore"),
    174     dataset2_images.drop(["origin", "origin_id"], axis=1, errors="ignore"),
    175 )
    177 # Concat horizontally by extending images from dataset1 with columns from dataset2
    178 # and then vertically by extending images with dataset2 images which id is not
    179 # in dataset1 images index.

ValueError: Overlapping image ids not permitted. Consider using the allow_overlapping_image_ids or ignore_index options
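The two rules at play (overlapping ids are accepted only if allowed, and only if the shared rows are identical) can be summarised in a short plain-Python sketch. `merge_images_sketch` is a hypothetical helper loosely modelled on the checks visible in the traceback above. It is not the lours merge code.

```python
def merge_images_sketch(images1, images2, allow_overlapping_image_ids=True):
    """Hypothetical sketch of merging two {image id: row} tables."""
    overlap = images1.keys() & images2.keys()
    if overlap and not allow_overlapping_image_ids:
        raise ValueError("Overlapping image ids not permitted")
    for image_id in overlap:
        # Shared ids are only acceptable when they describe the same image
        if images1[image_id] != images2[image_id]:
            raise ValueError(f"Image {image_id} differs between the two datasets")
    return {**images1, **images2}


first = {1: "a.jpg", 2: "b.jpg"}
second = {2: "b.jpg", 3: "c.jpg"}
merged_images = merge_images_sketch(first, second)
print(merged_images)  # {1: 'a.jpg', 2: 'b.jpg', 3: 'c.jpg'}

try:
    merge_images_sketch(first, second, allow_overlapping_image_ids=False)
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
print(overlap_rejected)  # True
```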

Incompatible Label maps#

Two label maps are incompatible when they give different names to the same class id. In that case, neither label map can be seen as an extension of the other, and the merge fails.

[37]:

new_label_map = {**COCO_pascal.label_map, **{1: "something else"}}
COCO_incompatible = COCO_pascal.from_template(label_map=new_label_map)

[38]:

COCO_pascal.merge(COCO_incompatible)

---------------------------------------------------------------------------
IncompatibleLabelMapsError                Traceback (most recent call last)
Cell In[38], line 1
----> 1 COCO_pascal.merge(COCO_incompatible)

File ~/workspace/Bamboo/lours/dataset/dataset.py:2803, in Dataset.merge(self, other, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
   2344 """Merge two datasets and return a unique dataset object containing
   2345 Samples from both. Result's images_root will be the common path of both
   2346 datasets, and the image relative paths will be updated accordingly.
   (...)
   2799
   2800 """
   2801 from .merge import merge_datasets
-> 2803 return merge_datasets(
   2804     self,
   2805     other,
   2806     allow_overlapping_image_ids=allow_overlapping_image_ids,
   2807     realign_label_map=realign_label_map,
   2808     ignore_index=ignore_index,
   2809     mark_origin=mark_origin,
   2810     overwrite_origin=overwrite_origin,
   2811 )

File ~/workspace/Bamboo/lours/dataset/merge.py:138, in merge_datasets(dataset1, dataset2, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
    134     label_map = merge_label_maps(
    135         dataset1.label_map, dataset2.label_map, method="outer"
    136     )
    137 else:
--> 138     label_map = merge_label_maps(
    139         dataset1.label_map, dataset2.label_map, method="outer"
    140     )
    142 dataset1_images, dataset2_images, booleanized_image_columns = (
    143     broadcast_booleanization(
    144         dataset1.images,
   (...)
    148     )
    149 )
    150 dataset1_annotations, dataset2_annotations, booleanized_annotations_columns = (
    151     broadcast_booleanization(
    152         dataset1.annotations,
   (...)
    156     )
    157 )

File ~/workspace/Bamboo/lours/utils/label_map_merger.py:66, in merge_label_maps(left, right, method)
     64     # The other way around when the other dataset's label map is the biggest
     65     if {k: left[k] for k in intersection} != {k: right[k] for k in intersection}:
---> 66         raise IncompatibleLabelMapsError("Label maps are incompatible")
     67     return left | right
     68 else:

IncompatibleLabelMapsError: Label maps are incompatible

If we look up the label map of COCO_incompatible, we can see that the class labels are not the same for class id 1 (aeroplane vs something else).

[39]:

for k, name in COCO_pascal.label_map.items():
    other_name = COCO_incompatible.label_map.get(k)
    if other_name is not None and other_name != name:
        print(
            f"Incompatible label map for category_id {k} : '{name}' vs '{other_name}'"
        )

Incompatible label map for category_id 1 : 'aeroplane' vs 'something else'
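The compatibility rule can be condensed into a small function. The sketch below is modelled on the lines of merge_label_maps visible in the traceback above, but it is a simplified, hypothetical version: it ignores the method argument, and the exception class is redefined locally so the example is self-contained.

```python
class IncompatibleLabelMapsError(ValueError):
    """Local stand-in for the exception raised by lours."""


def merge_label_maps_sketch(left, right):
    """Union of two label maps, provided shared ids carry the same name."""
    intersection = left.keys() & right.keys()
    if {k: left[k] for k in intersection} != {k: right[k] for k in intersection}:
        raise IncompatibleLabelMapsError("Label maps are incompatible")
    return {**left, **right}


compatible = merge_label_maps_sketch(
    {1: "aeroplane", 2: "bicycle"}, {2: "bicycle", 3: "bird"}
)
print(compatible)  # {1: 'aeroplane', 2: 'bicycle', 3: 'bird'}

try:
    merge_label_maps_sketch({1: "aeroplane"}, {1: "something else"})
    error_message = None
except IncompatibleLabelMapsError as error:
    error_message = str(error)
print(error_message)  # Label maps are incompatible
```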

Automatic remapping#

It is possible though to remap a dataset to match another dataset’s label map by retrieving categories with the same names.

We can either use the remap_from_other method or directly use the addition, as it will fall back to the automatic remapping with a warning.

Note that the merge is effective but you should avoid this fallback mechanism if possible, because label names are not supposed to be used as ids.

[40]:

remapped = COCO_incompatible.remap_from_other(COCO_pascal)
merged = COCO_pascal.merge(remapped)
merged

Using the following class remapping dictionary :
{1: 21,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20}

[41]:

merged = COCO_incompatible + COCO_pascal
merged

Using the following class remapping dictionary :
{1: 21,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20}

/Users/clement.pinard/workspace/Bamboo/lours/dataset/dataset.py:2843: RuntimeWarning: Addition failed because of incompatible label maps, trying to remap classes of right value and retry the merge
  warn(

Adding annotations to dataset#

Standalone annotation addition#

Similar to the former pandas.DataFrame.append (removed in pandas 2.0), you can append one annotation row to your annotations dataframe.

Notice the format_string option, which lets the method take care of the conversion itself. See lours.utils.bbox_converter for the naming conventions. For example, YOLO bounding boxes give the box center x and y coordinates plus the box width and height, all normalized by the frame size. The format is thus cxcywh.

First, create a dataset with 2 images and no annotations.

[42]:

empty = COCO_pascal.loc_annot[[]].iloc[:2]
display(empty)

Here, we add one bounding box to the first image. The box covers a quarter of the image (half the height and half the width) and sits in the bottom-right corner of the image, since its normalized center is at (0.75, 0.75).

[43]:

empty.add_detection_annotation(
    format_string="cxcywh",
    image_id=352582,
    bbox_coordinates=[0.75, 0.75, 0.5, 0.5],
    confidence=0.5,
    category_id=20,
)
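To see what the conversion amounts to, here is the arithmetic for this box written out in plain Python. `cxcywh_to_xywh_pixels` is a hypothetical helper, not the function from lours.utils.bbox_converter. It assumes normalized center/size coordinates as input and pixel top-left/size coordinates as output, and uses the size of image 352582 shown earlier (425 x 640).

```python
def cxcywh_to_xywh_pixels(cx, cy, w, h, image_width, image_height):
    """Convert a normalized (center x, center y, width, height) box
    into pixel (x_min, y_min, width, height)."""
    box_width = w * image_width
    box_height = h * image_height
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    return x_min, y_min, box_width, box_height


box = cxcywh_to_xywh_pixels(0.75, 0.75, 0.5, 0.5, image_width=425, image_height=640)
print(box)  # (212.5, 320.0, 212.5, 320.0)
```

The box starts at the middle of the image on both axes and extends to the right and bottom edges, which is the bottom-right quarter.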

Introducing the AnnotationAppender context manager#

As with the former pandas.DataFrame.append, calling this method multiple times is discouraged, because each call creates a new dataframe with only one more row.

What you can do instead is use the annotation_append method as a context manager. This appender caches all the added annotations and only appends the consolidated data when exiting the context.

This is very useful when running inference on a whole dataset.

Note that this operation is done in place!

[44]:

with empty.annotation_append(format_string="cxcywh") as appender:
    appender.append(
        image_id=352582,
        bbox_coordinates=[0.75, 0.75, 0.5, 0.5],
        confidence=0.5,
        category_id=20,
    )
    appender.append(
        image_id=113354,
        bbox_coordinates=[0.25, 0.25, 0.5, 0.5],
        confidence=0.5,
        category_id=21,
    )
    print(empty.len_annot())  # Note that the dataset is not changed here

display(empty)

/Users/clement.pinard/workspace/Bamboo/lours/dataset/dataset.py:1004: UserWarning: Incomplete Label map, setting following label of the following id to their string equivalent : {21}
  warn(
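The buffering pattern behind such an appender can be sketched with the standard library alone. `TinyDataset` below is a hypothetical toy class, not the lours Dataset: annotations are collected in a list inside the with block and written to the dataset in a single step on exit.

```python
from contextlib import contextmanager


class TinyDataset:
    """Hypothetical toy dataset holding a list of annotation dicts."""

    def __init__(self):
        self.annotations = []

    @contextmanager
    def annotation_append(self):
        buffer = []
        yield buffer  # the caller appends to the buffer inside the with block
        # Single consolidated write, only reached if the block did not raise
        self.annotations = self.annotations + buffer


dataset = TinyDataset()
with dataset.annotation_append() as appender:
    appender.append({"image_id": 352582, "category_id": 20})
    appender.append({"image_id": 113354, "category_id": 21})
    inside = len(dataset.annotations)  # nothing has been written yet
after = len(dataset.annotations)
print(inside, after)  # 0 2
```

Deferring the write keeps the cost of adding many rows proportional to one concatenation rather than one per annotation.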