Dataset 101 : Mechanics#
This notebook makes use of lours’s Dataset object. It shows how to load a dataset from a known format, how to merge two datasets, how to remap classes, and how to write the result to disk in a chosen format.
[1]:
%load_ext autoreload
%autoreload 2
from lours.dataset import Dataset, from_coco
from lours.utils.testing import assert_dataset_equal
Here we load the COCO validation set from the notebook data folder. Note that you can also load datasets in the cAIpy and darknet formats.
[2]:
COCO_dataset = from_coco("notebook_data/coco_valid.json")
[3]:
COCO_dataset
Dataset Sampling#
You can use the .loc[] or .iloc[] interface to sample sub-datasets at the image level. To sample at the annotation level, use the .loc_annot[] and .iloc_annot[] interfaces.
Notes:
For iloc, image indices are not considered, only the row number (like in pandas.DataFrame.iloc), so you might want to reorder the images first, or use loc, which uses indices.
Calling a single number, e.g. dataset[0], will give you a dataset of only one image, but it will still be a dataset object with two dataframes.
Images are never loaded by the dataset object itself; you need to load them yourself in your pipeline.
The [] method is equivalent to iloc[].
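The difference between position-based and label-based selection can be seen on a plain pandas dataframe. This is a minimal sketch that only uses pandas; the toy `images` frame and its ids are made up for illustration and are not part of lours.

```python
import pandas as pd

# Toy images table; the index holds (unsorted) image ids, as in a COCO dataset.
images = pd.DataFrame(
    {"width": [425, 640, 500]},
    index=pd.Index([352582, 113354, 58393], name="id"),
)

# iloc selects by row position, whatever the ids are.
first_row = images.iloc[[0]]

# loc selects by index value, whatever the row order is.
same_row = images.loc[[352582]]

# Sorting the index changes which row sits at position 0,
# but not which row id 352582 refers to.
sorted_images = images.sort_index()
first_sorted_id = sorted_images.iloc[[0]].index[0]
```

This is why reordering the images matters before using iloc, while loc is unaffected by row order.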
Image based sampling#
[4]:
# Only taking 50% of the images
COCO_dataset.iloc[::2]
[5]:
ids_to_keep = COCO_dataset.images.index[COCO_dataset.images.index > 30_000]
print(ids_to_keep)
COCO_dataset.loc[ids_to_keep]
Index([352582, 113354, 58393, 147729, 310072, 50149, 519208, 356125, 38048,
567825,
...
166478, 185409, 577976, 189806, 363188, 311180, 302030, 105455, 428280,
349837],
dtype='int64', name='id', length=4722)
This is equivalent to using the filter_images method with the loc mode.
[6]:
COCO_dataset.filter_images(ids_to_keep, mode="loc")
Annotation based sampling#
Remove half the annotations
[7]:
COCO_dataset.iloc_annot[::2]
Remove half the annotations, remove images emptied of annotations (but keep the ones that were already empty)
[8]:
to_keep = COCO_dataset.annotations.index[::2]
filtered = COCO_dataset.filter_annotations(
to_keep, mode="loc", remove_emptied_images=True
)
display(filtered)
You can also use slice(None, None, 2) with the iloc mode
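As a reminder, `slice(None, None, 2)` is the object Python builds for the `[::2]` syntax, which is why both forms select the same rows. A quick check on a plain list (the `rows` list is only there for illustration):

```python
rows = list(range(10))

# The subscript syntax and the explicit slice object are interchangeable.
every_other = rows[::2]
with_slice_object = rows[slice(None, None, 2)]
```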
[9]:
filtered_2 = COCO_dataset.filter_annotations(
slice(None, None, 2), mode="iloc", remove_emptied_images=True
)
assert_dataset_equal(filtered, filtered_2)
Iterating through the dataset#
You can iterate through the dataset
[10]:
for single_image_dataset in COCO_dataset[:2]:
    display(single_image_dataset)
The iter_images method lets you get the image data and the annotations dataframe directly, instead of Dataset objects with a single image.
[11]:
for image, annotations in COCO_dataset[:2].iter_images():
    print(image)
    display(annotations)
width 425
height 640
relative_path Images/valid/000000352582.jpg
type .jpg
split valid
Name: 352582, dtype: object
| id | image_id | category_str | category_id | split | box_x_min | box_y_min | box_width | box_height | area |
|---|---|---|---|---|---|---|---|---|---|
| 460450 | 352582 | person | 1 | valid | 112.43 | 195.32 | 214.78 | 438.19 | 48685.6791 |
| 535917 | 352582 | person | 1 | valid | 0.00 | 256.00 | 80.54 | 376.81 | 22650.7380 |
| 602093 | 352582 | frisbee | 34 | valid | 171.63 | 424.03 | 85.89 | 40.67 | 2605.7209 |
width 640
height 480
relative_path Images/valid/000000113354.jpg
type .jpg
split valid
Name: 113354, dtype: object
| id | image_id | category_str | category_id | split | box_x_min | box_y_min | box_width | box_height | area |
|---|---|---|---|---|---|---|---|---|---|
| 589077 | 113354 | zebra | 24 | valid | 260.99 | 158.88 | 141.52 | 194.11 | 9978.94125 |
| 589740 | 113354 | zebra | 24 | valid | 366.49 | 174.59 | 115.67 | 142.71 | 5784.68620 |
| 592005 | 113354 | zebra | 24 | valid | 3.24 | 151.28 | 265.34 | 175.82 | 16206.37480 |
[12]:
image, annotations = COCO_dataset[:2].get_one_frame(0)
print(image)
display(annotations)
width 425
height 640
relative_path Images/valid/000000352582.jpg
type .jpg
split valid
Name: 352582, dtype: object
| id | image_id | category_str | category_id | split | box_x_min | box_y_min | box_width | box_height | area |
|---|---|---|---|---|---|---|---|---|---|
| 460450 | 352582 | person | 1 | valid | 112.43 | 195.32 | 214.78 | 438.19 | 48685.6791 |
| 535917 | 352582 | person | 1 | valid | 0.00 | 256.00 | 80.54 | 376.81 | 22650.7380 |
| 602093 | 352582 | frisbee | 34 | valid | 171.63 | 424.03 | 85.89 | 40.67 | 2605.7209 |
Remap classes#
Here we use the COCO -> Pascal preset to convert COCO classes into Pascal VOC’s label set.
[13]:
COCO_dataset.label_map
[13]:
{1: 'person',
2: 'bicycle',
3: 'car',
4: 'motorcycle',
5: 'airplane',
6: 'bus',
7: 'train',
8: 'truck',
9: 'boat',
10: 'traffic light',
11: 'fire hydrant',
13: 'stop sign',
14: 'parking meter',
15: 'bench',
16: 'bird',
17: 'cat',
18: 'dog',
19: 'horse',
20: 'sheep',
21: 'cow',
22: 'elephant',
23: 'bear',
24: 'zebra',
25: 'giraffe',
27: 'backpack',
28: 'umbrella',
31: 'handbag',
32: 'tie',
33: 'suitcase',
34: 'frisbee',
35: 'skis',
36: 'snowboard',
37: 'sports ball',
38: 'kite',
39: 'baseball bat',
40: 'baseball glove',
41: 'skateboard',
42: 'surfboard',
43: 'tennis racket',
44: 'bottle',
46: 'wine glass',
47: 'cup',
48: 'fork',
49: 'knife',
50: 'spoon',
51: 'bowl',
52: 'banana',
53: 'apple',
54: 'sandwich',
55: 'orange',
56: 'broccoli',
57: 'carrot',
58: 'hot dog',
59: 'pizza',
60: 'donut',
61: 'cake',
62: 'chair',
63: 'couch',
64: 'potted plant',
65: 'bed',
67: 'dining table',
70: 'toilet',
72: 'tv',
73: 'laptop',
74: 'mouse',
75: 'remote',
76: 'keyboard',
77: 'cell phone',
78: 'microwave',
79: 'oven',
80: 'toaster',
81: 'sink',
82: 'refrigerator',
84: 'book',
85: 'clock',
86: 'vase',
87: 'scissors',
88: 'teddy bear',
89: 'hair drier',
90: 'toothbrush'}
[14]:
COCO_pascal = COCO_dataset.remap_from_preset("coco", "pascalvoc")
See how the label map tab has changed.
[15]:
COCO_pascal
Remap from dictionaries#
Fictional use case where we only want animals, vehicles and objects. If given, new_names must have one entry for each distinct value in class_mapping.
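The constraint on `new_names` can be checked before calling the remapping. The helper below is a hypothetical sketch in plain Python, not part of lours, which may validate its inputs differently: it compares the ids used as values in `class_mapping` with the keys of `new_names`.

```python
def check_new_names(class_mapping: dict[int, int], new_names: dict[int, str]) -> None:
    """Raise if new_names does not name exactly the output ids of class_mapping.

    Hypothetical helper, for illustration only.
    """
    output_ids = set(class_mapping.values())
    missing = output_ids - set(new_names)
    unused = set(new_names) - output_ids
    if missing or unused:
        raise ValueError(
            f"new_names mismatch: missing names for {sorted(missing)}, "
            f"unused names for {sorted(unused)}"
        )


# Three distinct output ids, so three names are expected: this call passes.
check_new_names({1: 2, 3: 1, 5: 3}, {1: "Animal", 2: "Vehicle", 3: "Object"})
```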
[16]:
COCO_RT = COCO_pascal.remap_classes(
class_mapping={
1: 2,
2: 2,
3: 1,
4: 2,  # boat is a vehicle, as in the dataframe version below
5: 3,
6: 2,
7: 2,
8: 1,
9: 3,
10: 1,
11: 3,
12: 1,
13: 1,
14: 2,
16: 3,
17: 1,
18: 3,
19: 2,
20: 3,
},
new_names={1: "Animal", 2: "Vehicle", 3: "Object"},
)
[17]:
COCO_RT
Remap from dataframe#
The dataframe used for remapping must have at least 2 columns: input_category_id and output_category_id.
If available, output_category_name will be used to replace the names of remapped ids.
input_category_name only serves an informative purpose.
[18]:
import pandas as pd
class_table = (
pd.Series(COCO_pascal.label_map).rename("input_category_name").sort_index()
)
class_table.index.rename("input_category_id", inplace=True)
class_table = class_table.reset_index().drop(15)
class_table["output_category_id"] = [
2,
2,
1,
2,
3,
2,
2,
1,
3,
1,
3,
1,
1,
2,
3,
1,
3,
2,
3,
]
class_table["output_category_name"] = class_table["output_category_id"].replace(
{1: "animal", 2: "vehicle", 3: "object"}
)
[19]:
class_table
[19]:
| | input_category_id | input_category_name | output_category_id | output_category_name |
|---|---|---|---|---|
| 0 | 1 | aeroplane | 2 | vehicle |
| 1 | 2 | bicycle | 2 | vehicle |
| 2 | 3 | bird | 1 | animal |
| 3 | 4 | boat | 2 | vehicle |
| 4 | 5 | bottle | 3 | object |
| 5 | 6 | bus | 2 | vehicle |
| 6 | 7 | car | 2 | vehicle |
| 7 | 8 | cat | 1 | animal |
| 8 | 9 | chair | 3 | object |
| 9 | 10 | cow | 1 | animal |
| 10 | 11 | diningtable | 3 | object |
| 11 | 12 | dog | 1 | animal |
| 12 | 13 | horse | 1 | animal |
| 13 | 14 | motorbike | 2 | vehicle |
| 14 | 15 | person | 3 | object |
| 16 | 17 | sheep | 1 | animal |
| 17 | 18 | sofa | 3 | object |
| 18 | 19 | train | 2 | vehicle |
| 19 | 20 | tvmonitor | 3 | object |
[20]:
COCO_RT_DF = COCO_pascal.remap_from_dataframe(class_table)
[21]:
COCO_RT_DF
Remap from CSV#
Basically the same as remap from dataframe, except the input is a csv file with the same data
[22]:
csv_file = "remap.csv"
class_table.to_csv(csv_file, index=False)
[23]:
!cat remap.csv
input_category_id,input_category_name,output_category_id,output_category_name
1,aeroplane,2,vehicle
2,bicycle,2,vehicle
3,bird,1,animal
4,boat,2,vehicle
5,bottle,3,object
6,bus,2,vehicle
7,car,2,vehicle
8,cat,1,animal
9,chair,3,object
10,cow,1,animal
11,diningtable,3,object
12,dog,1,animal
13,horse,1,animal
14,motorbike,2,vehicle
15,person,3,object
17,sheep,1,animal
18,sofa,3,object
19,train,2,vehicle
20,tvmonitor,3,object
[24]:
COCO_RT_CSV = COCO_pascal.remap_from_csv(csv_file)
[25]:
COCO_RT_CSV
Remap from other dataset#
This method will try to retrieve the label names in the other dataset and apply a remapping accordingly.
Classes that are not in the other dataset are mapped to a free id with respect to the other dataset’s label map.
[26]:
COCO_RT_other = COCO_pascal.remap_from_other(COCO_RT_CSV)
COCO_RT_other
Using the following class remapping dictionary :
{1: 22,
2: 21,
3: 23,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20}
Dataset Reindexing#
Resetting index#
The reset_index method allows you to reorder the dataset’s dataframes according to some column values
[27]:
COCO_dataset.reset_index()
Sort the annotations by category string first, so that the dataframe starts with airplanes and ends with zebras.
[28]:
reset_COCO_dataset = COCO_dataset.reset_index(
start_image_id=10,
start_annotations_id=2,
sort_annotations_by=("category_str", "image_id"),
)
reset_COCO_dataset
Reindex with mapping#
Akin to class remapping, you can also remap the dataset’s dataframe indexes with dictionaries. Note that unmapped index values will be reset to a range index, but they will not be sorted. Be sure to sort the dataframes the way you want before calling the method reset_index_from_mapping with an incomplete index mapping.
[29]:
COCO_dataset.reset_index_from_mapping(
images_index_map={58393: 0}, annotations_index_map={331107: 0}
)
Reindex images index from other dataframe#
This feature is similar to pandas’ merge function: by selecting columns to merge on, the dataset constructs an index mapping for entries that are in both the original images dataframe and the other dataframe, and optionally remaps the remaining rows to a simple range index.
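The idea can be sketched with plain dictionaries: build a lookup from the matching column to the other table's index, then translate each id. The names `images`, `other_images` and `build_index_mapping` are made up for this sketch, and the actual `match_index` implementation may differ.

```python
def build_index_mapping(
    images: dict[int, str], other_images: dict[int, str]
) -> dict[int, int]:
    """Map ids of `images` to ids of `other_images` when their paths match.

    Both arguments map an image id to its relative path. Ids whose path is
    absent from `other_images` are left out of the mapping.
    """
    path_to_other_id = {path: other_id for other_id, path in other_images.items()}
    return {
        image_id: path_to_other_id[path]
        for image_id, path in images.items()
        if path in path_to_other_id
    }


images = {352582: "a.jpg", 113354: "b.jpg", 58393: "c.jpg"}
other_images = {10: "c.jpg", 11: "a.jpg"}
mapping = build_index_mapping(images, other_images)
print(mapping)  # {352582: 11, 58393: 10}
```

Image 113354 has no counterpart in `other_images`, so it is the kind of row that would optionally be remapped to a range index.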
[30]:
matched_COCO = COCO_dataset.match_index(reset_COCO_dataset.images, on="relative_path")
display(matched_COCO)
matched_COCO.images.sort_index()
[30]:
| id | width | height | relative_path | type | split |
|---|---|---|---|---|---|
| 10 | 640 | 426 | Images/valid/000000000139.jpg | .jpg | valid |
| 11 | 586 | 640 | Images/valid/000000000285.jpg | .jpg | valid |
| 12 | 640 | 483 | Images/valid/000000000632.jpg | .jpg | valid |
| 13 | 375 | 500 | Images/valid/000000000724.jpg | .jpg | valid |
| 14 | 428 | 640 | Images/valid/000000000776.jpg | .jpg | valid |
| ... | ... | ... | ... | ... | ... |
| 5005 | 640 | 354 | Images/valid/000000581317.jpg | .jpg | valid |
| 5006 | 612 | 612 | Images/valid/000000581357.jpg | .jpg | valid |
| 5007 | 640 | 427 | Images/valid/000000581482.jpg | .jpg | valid |
| 5008 | 478 | 640 | Images/valid/000000581615.jpg | .jpg | valid |
| 5009 | 640 | 478 | Images/valid/000000581781.jpg | .jpg | valid |
5000 rows × 5 columns
Dataset merge#
Regular merge#
Here, we divide COCO in two and merge them again to show how it works
[31]:
half1 = COCO_dataset[::2]
half2 = COCO_dataset[1::2]
[32]:
from lours.utils.testing import assert_dataset_equal
merged_back = half1 + half2
display(merged_back)
display(COCO_dataset)
assert_dataset_equal(COCO_dataset, merged_back)
Merge with ignore_index#
The merge function can be used with ignore_index when image ids are overlapping.
[33]:
half1 = half1.reset_index()
half2 = half2.reset_index()
[34]:
merged_back = half1.merge(half2, ignore_index=True)
assert_dataset_equal(merged_back, COCO_dataset, ignore_index=True)
merged_back
Merging with overlapping ids#
If your datasets have images with overlapping ids, they can still be merged as long as the overlapping subsets are exactly the same.
[35]:
half1 = Dataset.from_template(
COCO_dataset, annotations=COCO_dataset.annotations.iloc[::2]
)
display(half1)
half2 = Dataset.from_template(
COCO_dataset, annotations=COCO_dataset.annotations.iloc[1::2]
)
display(half2)
merged_back = half1 + half2
assert_dataset_equal(COCO_dataset, merged_back)
Merging overlapping ids can be turned off by setting allow_overlapping_image_ids to False.
[36]:
half1.merge(half2, allow_overlapping_image_ids=False)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[36], line 1
----> 1 half1.merge(half2, allow_overlapping_image_ids=False)
File ~/workspace/Bamboo/lours/dataset/dataset.py:2803, in Dataset.merge(self, other, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
2344 """Merge two datasets and return a unique dataset object containing
2345 Samples from both. Result's images_root will be the common path of both
2346 datasets, and the image relative paths will be updated accordingly.
(...)
2799
2800 """
2801 from .merge import merge_datasets
-> 2803 return merge_datasets(
2804 self,
2805 other,
2806 allow_overlapping_image_ids=allow_overlapping_image_ids,
2807 realign_label_map=realign_label_map,
2808 ignore_index=ignore_index,
2809 mark_origin=mark_origin,
2810 overwrite_origin=overwrite_origin,
2811 )
File ~/workspace/Bamboo/lours/dataset/merge.py:167, in merge_datasets(dataset1, dataset2, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
164 mutual_images_columns = dataset1_images_columns & dataset2_images_columns
166 if mutual_images_ids and not allow_overlapping_image_ids:
--> 167 raise ValueError(
168 "Overlapping image ids not permitted. Consider using the"
169 " allow_overlapping_image_ids or ignore_index options"
170 )
172 assert_frame_intersections_equal(
173 dataset1_images.drop(["origin", "origin_id"], axis=1, errors="ignore"),
174 dataset2_images.drop(["origin", "origin_id"], axis=1, errors="ignore"),
175 )
177 # Concat horizontally by extending images from dataset1 with columns from dataset2
178 # and then vertically by extending images with dataset2 images which id is not
179 # in dataset1 images index.
ValueError: Overlapping image ids not permitted. Consider using the allow_overlapping_image_ids or ignore_index options
Incompatible Label maps#
When two datasets give different names to the same category id, their label maps are incompatible and the merge raises an IncompatibleLabelMapsError.
[37]:
new_label_map = {**COCO_pascal.label_map, **{1: "something else"}}
COCO_incompatible = COCO_pascal.from_template(label_map=new_label_map)
[38]:
COCO_pascal.merge(COCO_incompatible)
---------------------------------------------------------------------------
IncompatibleLabelMapsError Traceback (most recent call last)
Cell In[38], line 1
----> 1 COCO_pascal.merge(COCO_incompatible)
File ~/workspace/Bamboo/lours/dataset/dataset.py:2803, in Dataset.merge(self, other, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
2344 """Merge two datasets and return a unique dataset object containing
2345 Samples from both. Result's images_root will be the common path of both
2346 datasets, and the image relative paths will be updated accordingly.
(...)
2799
2800 """
2801 from .merge import merge_datasets
-> 2803 return merge_datasets(
2804 self,
2805 other,
2806 allow_overlapping_image_ids=allow_overlapping_image_ids,
2807 realign_label_map=realign_label_map,
2808 ignore_index=ignore_index,
2809 mark_origin=mark_origin,
2810 overwrite_origin=overwrite_origin,
2811 )
File ~/workspace/Bamboo/lours/dataset/merge.py:138, in merge_datasets(dataset1, dataset2, allow_overlapping_image_ids, realign_label_map, ignore_index, mark_origin, overwrite_origin)
134 label_map = merge_label_maps(
135 dataset1.label_map, dataset2.label_map, method="outer"
136 )
137 else:
--> 138 label_map = merge_label_maps(
139 dataset1.label_map, dataset2.label_map, method="outer"
140 )
142 dataset1_images, dataset2_images, booleanized_image_columns = (
143 broadcast_booleanization(
144 dataset1.images,
(...)
148 )
149 )
150 dataset1_annotations, dataset2_annotations, booleanized_annotations_columns = (
151 broadcast_booleanization(
152 dataset1.annotations,
(...)
156 )
157 )
File ~/workspace/Bamboo/lours/utils/label_map_merger.py:66, in merge_label_maps(left, right, method)
64 # The other way around when the other dataset's label map is the biggest
65 if {k: left[k] for k in intersection} != {k: right[k] for k in intersection}:
---> 66 raise IncompatibleLabelMapsError("Label maps are incompatible")
67 return left | right
68 else:
IncompatibleLabelMapsError: Label maps are incompatible
If we compare both label maps, we can see that the class labels are not the same for class id 1 ('aeroplane' vs 'something else').
[39]:
for k, name in COCO_pascal.label_map.items():
    other_name = COCO_incompatible.label_map.get(k)
    if other_name is not None and other_name != name:
        print(
            f"Incompatible label map for category_id {k} : '{name}' vs '{other_name}'"
        )
Incompatible label map for category_id 1 : 'aeroplane' vs 'something else'
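The check that fails above boils down to comparing the names of the ids shared by both label maps, as the traceback shows. A minimal standalone sketch (the function name `label_maps_compatible` is hypothetical, and the toy label maps are made up):

```python
def label_maps_compatible(left: dict[int, str], right: dict[int, str]) -> bool:
    """Return True when both label maps give the same name to every shared id."""
    shared_ids = left.keys() & right.keys()
    return all(left[k] == right[k] for k in shared_ids)


pascal_like = {1: "aeroplane", 2: "bicycle"}
extended = {**pascal_like, 3: "bird"}  # only adds an id: compatible
conflicting = {**pascal_like, 1: "something else"}  # renames id 1: incompatible

print(label_maps_compatible(pascal_like, extended))  # True
print(label_maps_compatible(pascal_like, conflicting))  # False
```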
Automatic remapping#
It is possible though to remap a dataset to match another dataset’s label map by retrieving categories with the same names.
We can either use the remap_from_other method or directly use the addition, as it will fall back to automatic remapping with a warning.
Note that the merge works, but you should avoid this fallback mechanism if possible, because label names are not supposed to be used as ids.
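To make the idea concrete, here is a rough sketch of name-based remapping in plain Python. It is not lours's implementation: the function name and the rule used to pick free ids (the next integers after the largest id in the other label map) are assumptions made for illustration.

```python
def remap_by_name(label_map: dict[int, str], other: dict[int, str]) -> dict[int, int]:
    """Build an id remapping so that labels shared by name get the other map's id.

    Labels unknown to `other` get a free id, assumed here to be the next
    integers after the largest id of `other`.
    """
    name_to_other_id = {name: other_id for other_id, name in other.items()}
    next_free_id = max(other, default=0) + 1
    remapping = {}
    for category_id, name in sorted(label_map.items()):
        if name in name_to_other_id:
            remapping[category_id] = name_to_other_id[name]
        else:
            remapping[category_id] = next_free_id
            next_free_id += 1
    return remapping


other = {1: "aeroplane", 2: "bicycle", 3: "bird"}
renamed = {1: "something else", 2: "bicycle", 3: "bird"}
print(remap_by_name(renamed, other))  # {1: 4, 2: 2, 3: 3}
```

The sketch also shows why relying on this is fragile: a typo in a label name silently creates a new class instead of matching the existing one.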
[40]:
remapped = COCO_incompatible.remap_from_other(COCO_pascal)
merged = COCO_pascal.merge(remapped)
merged
Using the following class remapping dictionary :
{1: 21,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20}
[41]:
merged = COCO_incompatible + COCO_pascal
merged
Using the following class remapping dictionary :
{1: 21,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20}
/Users/clement.pinard/workspace/Bamboo/lours/dataset/dataset.py:2843: RuntimeWarning: Addition failed because of incompatible label maps, trying to remap classes of right value and retry the merge
warn(
Adding annotations to dataset#
Standalone annotation addition#
Similar to pandas.DataFrame.append, you can append one annotation row to your annotations dataframe.
Notice the format_string option, which lets the method take care of the conversion itself. See lours.utils.bbox_converter for naming conventions. For example, YOLO bounding boxes give the box center x and y coordinates followed by the box width and height, all normalized by the frame size. The format is thus cxcywh.
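As an illustration of what such a conversion involves, the sketch below turns a normalized cxcywh box into absolute pixel coordinates (x_min, y_min, width, height). The function name is hypothetical and this is not the code from lours.utils.bbox_converter.

```python
def cxcywh_to_xywh(box, image_width, image_height):
    """Convert a normalized (center_x, center_y, width, height) box
    to absolute (x_min, y_min, width, height) pixel coordinates."""
    cx, cy, w, h = box
    box_width = w * image_width
    box_height = h * image_height
    x_min = cx * image_width - box_width / 2
    y_min = cy * image_height - box_height / 2
    return x_min, y_min, box_width, box_height


# A box centered in a 640x480 image and covering half of each dimension.
print(cxcywh_to_xywh([0.5, 0.5, 0.5, 0.5], image_width=640, image_height=480))
# (160.0, 120.0, 320.0, 240.0)
```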
First, create a dataset with 2 images and no annotation
[42]:
empty = COCO_pascal.loc_annot[[]].iloc[:2]
display(empty)
Here, we add one bounding box to the first image. The box covers a quarter of the image (half the height and half the width) and, being centered at (0.75, 0.75), sits in the bottom-right corner of the image.
[43]:
empty.add_detection_annotation(
format_string="cxcywh",
image_id=352582,
bbox_coordinates=[0.75, 0.75, 0.5, 0.5],
confidence=0.5,
category_id=20,
)
Introducing the AnnotationAppender context manager#
As with pandas.DataFrame.append, calling this method many times is discouraged, because each call creates a new dataframe with only one more row.
Instead, you can use the annotation_append method as a context manager. The appender caches all the added annotations and only appends the consolidated data when exiting the context.
This is very useful when running inference on a whole dataset.
Note that this operation is in place!
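The caching pattern can be sketched with a small context manager over a plain list. `BatchAppender` is a made-up class for illustration and does not reflect how lours implements its appender; it only shows why the target is unchanged inside the `with` block and updated once on exit.

```python
class BatchAppender:
    """Collect rows in a buffer and add them to `target` in one go on exit."""

    def __init__(self, target: list):
        self.target = target
        self.buffer = []

    def __enter__(self):
        return self

    def append(self, **row):
        # Nothing touches the target yet; rows are only cached.
        self.buffer.append(row)

    def __exit__(self, exc_type, exc_value, traceback):
        # Flush the cache once, but only if the block ended without error.
        if exc_type is None:
            self.target.extend(self.buffer)
        return False


annotations = []
with BatchAppender(annotations) as appender:
    appender.append(image_id=352582, category_id=20)
    appender.append(image_id=113354, category_id=21)
    size_inside = len(annotations)  # still 0 here
size_after = len(annotations)  # 2 once the context has exited
```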
[44]:
with empty.annotation_append(format_string="cxcywh") as appender:
    appender.append(
        image_id=352582,
        bbox_coordinates=[0.75, 0.75, 0.5, 0.5],
        confidence=0.5,
        category_id=20,
    )
    appender.append(
        image_id=113354,
        bbox_coordinates=[0.25, 0.25, 0.5, 0.5],
        confidence=0.5,
        category_id=21,
    )
    print(empty.len_annot())  # Note that the dataset is not changed here
display(empty)
0
/Users/clement.pinard/workspace/Bamboo/lours/dataset/dataset.py:1004: UserWarning: Incomplete Label map, setting following label of the following id to their string equivalent : {21}
warn(