merge
- Dataset.merge(other: Dataset, allow_overlapping_image_ids: bool = True, realign_label_map: bool = False, ignore_index: bool = False, mark_origin: bool = False, overwrite_origin: bool = False) → Dataset
Merge two datasets and return a single dataset object containing the samples from both. The result’s images_root will be the common path of both datasets’ images roots, and the image relative paths will be updated accordingly. The result’s label map will be the superset of both label maps, provided one is included in the other.
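The path handling described above can be sketched with the standard library alone. This is only an illustration of the idea, not the lours implementation: `rebase_on_common_root` is a hypothetical helper, and it assumes both roots are of the same kind (both relative or both absolute).

```python
import posixpath


def rebase_on_common_root(root_a, paths_a, root_b, paths_b):
    """Return (common_root, new_paths_a, new_paths_b).

    Hypothetical helper, not part of lours. Both roots must be of the same
    kind (both relative or both absolute), otherwise commonpath raises.
    """
    # commonpath returns "" when relative roots share no prefix; use "." then.
    common = posixpath.commonpath([root_a, root_b]) or "."

    def rebase(root, paths):
        # Path of the old root as seen from the new common root.
        prefix = posixpath.relpath(root, common)
        return [posixpath.normpath(posixpath.join(prefix, p)) for p in paths]

    return common, rebase(root_a, paths_a), rebase(root_b, paths_b)


common, new_a, new_b = rebase_on_common_root(
    "such/serious", ["help/me.jpeg"], "care/suggest", ["air/method.bmp"]
)
print(common)
print(new_a)
print(new_b)
```

With the roots used in the examples further down this page, the common root is "." and each relative path is prefixed with its former root, which matches the merged outputs shown there.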
Notes
- This function is also usable with the + operator.
- If possible, booleanized columns for images and annotations will be broadcast together. See lours.utils.column_booleanizer.broadcast_booleanization().
- If one of the datasets has an absolute path as images_root, the other dataset’s images root path will also be converted to absolute.
- If both datasets have the same name, the output will have the same name as well.
- If the datasets have different names, the output name will be the concatenation of both names, separated by a “+” sign. The merge output of “A” and “B” will thus be named “A+B”.
- If one dataset has no name (dataset.name is None), the output will take the name of the other.
- If mark_origin is selected, it will be effective only if the datasets have different actual names (not None).
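The naming rules in these notes can be summarized in a few lines of plain Python. The helper names `merged_name` and `origin_is_marked` are made up for this sketch and are not part of the lours API; they only restate the rules listed above.

```python
from typing import Optional


def merged_name(name_a: Optional[str], name_b: Optional[str]) -> Optional[str]:
    """Name of the merged dataset, following the notes above (hypothetical helper)."""
    if name_a is None:
        # One dataset has no name: take the name of the other (may be None too).
        return name_b
    if name_b is None or name_a == name_b:
        return name_a
    # Two different actual names are joined with a "+" sign.
    return f"{name_a}+{name_b}"


def origin_is_marked(
    name_a: Optional[str], name_b: Optional[str], mark_origin: bool
) -> bool:
    """mark_origin is effective only for two different, non-None names."""
    return mark_origin and None not in (name_a, name_b) and name_a != name_b


print(merged_name("A", "B"))
print(merged_name("A", None))
print(origin_is_marked("A", "A", mark_origin=True))
```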
- Parameters:
other – Other dataset to merge with. This dataset must be compatible with the first one, i.e. one label map is included in the other, and image and annotation ids are mutually exclusive between datasets (unless ignore_index is True)
allow_overlapping_image_ids – if set to True, will try to join images dataframes with overlapping ids. The whole rows (i.e. with values from columns present in both dataframes) must match, as well as the images_root. In that case, annotations with this image_id (from self or other) will be assumed to come from the same image. Defaults to True
realign_label_map – If set to True, will try to remap classes of other dataset to match this dataset’s label map, to avoid a potential error due to incompatible label maps. Defaults to False.
ignore_index – if set to True, will ignore overlapping ids for images and annotations and reset them. Will update the image_id column in the annotations accordingly. Note that this option makes allow_overlapping_image_ids useless. Defaults to False.
mark_origin – If set to True, and if both datasets have a different name, will add two columns “origin” and “origin_id” for images and annotations dataframes, indicating respectively the name of the origin dataset, and its id in the original dataset. Defaults to False.
overwrite_origin – If set to True, will overwrite already existing origin columns in the input datasets’ dataframes. Otherwise, will only mark origin if it’s not present. Defaults to False.
- Raises:
ValueError – Error if the two datasets are incompatible (see above)
- Returns:
Merged dataset.
Example
>>> from lours.utils.doc_utils import dummy_dataset
>>> example1 = dummy_dataset(2, 2, seed=0)
>>> example2 = dummy_dataset(2, 2, seed=1)
>>> example1
Dataset object containing 2 images and 2 objects
Name : inside_else_memory
Images root : such/serious
Images :
    width  height      relative_path   type  split
id
0     342     136       help/me.jpeg  .jpeg  train
1     377     167  whatever/wait.png   .png  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...   73.932999   71.552480   42.673983
1          0          why           19  ...    4.567638  248.551257  122.602211
[2 rows x 8 columns]
Label map : {15: 'step', 19: 'why', 25: 'interview'}
>>> example2
Dataset object containing 2 images and 2 objects
Name : shake_effort_many
Images root : care/suggest
Images :
    width  height        relative_path  type  split
id
0     955     229  determine/story.jpg  .jpg  train
1     131     840       air/method.bmp  .bmp  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          1       listen           14  ...  276.974642    9.718823  184.684056
1          0        reach           22  ...    6.311037  123.141689  174.239136
[2 rows x 8 columns]
Label map : {14: 'listen', 15: 'marriage', 22: 'reach'}
Notice how the two label maps have an overlapping index (the id 15).
>>> example1 + example2
Using the following class remapping dictionary : {14: 14, 15: 16, 22: 22}
Dataset object containing 4 images and 4 objects
Name : inside_else_memory+shake_effort_many
Images root : .
Images :
    width  height                     relative_path   type  split
id
0     342     136         such/serious/help/me.jpeg  .jpeg  train
1     377     167    such/serious/whatever/wait.png   .png  train
2     131     840       care/suggest/air/method.bmp   .bmp  train
3     955     229  care/suggest/determine/story.jpg   .jpg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...   73.932999   71.552480   42.673983
1          0          why           19  ...    4.567638  248.551257  122.602211
2          2       listen           14  ...  276.974642    9.718823  184.684056
3          3        reach           22  ...    6.311037  123.141689  174.239136
[4 rows x 8 columns]
Label map : {14: 'listen', 15: 'step', 16: 'marriage', 19: 'why', 22: 'reach', 25: 'interview'}
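To make the remapping dictionary printed above less mysterious, here is one plausible way such a realignment could be computed. This is an assumption for illustration only: `realign` is a hypothetical function, and the strategy used (keep ids that do not clash, reuse ids of classes with the same name, otherwise take the next free id) is not confirmed to be what lours does internally.

```python
def realign(base: dict, other: dict):
    """Return (remap, merged_label_map) for two {class_id: name} label maps.

    Hypothetical strategy, not necessarily the one used by lours.
    """
    merged = dict(base)
    name_to_id = {name: class_id for class_id, name in base.items()}
    remap = {}
    for class_id, name in sorted(other.items()):
        if name in name_to_id:
            # Same class name already known: reuse its id.
            new_id = name_to_id[name]
        else:
            new_id = class_id
            # Id taken by a different class: move to the next id that is
            # used neither by the merged map nor by the other label map.
            while new_id in merged or (new_id != class_id and new_id in other):
                new_id += 1
        remap[class_id] = new_id
        merged[new_id] = name
        name_to_id[name] = new_id
    return remap, dict(sorted(merged.items()))


remap, merged = realign(
    {15: "step", 19: "why", 25: "interview"},
    {14: "listen", 15: "marriage", 22: "reach"},
)
print(remap)
print(merged)
```

On the two label maps of this example, the sketch happens to produce the same remapping as the one printed by the merge, but other inputs may well be handled differently by the library.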
>>> example1.merge(example2, realign_label_map=False)
Traceback (most recent call last):
    ...
lours.utils.label_map_merger.IncompatibleLabelMapsError: Label maps are incompatible
>>> example1.merge(
...     example2, realign_label_map=True, allow_overlapping_image_ids=False
... )
Traceback (most recent call last):
    ...
ValueError: Overlapping image ids not permitted. Consider using the allow_overlapping_image_ids or ignore_index options
The following call will raise an error because overlapping image ids are allowed only if the rows are compatible, i.e. fields that are present in both rows have the same value.
>>> example1.merge(
...     example2, realign_label_map=True, allow_overlapping_image_ids=True
... )
Traceback (most recent call last):
    ...
AssertionError: sub-Dataframes constructed from ids and columns in both DataFrames are not equal.
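The compatibility rule for overlapping image ids can be illustrated without pandas, using plain dictionaries in place of dataframe rows. The function names below are hypothetical, and the sketch deliberately leaves out the additional requirement that both images_root values match.

```python
def rows_compatible(row_a: dict, row_b: dict) -> bool:
    """True if every field present in both rows holds the same value."""
    shared_fields = row_a.keys() & row_b.keys()
    return all(row_a[field] == row_b[field] for field in shared_fields)


def overlapping_ids_compatible(images_a: dict, images_b: dict) -> bool:
    """Check all image ids present in both {image_id: row} tables."""
    shared_ids = images_a.keys() & images_b.keys()
    return all(rows_compatible(images_a[i], images_b[i]) for i in shared_ids)


# Id 0 refers to two different images, as in the failing merge above.
different = overlapping_ids_compatible(
    {0: {"width": 342, "height": 136, "relative_path": "help/me.jpeg"}},
    {0: {"width": 955, "height": 229, "relative_path": "determine/story.jpg"}},
)
# Id 1 refers to the same image; the extra "split" field is ignored
# because it is present in only one of the two rows.
same = overlapping_ids_compatible(
    {1: {"width": 377, "height": 831, "relative_path": "whatever/wait.png"}},
    {1: {"width": 377, "height": 831, "split": "train"}},
)
print(different, same)
```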
The only way to merge these datasets is to remap the label map and then reset the indexes with the option ignore_index set to True, similar to pandas.concat().
>>> example1.merge(
...     example2.remap_classes({15: 1}, remove_not_mapped=False),
...     ignore_index=True,
... )
Dataset object containing 4 images and 4 objects
Name : inside_else_memory+shake_effort_many
Images root : .
Images :
    width  height                     relative_path   type  split
id
0     342     136         such/serious/help/me.jpeg  .jpeg  train
1     377     167    such/serious/whatever/wait.png   .png  train
2     131     840       care/suggest/air/method.bmp   .bmp  train
3     955     229  care/suggest/determine/story.jpg   .jpg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...   73.932999   71.552480   42.673983
1          0          why           19  ...    4.567638  248.551257  122.602211
2          2       listen           14  ...  276.974642    9.718823  184.684056
3          3        reach           22  ...    6.311037  123.141689  174.239136
[4 rows x 8 columns]
Label map : {1: 'marriage', 14: 'listen', 15: 'step', 19: 'why', 22: 'reach', 25: 'interview'}
Let’s construct two datasets sharing image info and label maps
>>> example = dummy_dataset(5, 5, seed=0)
>>> example1 = example.iloc_annot[::2].iloc[1:]
>>> example2 = example.iloc_annot[1::2].iloc[:-1]
>>> example1
Dataset object containing 4 images and 3 objects
Name : inside_else_memory
Images root : such/serious
Images :
    width  height           relative_path   type  split
id
1     377     831       whatever/wait.png   .png  train
2     136     684        chair/mother.gif   .gif  train
3     167     921  someone/challenge.jpeg  .jpeg  train
4     114     553  successful/present.bmp   .bmp  train
Annotations :
    image_id category_str  category_id  ...   box_y_min  box_width  box_height
id                                      ...
0          3          why           19  ...  498.685784  31.192237  404.663563
2          3    interview           25  ...  389.294931  19.083146  209.778063
4          2         step           15  ...   85.009761  18.228218  181.012493
[3 rows x 8 columns]
Label map : {15: 'step', 19: 'why', 25: 'interview'}
>>> example2
Dataset object containing 4 images and 1 object
Name : inside_else_memory
Images root : such/serious
Images :
    width  height           relative_path   type  split
id
0     342     257            help/me.jpeg  .jpeg  train
1     377     831       whatever/wait.png   .png  train
2     136     684        chair/mother.gif   .gif  train
3     167     921  someone/challenge.jpeg  .jpeg  train
Annotations :
    image_id category_str  category_id  ...  box_y_min  box_width  box_height
id                                      ...
3          3         step           15  ...  26.082417  34.739663  607.977022
[1 rows x 8 columns]
Label map : {15: 'step', 19: 'why', 25: 'interview'}
>>> example1.merge(example2)
Dataset object containing 5 images and 4 objects
Name : inside_else_memory
Images root : such/serious
Images :
    width  height           relative_path   type  split
id
1     377     831       whatever/wait.png   .png  train
2     136     684        chair/mother.gif   .gif  train
3     167     921  someone/challenge.jpeg  .jpeg  train
4     114     553  successful/present.bmp   .bmp  train
0     342     257            help/me.jpeg  .jpeg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min  box_width  box_height
id                                      ...
0          3          why           19  ...  498.685784  31.192237  404.663563
2          3    interview           25  ...  389.294931  19.083146  209.778063
4          2         step           15  ...   85.009761  18.228218  181.012493
3          3         step           15  ...   26.082417  34.739663  607.977022
[4 rows x 8 columns]
Label map : {15: 'step', 19: 'why', 25: 'interview'}
See that if we use the ignore_index option, the images are duplicated because it is assumed the two images dataframes don’t have any overlap.
>>> example1.merge(example2, ignore_index=True)
Dataset object containing 8 images and 4 objects
Name : inside_else_memory
Images root : such/serious
Images :
    width  height           relative_path   type  split
id
0     136     684        chair/mother.gif   .gif  train
1     167     921  someone/challenge.jpeg  .jpeg  train
2     114     553  successful/present.bmp   .bmp  train
3     377     831       whatever/wait.png   .png  train
4     136     684        chair/mother.gif   .gif  train
5     342     257            help/me.jpeg  .jpeg  train
6     167     921  someone/challenge.jpeg  .jpeg  train
7     377     831       whatever/wait.png   .png  train
Annotations :
    image_id category_str  category_id  ...   box_y_min  box_width  box_height
id                                      ...
0          0         step           15  ...   85.009761  18.228218  181.012493
1          1          why           19  ...  498.685784  31.192237  404.663563
2          1    interview           25  ...  389.294931  19.083146  209.778063
3          6         step           15  ...   26.082417  34.739663  607.977022
[4 rows x 8 columns]
Label map : {15: 'step', 19: 'why', 25: 'interview'}
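The effect of ignore_index can be pictured as a concatenation that hands out fresh ids and rewrites image_id in the annotations to follow them. The sketch below uses plain dictionaries and a hypothetical `concat_reset_ids` function; it is not the lours implementation, and the order in which new ids are assigned here (first dataset, then second) differs from the ordering visible in the output above.

```python
def concat_reset_ids(images_a, annots_a, images_b, annots_b):
    """Concatenate two datasets, resetting image and annotation ids.

    Tables are {id: row} dicts; annotation rows carry an "image_id" key.
    Hypothetical helper for illustration only.
    """
    images, annotations = {}, {}
    for imgs, annots in ((images_a, annots_a), (images_b, annots_b)):
        id_map = {}
        for old_id, row in imgs.items():
            new_id = len(images)  # next free image id
            id_map[old_id] = new_id
            images[new_id] = row
        for row in annots.values():
            # Point the annotation at the new id of its image.
            new_row = {**row, "image_id": id_map[row["image_id"]]}
            annotations[len(annotations)] = new_row
    return images, annotations


shared_images = {1: {"relative_path": "whatever/wait.png"},
                 3: {"relative_path": "someone/challenge.jpeg"}}
images, annotations = concat_reset_ids(
    shared_images, {0: {"image_id": 3, "category_str": "why"}},
    shared_images, {3: {"image_id": 3, "category_str": "step"}},
)
print(len(images))
print(annotations)
```

Because no attempt is made to detect identical rows, the two shared images appear twice in the result, which is the duplication the example above points out.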
Finally, you can mark the origin of your datasets in dedicated columns in the resulting dataset’s dataframes.
>>> example1 = dummy_dataset(
...     2, 2, seed=0, label_map={0: "car"}, dataset_name="A"
... )
>>> example2 = dummy_dataset(
...     2, 2, seed=1, label_map={0: "car"}, dataset_name="B"
... )
>>> example1
Dataset object containing 2 images and 2 objects
Name : A
Images root : such/serious
Images :
    width  height relative_path   type  split
id
0     865     560  step/why.jpg   .jpg  train
1     673     342  help/me.jpeg  .jpeg    val
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0          car            0  ...  511.143123  616.718121   12.497434
1          0          car            0  ...  339.716034  233.243139  117.161956
[2 rows x 8 columns]
Label map : {0: 'car'}
>>> example2
Dataset object containing 2 images and 2 objects
Name : B
Images root : care/suggest
Images :
    width  height        relative_path  type  split
id
0     525     779   reach/marriage.jpg  .jpg  train
1     560     955  determine/story.jpg  .jpg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0          car            0  ...   21.468549  283.211413  308.302755
1          0          car            0  ...  586.986712  124.825174   57.793609
[2 rows x 8 columns]
Label map : {0: 'car'}
>>> merged_examples = example1.merge(
...     example2, mark_origin=True, ignore_index=True
... )
>>> merged_examples
Dataset object containing 4 images and 4 objects
Name : A+B
Images root : .
Images :
    width  height                     relative_path  ...  split origin origin_id
id                                                   ...
0     673     342         such/serious/help/me.jpeg  ...    val      A         1
1     865     560         such/serious/step/why.jpg  ...  train      A         0
2     560     955  care/suggest/determine/story.jpg  ...  train      B         1
3     525     779   care/suggest/reach/marriage.jpg  ...  train      B         0
[4 rows x 7 columns]
Annotations :
    image_id category_str  category_id  ...  box_height origin origin_id
id                                      ...
0          1          car            0  ...   12.497434      A         0
1          1          car            0  ...  117.161956      A         1
2          3          car            0  ...   57.793609      B         1
3          3          car            0  ...  308.302755      B         0
[4 rows x 10 columns]
Label map : {0: 'car'}
By default, a dataset whose samples already have an origin will retain it in further merges. Optionally, you can overwrite the origin with the name of the dataset that is actually being merged and discard the old origin.
>>> example3 = dummy_dataset(
...     2, 2, seed=2, label_map={0: "car"}, dataset_name="C"
... )
>>> merged_examples.merge(example3, mark_origin=True, ignore_index=True)
Dataset object containing 6 images and 6 objects
Name : A+B+C
Images root : .
Images :
    width  height                     relative_path  ...  split origin origin_id
id                                                   ...
0     560     955  care/suggest/determine/story.jpg  ...  train      B         1
1     525     779   care/suggest/reach/marriage.jpg  ...  train      B         0
2     673     342         such/serious/help/me.jpeg  ...    val      A         1
3     865     560         such/serious/step/why.jpg  ...  train      A         0
4     335     368        what/way/police/enter.jpeg  ...  train      C         1
5     853     198  what/way/relationship/table.tiff  ...  train      C         0
[6 rows x 7 columns]
Annotations :
    image_id category_str  category_id  ...  box_height origin origin_id
id                                      ...
0          1          car            0  ...   57.793609      B         1
1          1          car            0  ...  308.302755      B         0
2          3          car            0  ...   12.497434      A         0
3          3          car            0  ...  117.161956      A         1
4          4          car            0  ...  137.766169      C         1
5          5          car            0  ...   14.083247      C         0
[6 rows x 10 columns]
Label map : {0: 'car'}
>>> merged_examples.merge(
...     example3, mark_origin=True, ignore_index=True, overwrite_origin=True
... )
Dataset object containing 6 images and 6 objects
Name : A+B+C
Images root : .
Images :
    width  height                     relative_path  ...  split origin origin_id
id                                                   ...
0     560     955  care/suggest/determine/story.jpg  ...  train    A+B         2
1     525     779   care/suggest/reach/marriage.jpg  ...  train    A+B         3
2     673     342         such/serious/help/me.jpeg  ...    val    A+B         0
3     865     560         such/serious/step/why.jpg  ...  train    A+B         1
4     335     368        what/way/police/enter.jpeg  ...  train      C         1
5     853     198  what/way/relationship/table.tiff  ...  train      C         0
[6 rows x 7 columns]
Annotations :
    image_id category_str  category_id  ...  box_height origin origin_id
id                                      ...
0          1          car            0  ...   57.793609    A+B         2
1          1          car            0  ...  308.302755    A+B         3
2          3          car            0  ...   12.497434    A+B         0
3          3          car            0  ...  117.161956    A+B         1
4          4          car            0  ...  137.766169      C         1
5          5          car            0  ...   14.083247      C         0
[6 rows x 10 columns]
Label map : {0: 'car'}
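The difference between keeping and overwriting an existing origin can be reduced to a small rule applied to each row before the merge. The following is a sketch with plain dictionaries; `add_origin_columns` is a hypothetical name and the code is not taken from lours.

```python
def add_origin_columns(rows: dict, dataset_name: str, overwrite: bool = False) -> dict:
    """Add "origin" and "origin_id" to each row of an {id: row} table.

    Rows that already carry an origin keep it unless overwrite is True.
    Hypothetical helper for illustration only.
    """
    marked = {}
    for row_id, row in rows.items():
        if overwrite or "origin" not in row:
            # origin_id is the id the row had in the dataset being merged.
            row = {**row, "origin": dataset_name, "origin_id": row_id}
        marked[row_id] = row
    return marked


# Two rows of a dataset named "A+B" that were marked during an earlier merge.
previous = {
    0: {"relative_path": "such/serious/help/me.jpeg", "origin": "A", "origin_id": 1},
    2: {"relative_path": "care/suggest/determine/story.jpg", "origin": "B", "origin_id": 1},
}
kept = add_origin_columns(previous, "A+B")
overwritten = add_origin_columns(previous, "A+B", overwrite=True)
print(kept[0]["origin"], kept[0]["origin_id"])
print(overwritten[0]["origin"], overwritten[0]["origin_id"])
```

This mirrors the two outputs above: without overwriting, rows keep “A” or “B” and their original ids, while with overwriting they are all attributed to “A+B” with the ids they had in that intermediate dataset.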