merge

Dataset.merge(other: Dataset, allow_overlapping_image_ids: bool = True, realign_label_map: bool = False, ignore_index: bool = False, mark_origin: bool = False, overwrite_origin: bool = False) → Dataset

Merge two datasets and return a single dataset object containing the samples from both. The result’s images_root will be the common path of both datasets’ images roots, and the image relative paths will be updated accordingly. The result’s label map will be the superset of both label maps, provided one is included in the other.
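
As a rough illustration of the images_root rule, the sketch below computes a common root for two relative roots and rewrites relative paths against it. It relies only on the standard posixpath module. The helper names common_root and rebase are hypothetical and not part of lours, whose actual implementation may differ.

```python
import posixpath


def common_root(root_a: str, root_b: str) -> str:
    """Return the deepest directory shared by both roots ('.' if none)."""
    # commonpath returns '' when two relative paths share no prefix,
    # so fall back to the current directory in that case.
    return posixpath.commonpath([root_a, root_b]) or "."


def rebase(root: str, relative_path: str, new_root: str) -> str:
    """Express root/relative_path relative to new_root."""
    return posixpath.relpath(posixpath.join(root, relative_path), new_root)


new_root = common_root("such/serious", "care/suggest")
print(new_root)  # .
print(rebase("such/serious", "help/me.jpeg", new_root))  # such/serious/help/me.jpeg
print(rebase("care/suggest", "air/method.bmp", new_root))  # care/suggest/air/method.bmp
```

The roots and file names are taken from the first example further down this page, where the merged dataset ends up with "." as its images root.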

Notes

This function is also usable with the + operator.
If possible, booleanized columns for images and annotations will be broadcast together. See lours.utils.column_booleanizer.broadcast_booleanization().
If one of the datasets has an absolute path as images_root, the other dataset’s images root path will also be converted to absolute.
If both datasets have the same name, the output will have that name as well.
If the datasets have different names, the output name will be the concatenation of both names, separated by a “+” sign. The merge output of “A” and “B” will thus be named “A+B”.
If one dataset has no name (dataset.name is None), the output will take the name of the other.
If mark_origin is set, it will be effective only if the datasets have different actual names (not None).
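
The naming rules in the notes above can be summarised in a few lines of plain Python. This is only a sketch: merged_name is a hypothetical helper, not a lours function, and the behaviour when both names are None is an assumption, since the notes do not cover that case.

```python
from typing import Optional


def merged_name(name_a: Optional[str], name_b: Optional[str]) -> Optional[str]:
    """Name of the merged dataset, following the rules listed in the notes."""
    if name_a is None:
        # Unnamed dataset: take the other name (assumed None if both are None)
        return name_b
    if name_b is None or name_a == name_b:
        # Same name, or only the first dataset is named: keep that name
        return name_a
    # Two different actual names: join them with a "+" sign
    return f"{name_a}+{name_b}"


print(merged_name("A", "B"))  # A+B
print(merged_name("A", "A"))  # A
print(merged_name(None, "B"))  # B
```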

Parameters:

other – Other dataset to merge with. This dataset must be compatible with the first one, i.e. one label map is included in the other, and image and annotation ids are mutually exclusive between datasets (unless ignore_index is True)
allow_overlapping_image_ids – if set to True, will try to join images dataframes with overlapping ids. The whole rows (i.e. with values from columns present in both dataframes) must match, as well as the images_root. In that case, annotations with this image_id (from self or other) will be assumed to come from the same image. Defaults to True
realign_label_map – If set to True, will try to remap classes of the other dataset to match this dataset’s label map, to avoid a potential error due to incompatible label maps. Defaults to False.
ignore_index – if set to True, will ignore overlapping ids for images and annotations and reset them. Will update the image_id column in the annotations accordingly. Note that this option makes allow_overlapping_image_ids irrelevant. Defaults to False.
mark_origin – If set to True, and if both datasets have a different name, will add two columns “origin” and “origin_id” to the images and annotations dataframes, indicating respectively the name of the origin dataset and the sample’s id in that dataset. Defaults to False.
overwrite_origin – If set to True, will overwrite already existing origin columns in the input datasets’ dataframes. Otherwise, will only mark the origin if it’s not already present. Defaults to False.

Raises:

ValueError – Error if the two datasets are incompatible (see above)

Returns:

Merged dataset.

Example

>>> from lours.utils.doc_utils import dummy_dataset
>>> example1 = dummy_dataset(2, 2, seed=0)
>>> example2 = dummy_dataset(2, 2, seed=1)
>>> example1
Dataset object containing 2 images and 2 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height      relative_path   type  split
id
0     342     136       help/me.jpeg  .jpeg  train
1     377     167  whatever/wait.png   .png  train
Annotations :
    image_id category_str  category_id  ...  box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...  73.932999   71.552480   42.673983
1          0          why           19  ...   4.567638  248.551257  122.602211

[2 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
>>> example2
Dataset object containing 2 images and 2 objects
Name :
    shake_effort_many
Images root :
    care/suggest
Images :
    width  height        relative_path  type  split
id
0     955     229  determine/story.jpg  .jpg  train
1     131     840       air/method.bmp  .bmp  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          1       listen           14  ...  276.974642    9.718823  184.684056
1          0        reach           22  ...    6.311037  123.141689  174.239136

[2 rows x 8 columns]
Label map :
{14: 'listen', 15: 'marriage', 22: 'reach'}

Notice how the two label maps have an overlapping index (id 15) that maps to a different class name in each dataset.

>>> example1 + example2
Using the following class remapping dictionary :
{14: 14, 15: 16, 22: 22}
Dataset object containing 4 images and 4 objects
Name :
    inside_else_memory+shake_effort_many
Images root :
    .
Images :
    width  height                     relative_path   type  split
id
0     342     136         such/serious/help/me.jpeg  .jpeg  train
1     377     167    such/serious/whatever/wait.png   .png  train
2     131     840       care/suggest/air/method.bmp   .bmp  train
3     955     229  care/suggest/determine/story.jpg   .jpg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...   73.932999   71.552480   42.673983
1          0          why           19  ...    4.567638  248.551257  122.602211
2          2       listen           14  ...  276.974642    9.718823  184.684056
3          3        reach           22  ...    6.311037  123.141689  174.239136

[4 rows x 8 columns]
Label map :
{14: 'listen',
 15: 'step',
 16: 'marriage',
 19: 'why',
 22: 'reach',
 25: 'interview'}

>>> example1.merge(example2, realign_label_map=False)
Traceback (most recent call last):
    ...
lours.utils.label_map_merger.IncompatibleLabelMapsError: Label maps are incompatible
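
To see why these two label maps are rejected, the inclusion rule stated in the description ("one label map is included in the other") can be checked by hand. The sketch below is an illustration only: is_included and merge_label_maps are hypothetical helpers, and a plain ValueError stands in for the library’s own IncompatibleLabelMapsError.

```python
def is_included(small: dict, big: dict) -> bool:
    """True if every (id, name) pair of `small` is also present in `big`."""
    return all(
        class_id in big and big[class_id] == name
        for class_id, name in small.items()
    )


def merge_label_maps(map_a: dict, map_b: dict) -> dict:
    """Return the superset of both maps, provided one includes the other."""
    if not (is_included(map_a, map_b) or is_included(map_b, map_a)):
        raise ValueError("Label maps are incompatible")
    return dict(sorted({**map_a, **map_b}.items()))


label_map_1 = {15: "step", 19: "why", 25: "interview"}
label_map_2 = {14: "listen", 15: "marriage", 22: "reach"}

# Neither map contains the other: id 15 is 'step' in one, 'marriage' in the other
print(is_included(label_map_1, label_map_2))  # False
print(is_included(label_map_2, label_map_1))  # False

# A map that is a subset of another merges without trouble
print(merge_label_maps({15: "step"}, label_map_1))
```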

>>> example1.merge(
...     example2, realign_label_map=True, allow_overlapping_image_ids=False
... )
Traceback (most recent call last):
    ...
ValueError: Overlapping image ids not permitted. Consider using the allow_overlapping_image_ids or ignore_index options

The following call also raises an error, because overlapping image ids are permitted only if the rows are compatible: fields that are present in both rows must have the same value.

>>> example1.merge(
...     example2, realign_label_map=True, allow_overlapping_image_ids=True
... )
Traceback (most recent call last):
    ...
AssertionError: sub-Dataframes constructed from ids and columns in both DataFrames are not equal.
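
The row compatibility condition can be pictured with plain dictionaries. In this sketch, overlapping_rows_match is a hypothetical helper rather than the check lours actually performs, and it ignores the additional requirement that both images_root values match.

```python
def overlapping_rows_match(images_a: dict, images_b: dict) -> bool:
    """Check that rows sharing an image id agree on every shared column.

    Both arguments map an image id to a dict of column values.
    """
    for image_id in images_a.keys() & images_b.keys():
        row_a, row_b = images_a[image_id], images_b[image_id]
        shared_columns = row_a.keys() & row_b.keys()
        if any(row_a[col] != row_b[col] for col in shared_columns):
            return False
    return True


# Image 0 as it appears in each of the two datasets above
first = {0: {"width": 342, "height": 136, "relative_path": "help/me.jpeg"}}
second = {0: {"width": 955, "height": 229, "relative_path": "determine/story.jpg"}}
print(overlapping_rows_match(first, second))  # False

# Same id and same values on shared columns: the rows are compatible
same = {0: {"width": 342, "height": 136}}
print(overlapping_rows_match(first, same))  # True
```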

The only way to merge these datasets is to remap the label map and then reset the indexes with the option ignore_index set to True, similar to pandas.concat().

>>> example1.merge(
...     example2.remap_classes({15: 1}, remove_not_mapped=False),
...     ignore_index=True,
... )
Dataset object containing 4 images and 4 objects
Name :
    inside_else_memory+shake_effort_many
Images root :
    .
Images :
    width  height                     relative_path   type  split
id
0     342     136         such/serious/help/me.jpeg  .jpeg  train
1     377     167    such/serious/whatever/wait.png   .png  train
2     131     840       care/suggest/air/method.bmp   .bmp  train
3     955     229  care/suggest/determine/story.jpg   .jpg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...   73.932999   71.552480   42.673983
1          0          why           19  ...    4.567638  248.551257  122.602211
2          2       listen           14  ...  276.974642    9.718823  184.684056
3          3        reach           22  ...    6.311037  123.141689  174.239136

[4 rows x 8 columns]
Label map :
{1: 'marriage',
 14: 'listen',
 15: 'step',
 19: 'why',
 22: 'reach',
 25: 'interview'}

Let’s construct two datasets sharing image info and label maps

>>> example = dummy_dataset(5, 5, seed=0)
>>> example1 = example.iloc_annot[::2].iloc[1:]
>>> example2 = example.iloc_annot[1::2].iloc[:-1]

>>> example1
Dataset object containing 4 images and 3 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height           relative_path   type  split
id
1     377     831       whatever/wait.png   .png  train
2     136     684        chair/mother.gif   .gif  train
3     167     921  someone/challenge.jpeg  .jpeg  train
4     114     553  successful/present.bmp   .bmp  train
Annotations :
    image_id category_str  category_id  ...   box_y_min  box_width  box_height
id                                      ...
0          3          why           19  ...  498.685784  31.192237  404.663563
2          3    interview           25  ...  389.294931  19.083146  209.778063
4          2         step           15  ...   85.009761  18.228218  181.012493

[3 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
>>> example2
Dataset object containing 4 images and 1 object
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height           relative_path   type  split
id
0     342     257            help/me.jpeg  .jpeg  train
1     377     831       whatever/wait.png   .png  train
2     136     684        chair/mother.gif   .gif  train
3     167     921  someone/challenge.jpeg  .jpeg  train
Annotations :
    image_id category_str  category_id  ...  box_y_min  box_width  box_height
id                                      ...
3          3         step           15  ...  26.082417  34.739663  607.977022

[1 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
>>> example1.merge(example2)
Dataset object containing 5 images and 4 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height           relative_path   type  split
id
1     377     831       whatever/wait.png   .png  train
2     136     684        chair/mother.gif   .gif  train
3     167     921  someone/challenge.jpeg  .jpeg  train
4     114     553  successful/present.bmp   .bmp  train
0     342     257            help/me.jpeg  .jpeg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min  box_width  box_height
id                                      ...
0          3          why           19  ...  498.685784  31.192237  404.663563
2          3    interview           25  ...  389.294931  19.083146  209.778063
4          2         step           15  ...   85.009761  18.228218  181.012493
3          3         step           15  ...   26.082417  34.739663  607.977022

[4 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}

Note that with the ignore_index option, the images are duplicated, because the two images dataframes are assumed not to have any overlap.

>>> example1.merge(example2, ignore_index=True)
Dataset object containing 8 images and 4 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height           relative_path   type  split
id
0     136     684        chair/mother.gif   .gif  train
1     167     921  someone/challenge.jpeg  .jpeg  train
2     114     553  successful/present.bmp   .bmp  train
3     377     831       whatever/wait.png   .png  train
4     136     684        chair/mother.gif   .gif  train
5     342     257            help/me.jpeg  .jpeg  train
6     167     921  someone/challenge.jpeg  .jpeg  train
7     377     831       whatever/wait.png   .png  train
Annotations :
    image_id category_str  category_id  ...   box_y_min  box_width  box_height
id                                      ...
0          0         step           15  ...   85.009761  18.228218  181.012493
1          1          why           19  ...  498.685784  31.192237  404.663563
2          1    interview           25  ...  389.294931  19.083146  209.778063
3          6         step           15  ...   26.082417  34.739663  607.977022

[4 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
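
The id reset performed with ignore_index, including the update of image_id in the annotations, can be sketched with plain dictionaries. The function concat_and_reset is hypothetical and only meant to show the bookkeeping. The row order in the real output may differ, as the example above shows.

```python
def concat_and_reset(images_a, annots_a, images_b, annots_b):
    """Concatenate two datasets, giving every image and annotation a new id.

    Images and annotations are dicts mapping an id to a dict of column values.
    Each annotation row has an "image_id" entry pointing into its own dataset.
    """
    new_images, new_annots = {}, {}
    for images, annots in ((images_a, annots_a), (images_b, annots_b)):
        # Map each old image id of this dataset to a fresh, unique id
        id_map = {}
        for old_id, row in images.items():
            id_map[old_id] = len(new_images)
            new_images[id_map[old_id]] = row
        # Renumber annotations and point them at the new image ids
        for row in annots.values():
            new_annots[len(new_annots)] = {**row, "image_id": id_map[row["image_id"]]}
    return new_images, new_annots


images_a = {1: {"relative_path": "whatever/wait.png"}, 2: {"relative_path": "chair/mother.gif"}}
annots_a = {4: {"image_id": 2, "category_id": 15}}
images_b = {1: {"relative_path": "whatever/wait.png"}}
annots_b = {3: {"image_id": 1, "category_id": 15}}

images, annots = concat_and_reset(images_a, annots_a, images_b, annots_b)
print(sorted(images))  # [0, 1, 2]: whatever/wait.png now appears twice
print(annots)
```

Because no attempt is made to match rows between the two inputs, an image present in both datasets is kept twice, which is the duplication visible in the output above.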

Finally, you can mark the origin of your datasets in dedicated columns in the resulting dataset’s dataframes.

>>> example1 = dummy_dataset(
...     2, 2, seed=0, label_map={0: "car"}, dataset_name="A"
... )
>>> example2 = dummy_dataset(
...     2, 2, seed=1, label_map={0: "car"}, dataset_name="B"
... )
>>> example1
Dataset object containing 2 images and 2 objects
Name :
    A
Images root :
    such/serious
Images :
    width  height relative_path   type  split
id
0     865     560  step/why.jpg   .jpg  train
1     673     342  help/me.jpeg  .jpeg    val
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0          car            0  ...  511.143123  616.718121   12.497434
1          0          car            0  ...  339.716034  233.243139  117.161956

[2 rows x 8 columns]
Label map :
{0: 'car'}
>>> example2
Dataset object containing 2 images and 2 objects
Name :
    B
Images root :
    care/suggest
Images :
    width  height        relative_path  type  split
id
0     525     779   reach/marriage.jpg  .jpg  train
1     560     955  determine/story.jpg  .jpg  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          0          car            0  ...   21.468549  283.211413  308.302755
1          0          car            0  ...  586.986712  124.825174   57.793609

[2 rows x 8 columns]
Label map :
{0: 'car'}
>>> merged_examples = example1.merge(
...     example2, mark_origin=True, ignore_index=True
... )
>>> merged_examples
Dataset object containing 4 images and 4 objects
Name :
    A+B
Images root :
    .
Images :
    width  height                     relative_path  ...  split origin origin_id
id                                                   ...
0     673     342         such/serious/help/me.jpeg  ...    val      A         1
1     865     560         such/serious/step/why.jpg  ...  train      A         0
2     560     955  care/suggest/determine/story.jpg  ...  train      B         1
3     525     779   care/suggest/reach/marriage.jpg  ...  train      B         0

[4 rows x 7 columns]
Annotations :
    image_id category_str  category_id  ...  box_height  origin  origin_id
id                                      ...
0          1          car            0  ...   12.497434       A          0
1          1          car            0  ...  117.161956       A          1
2          3          car            0  ...   57.793609       B          1
3          3          car            0  ...  308.302755       B          0

[4 rows x 10 columns]
Label map :
{0: 'car'}

By default, a dataset whose samples already have an origin will retain it in further merges. Optionally, you can overwrite the origin with the name of the dataset actually being merged, discarding the old origin.

>>> example3 = dummy_dataset(
...     2, 2, seed=2, label_map={0: "car"}, dataset_name="C"
... )
>>> merged_examples.merge(example3, mark_origin=True, ignore_index=True)
Dataset object containing 6 images and 6 objects
Name :
    A+B+C
Images root :
    .
Images :
    width  height                     relative_path  ...  split origin origin_id
id                                                   ...
0     560     955  care/suggest/determine/story.jpg  ...  train      B         1
1     525     779   care/suggest/reach/marriage.jpg  ...  train      B         0
2     673     342         such/serious/help/me.jpeg  ...    val      A         1
3     865     560         such/serious/step/why.jpg  ...  train      A         0
4     335     368        what/way/police/enter.jpeg  ...  train      C         1
5     853     198  what/way/relationship/table.tiff  ...  train      C         0

[6 rows x 7 columns]
Annotations :
    image_id category_str  category_id  ...  box_height  origin  origin_id
id                                      ...
0          1          car            0  ...   57.793609       B          1
1          1          car            0  ...  308.302755       B          0
2          3          car            0  ...   12.497434       A          0
3          3          car            0  ...  117.161956       A          1
4          4          car            0  ...  137.766169       C          1
5          5          car            0  ...   14.083247       C          0

[6 rows x 10 columns]
Label map :
{0: 'car'}
>>> merged_examples.merge(
...     example3, mark_origin=True, ignore_index=True, overwrite_origin=True
... )
Dataset object containing 6 images and 6 objects
Name :
    A+B+C
Images root :
    .
Images :
    width  height                     relative_path  ...  split origin origin_id
id                                                   ...
0     560     955  care/suggest/determine/story.jpg  ...  train    A+B         2
1     525     779   care/suggest/reach/marriage.jpg  ...  train    A+B         3
2     673     342         such/serious/help/me.jpeg  ...    val    A+B         0
3     865     560         such/serious/step/why.jpg  ...  train    A+B         1
4     335     368        what/way/police/enter.jpeg  ...  train      C         1
5     853     198  what/way/relationship/table.tiff  ...  train      C         0

[6 rows x 7 columns]
Annotations :
    image_id category_str  category_id  ...  box_height  origin  origin_id
id                                      ...
0          1          car            0  ...   57.793609     A+B          2
1          1          car            0  ...  308.302755     A+B          3
2          3          car            0  ...   12.497434     A+B          0
3          3          car            0  ...  117.161956     A+B          1
4          4          car            0  ...  137.766169       C          1
5          5          car            0  ...   14.083247       C          0

[6 rows x 10 columns]
Label map :
{0: 'car'}
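
The difference between keeping and overwriting an existing origin can be sketched as follows. The mark_origin function here is a hypothetical stand-in working on plain dictionaries, and it decides row by row, which is an assumption: lours operates on dataframe columns and may behave differently in edge cases.

```python
def mark_origin(rows: dict, dataset_name: str, overwrite: bool = False) -> dict:
    """Attach "origin" and "origin_id" to each row of a dataset.

    `rows` maps a sample id to a dict of column values. Rows that already
    carry an origin keep it unless `overwrite` is True.
    """
    marked = {}
    for row_id, row in rows.items():
        if overwrite or "origin" not in row:
            row = {**row, "origin": dataset_name, "origin_id": row_id}
        marked[row_id] = row
    return marked


# A row of the dataset "A+B" that was already marked during an earlier merge
merged_rows = {2: {"width": 673, "origin": "A", "origin_id": 1}}

print(mark_origin(merged_rows, "A+B"))
# {2: {'width': 673, 'origin': 'A', 'origin_id': 1}}
print(mark_origin(merged_rows, "A+B", overwrite=True))
# {2: {'width': 673, 'origin': 'A+B', 'origin_id': 2}}
```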