Booleanize columns of lists in your Dataset dataframes#

This notebook shows how the booleanize and debooleanize methods can be used to filter on attributes and tags easily.

Booleanization converts a column of lists into a set of boolean columns, one per distinct element. Each boolean column tells whether that element is present in the original list.

The notebook also shows how the display widget lets you choose between showing boolean values and list values.
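To make the definition above concrete, here is a minimal pure-Python sketch of the idea. It does not use lours and is not how lours implements booleanization; the helper booleanize_column and the sample rows are hypothetical, and only the attributes.colors.red naming pattern is taken from this notebook.

```python
# Illustrative sketch only: a plain-Python picture of booleanization,
# not the lours implementation.
def booleanize_column(rows, column):
    """Replace a column of lists with one boolean column per distinct element."""
    # Collect every distinct value found in any row's list, in sorted order.
    values = sorted({value for row in rows for value in row[column]})
    result = []
    for row in rows:
        # Copy every cell except the list column being replaced.
        new_row = {key: cell for key, cell in row.items() if key != column}
        # Add one boolean cell per distinct value.
        for value in values:
            new_row[f"{column}.{value}"] = value in row[column]
        result.append(new_row)
    return result


# Hypothetical annotations with a list-valued "attributes.colors" column.
annotations = [
    {"id": 0, "attributes.colors": ["red", "blue"]},
    {"id": 1, "attributes.colors": ["red"]},
    {"id": 2, "attributes.colors": []},
]
booleanized_rows = booleanize_column(annotations, "attributes.colors")
for row in booleanized_rows:
    print(row)
```

Each output row keeps its id and gains one boolean cell per color, while the original list cell is gone.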

[1]:

%load_ext autoreload

%autoreload 2
import warnings

import lours
from lours.dataset import from_caipy
from lours.utils.testing import assert_dataset_equal

warnings.simplefilter(action="ignore", category=FutureWarning)

Booleanization example#

Automatic booleanization#

By default, from_caipy booleanizes the dataset when use_schema is set to True.

See more info about advanced parsing with caipy and schemas in the related tutorial: Demo schemas

[2]:

from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=True,
)

Note also that, by default, columns are shown as they appear in the dataframe, i.e. as raw columns with boolean values. The default display options can be changed in the lours.utils module.

[3]:

lours.utils.DISPLAY_NESTED_COLUMNS = True
lours.utils.DISPLAY_UNBOOLEANIZED = True
from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=True,
)

Manual booleanization#

If you select booleanize=False when loading with from_caipy, you will keep the list-valued column. To booleanize it manually, you can call the .booleanize method. Make sure that every cell of the columns you pass to that method contains an iterable (a set or a list).

[4]:

dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    json_schema="default",
    booleanize=False,
)

[5]:

dataset

[6]:

booleanized = dataset.booleanize("attributes.colors")
booleanized

Working with booleanized data#

See how much simpler it is to filter annotations based on attributes.colors:

In this example, we are interested in keeping annotations that have “red” in their “colors”.

With a regular dataset, you need to call the .apply method, which runs a Python function on every row and is very inefficient.

[7]:

dataset.loc_annot[dataset.annotations["attributes.colors"].apply(lambda x: "red" in x)]

With a booleanized dataset, you can index .loc_annot directly with the attributes.colors.red column.

[8]:

booleanized.loc_annot[booleanized.annotations["attributes.colors.red"]]
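Boolean columns also make combined conditions straightforward. The sketch below is plain Python on hypothetical rows, not lours code: it keeps annotations that are red but not blue. With pandas boolean columns the same condition would be written by combining the columns with & and ~.

```python
# Illustrative sketch only: hypothetical booleanized rows, not lours objects.
bool_rows = [
    {"id": 0, "attributes.colors.blue": True, "attributes.colors.red": True},
    {"id": 1, "attributes.colors.blue": False, "attributes.colors.red": True},
    {"id": 2, "attributes.colors.blue": False, "attributes.colors.red": False},
]

# Keep the ids of rows that are red and not blue: each condition is a
# direct boolean lookup, with no membership test on a list.
red_not_blue = [
    row["id"]
    for row in bool_rows
    if row["attributes.colors.red"] and not row["attributes.colors.blue"]
]
print(red_not_blue)
```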

Debooleanization#

Although the original list columns are dropped in favor of the boolean ones, we keep track of them in a special attribute, Dataset.booleanized_columns.

As such, we can use the debooleanize method to get back to the original dataset. Note that debooleanization is required by several I/O methods:

to_caipy
to_caipy_generic
to_coco
to_fiftyone

[9]:

debool = booleanized.debooleanize()
assert_dataset_equal(debool, dataset)
debool
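The round trip checked above can be pictured with a minimal pure-Python sketch. As before, this is not the lours implementation; debooleanize_column and the sample rows are hypothetical. One assumption worth noting: this sketch rebuilds each list in column order, so the original element order inside a list is not recovered.

```python
# Illustrative sketch only: a plain-Python picture of debooleanization,
# not the lours implementation.
def debooleanize_column(rows, column):
    """Rebuild a list column from the boolean columns named '<column>.<value>'."""
    prefix = f"{column}."
    result = []
    for row in rows:
        # Keep every cell that is not one of the boolean columns.
        new_row = {key: cell for key, cell in row.items() if not key.startswith(prefix)}
        # Gather the values whose boolean cell is True, in column order.
        new_row[column] = [
            key[len(prefix):]
            for key, cell in row.items()
            if key.startswith(prefix) and cell
        ]
        result.append(new_row)
    return result


# Hypothetical booleanized rows.
bool_rows = [
    {"id": 0, "attributes.colors.blue": True, "attributes.colors.red": True},
    {"id": 1, "attributes.colors.blue": False, "attributes.colors.red": True},
    {"id": 2, "attributes.colors.blue": False, "attributes.colors.red": False},
]
restored = debooleanize_column(bool_rows, "attributes.colors")
for row in restored:
    print(row)
```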