Booleanize columns of lists in your Dataset dataframes#
This notebook shows how the booleanize and debooleanize methods can be used for easy attribute/tag filtering.
Booleanization is the action of converting a column of lists into a set of boolean columns, one per distinct element. Each boolean column tells whether that element is present in the original list or not.
It also shows how the widget lets you choose between displaying boolean values or list values.
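To make the definition of booleanization concrete, here is a minimal plain-Python sketch that works on a list of dictionaries instead of a dataframe. It is only an illustration of the idea and not how lours implements it: the helper name booleanize_rows and the sample rows are hypothetical, and only the column.element naming pattern is taken from the attributes.colors.red column used later in this notebook.

```python
def booleanize_rows(rows, column):
    """Replace a list-valued field with one boolean field per distinct element.

    Hypothetical helper for illustration only; this is not the lours API.
    """
    # Collect every distinct element found in the list-valued field.
    elements = sorted({element for row in rows for element in row[column]})
    result = []
    for row in rows:
        # Keep every field except the list-valued one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one boolean field per element, named "<column>.<element>".
        for element in elements:
            new_row[f"{column}.{element}"] = element in row[column]
        result.append(new_row)
    return result


# Made-up sample annotations, each with a list of colors.
annotations = [
    {"id": 0, "attributes.colors": ["red", "blue"]},
    {"id": 1, "attributes.colors": ["red"]},
    {"id": 2, "attributes.colors": []},
]

booleanized_rows = booleanize_rows(annotations, "attributes.colors")
print(booleanized_rows[0])
```

Once the rows are in this shape, keeping the "red" annotations is a simple lookup of the attributes.colors.red field rather than a membership test on every list.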
[1]:
%load_ext autoreload
%autoreload 2
import warnings
import lours
from lours.dataset import from_caipy
from lours.utils.testing import assert_dataset_equal
warnings.simplefilter(action="ignore", category=FutureWarning)
Booleanization example#
Note on widget interface#
In the “Annotations” tab, you can choose to show the dataframes booleanized or not, and with nested columns or not.
Keep in mind that under the hood the columns are booleanized and not nested; these display options are only there for the readability of the widget.
Automatic booleanization#
By default, from_caipy booleanizes the dataset when use_schema is set to True.
See more info about advanced parsing with caipy and schemas in the related tutorial: Demo schemas
[2]:
from_caipy(
"../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
use_schema=True,
booleanize=True,
)
Note that by default, columns are shown as they appear in the dataframe, i.e. as raw columns with boolean values. The default display options can be changed in the lours.utils module.
[3]:
lours.utils.DISPLAY_NESTED_COLUMNS = True
lours.utils.DISPLAY_UNBOOLEANIZED = True
from_caipy(
"../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
use_schema=True,
booleanize=True,
)
Manual Booleanization#
If you pass booleanize=False when loading with from_caipy, the original list column is kept. To booleanize it manually, call the .booleanize method. Make sure that every cell of the columns you pass to that method contains an iterable (be it a set or a list).
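The requirement that every cell holds a collection can be checked before calling .booleanize. The following is a minimal plain-Python sketch of such a check; the helper name all_cells_are_collections is hypothetical and not part of lours, and rejecting bare strings (which are technically iterable) is an assumption of this sketch rather than documented lours behaviour.

```python
def all_cells_are_collections(cells):
    """Return True if every cell is a list, set, tuple or frozenset.

    Hypothetical pre-check for illustration only; this is not the lours API.
    Bare strings and None are treated as invalid cells (an assumption).
    """
    return all(isinstance(cell, (list, set, tuple, frozenset)) for cell in cells)


# Made-up cell values: the first column is safe, the second is not.
valid_cells = [["red", "blue"], set(), ["red"]]
invalid_cells = [["red"], None, "blue"]

print(all_cells_are_collections(valid_cells))
print(all_cells_are_collections(invalid_cells))
```

With a dataframe, the same function could be given the values of the column to inspect, for instance the cells of attributes.colors.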
[4]:
dataset = from_caipy(
"../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
json_schema="default",
booleanize=False,
)
[5]:
dataset
[6]:
booleanized = dataset.booleanize("attributes.colors")
booleanized
Working with booleanized data#
See how much simpler it is to filter annotations based on attributes.colors:
In this example, we want to keep the annotations that have “red” in their “colors”.
With a regular dataset, you need to call the very inefficient .apply method.
[7]:
dataset.loc_annot[dataset.annotations["attributes.colors"].apply(lambda x: "red" in x)]
With a booleanized dataset, you can directly index .loc_annot with the attributes.colors.red column.
[8]:
booleanized.loc_annot[booleanized.annotations["attributes.colors.red"]]
Debooleanization#
Although the columns that were booleanized are dropped in favor of the boolean ones, we keep track of them in a special attribute, Dataset.booleanized_columns.
As such, we can use the debooleanize method to get back to the original dataset. Note that this method has to be used for several I/O methods: to_caipy, to_caipy_generic, to_coco and to_fiftyone.
[9]:
debool = booleanized.debooleanize()
assert_dataset_equal(debool, dataset)
debool
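The reverse operation can be pictured with a minimal plain-Python sketch on a list of dictionaries. It is an illustration of the idea only, not the lours implementation: the helper name debooleanize_rows and the sample rows are hypothetical, and the column name is passed explicitly here, which is the role Dataset.booleanized_columns plays in lours.

```python
def debooleanize_rows(rows, column):
    """Rebuild a list-valued field from its "<column>.<element>" boolean fields.

    Hypothetical helper for illustration only; this is not the lours API.
    """
    prefix = f"{column}."
    result = []
    for row in rows:
        # Keep every field that is not one of the boolean fields.
        new_row = {
            key: value for key, value in row.items() if not key.startswith(prefix)
        }
        # Gather the elements whose boolean field is True.
        new_row[column] = [
            key[len(prefix):]
            for key, value in row.items()
            if key.startswith(prefix) and value
        ]
        result.append(new_row)
    return result


# Made-up booleanized rows.
booleanized_rows = [
    {"id": 0, "attributes.colors.blue": True, "attributes.colors.red": True},
    {"id": 1, "attributes.colors.blue": False, "attributes.colors.red": True},
]

restored = debooleanize_rows(booleanized_rows, "attributes.colors")
print(restored)
```

In this sketch, the rebuilt lists follow the order of the boolean fields, so the original ordering of elements and any duplicates inside a list cannot be recovered.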