Schema mechanics#

The Caipy loader can parse caipy JSON files and efficiently transform them into column-wise data, which is more compatible with pandas.

Schemas help check the structure of the caipy JSON, automatically normalize the data so that it is no longer nested, and booleanize the sets. See the related tutorial on booleanization: Demo booleanization

What is a schema?#

A schema is a way to specify a data structure. For each key of an object, it gives the expected type, from simple types such as string or float to more complex ones such as lists and nested objects. That way, you can specify how the data should be nested.
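
As a small illustration, the sketch below writes a toy schema as a Python dict, together with a document that follows it. This is not the default caipy schema: every key name and enum value here is made up for the example.

```python
import json

# Hypothetical toy schema (NOT the default caipy schema): each key of the
# object is given a type, from simple ones to lists and nested objects.
toy_schema = {
    "type": "object",
    "properties": {
        "file_name": {"type": "string"},
        "width": {"type": "number"},
        # A nested object: "weather" lives inside "tags"
        "tags": {
            "type": "object",
            "properties": {
                "weather": {"type": "string", "enum": ["sunny", "rainy"]},
            },
        },
        # A list whose items must be unique and taken from a fixed set of values
        "colors": {
            "type": "array",
            "uniqueItems": True,
            "items": {"type": "string", "enum": ["red", "blue"]},
        },
    },
    "required": ["file_name"],
}

# A document whose nesting follows the structure described by toy_schema
toy_document = {
    "file_name": "image1.jpg",
    "width": 640,
    "tags": {"weather": "sunny"},
    "colors": ["red"],
}

print(json.dumps(toy_schema, indent=2))
```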

For more information about JSON schemas, see the official documentation.

This library provides a default schema, but you can also provide your own as a path or a URL.

For better readability, we use mercury’s JSON displayer, but you can simply replace mr.JSON with display for every dictionary shown.

[1]:
%%capture

%load_ext autoreload
%autoreload 2
import json

import mercury as mr

from lours.dataset.io.schema_util import load_json_schema

app = mr.App(title="Display notebook", static_notebook=True)
[2]:
default_caipy_schema = load_json_schema("default")
# Show the json with mercury, for better readability
mr.JSON(default_caipy_schema)

The types of interest to us are:

  • enum, which can be converted to a pandas categorical

  • array with uniqueItems set to True, which can be seen as an unordered set and thus can be booleanized

  • object, which tells us that the data is nested and thus needs to be normalized. For example, the weather tag for images is inside the tags object. In the images dataframe, it goes in the tags.weather column.

In the future, we might have to deal with arrays that are not unordered sets. They could be converted to categorical data within the columns array.1, array.2, etc., but this is not supported for the moment.
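
To make the three cases above concrete, here is a minimal sketch that walks a schema and labels each property by how it could be handled. The helper `classify_properties`, the labels it returns, and the toy schema are all hypothetical and written for this example only; they are not part of lours.

```python
def classify_properties(schema: dict, prefix: str = "") -> dict[str, str]:
    """Map each dotted property path to a label describing its handling.

    Hypothetical helper for illustration; not a lours function.
    """
    kinds = {}
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if sub.get("type") == "object":
            # Nested object: recurse, children become "parent.child" columns
            kinds.update(classify_properties(sub, prefix=f"{path}."))
        elif "enum" in sub:
            # Fixed set of possible values: candidate for a categorical column
            kinds[path] = "categorical"
        elif sub.get("type") == "array" and sub.get("uniqueItems", False):
            # Unordered set: candidate for booleanization
            kinds[path] = "set"
        else:
            kinds[path] = "plain"
    return kinds


# Toy schema, made up for the example
toy_schema = {
    "type": "object",
    "properties": {
        "file_name": {"type": "string"},
        "tags": {
            "type": "object",
            "properties": {"weather": {"type": "string", "enum": ["sunny", "rainy"]}},
        },
        "colors": {
            "type": "array",
            "uniqueItems": True,
            "items": {"enum": ["red", "blue"]},
        },
    },
}

kinds = classify_properties(toy_schema)
print(kinds)
# {'file_name': 'plain', 'tags.weather': 'categorical', 'colors': 'set'}
```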

Data checking#

The first obvious use of schemas is for validation.

For example, the following data structure is rejected because custom_dict["annotations"][0]["attributes"]["colors"] contains the value turquoise, while each item must be one of the values specified in the schema at default_caipy_schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]["enum"]

[3]:
with open("../../test_lours/test_data/caipy_dataset/tags/custom_schema/785.json") as f:
    custom_dict = json.load(f)

mr.JSON(custom_dict)

mr.JSON(
    default_caipy_schema["properties"]["annotations"]["items"]["properties"][
        "attributes"
    ]["properties"]["colors"]["items"]["enum"]
)
[ ]:
from jsonschema_rs import validator_for

validator = validator_for(default_caipy_schema)

validator.validate(custom_dict)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[4], line 5
      1 from jsonschema_rs import JSONSchema
      3 validator = JSONSchema(default_caipy_schema)
----> 5 validator.validate(custom_dict)

ValidationError: "turquoise" is not one of ["red","green","yellow","blue","white","black","orange","purple","grey","brown","pink","beige","cyan"]

Failed validating "enum" in schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]

On instance["annotations"][0]["attributes"]["colors"][1]:
    "turquoise"
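
To see what this error is reporting, the check on the items of an enum array can be sketched in plain Python. This is only an illustration of the idea, not how jsonschema_rs is implemented. The allowed values are copied from the error message above; the `colors` list is hypothetical, apart from having "turquoise" at index 1.

```python
def find_invalid_items(values: list, allowed: list) -> list[tuple[int, object]]:
    """Return (index, value) pairs for the items that are not in the enum.

    Hypothetical helper for illustration; a real validator checks far more.
    """
    return [(index, value) for index, value in enumerate(values) if value not in allowed]


# Allowed values, as listed in the validation error above
allowed_colors = [
    "red", "green", "yellow", "blue", "white", "black", "orange",
    "purple", "grey", "brown", "pink", "beige", "cyan",
]

# Hypothetical list of colors with an unknown value in second position
colors = ["blue", "turquoise"]

invalid = find_invalid_items(colors, allowed_colors)
print(invalid)
# [(1, 'turquoise')]
```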

List and enum formatting#

As mentioned above, we can use schemas to construct a dataframe with the right columns and dtypes, even if some values are not present.

A common pitfall is that some fields are present in some annotations but not in others. Pandas represents missing values with NaN and None, but we can replace them with the right default value. For example, if we know that a field is a list, we can give it the empty list as its default value.
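
The idea of a type-dependent default can be sketched without pandas. In the snippet below, `fill_defaults` and the `DEFAULTS` table are hypothetical names for this example; only the "a list defaults to the empty list" rule comes from the text above, and lours may handle other types differently.

```python
import copy

# Hypothetical table of defaults per JSON schema type.
# Only the array -> [] case is taken from the explanation above.
DEFAULTS = {"array": []}


def fill_defaults(record: dict, field_types: dict) -> dict:
    """Replace missing (None) values with the default of the field's type."""
    filled = dict(record)
    for field, json_type in field_types.items():
        if filled.get(field) is None and json_type in DEFAULTS:
            # Deep copy so that rows never share the same mutable default
            filled[field] = copy.deepcopy(DEFAULTS[json_type])
    return filled


# Field types as they could be read from a schema (assumed for the example)
field_types = {
    "attributes.colors": "array",
    "attributes.occluded": "boolean",
    "attributes.position": "array",
}

record = {
    "attributes.colors": ["red", "white"],
    "attributes.occluded": True,
    "attributes.position": None,  # missing in the original annotation
}

filled = fill_defaults(record, field_types)
print(filled["attributes.position"])
# []
```

Fields whose type has no entry in `DEFAULTS`, such as the boolean here, are left as None in this sketch.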

[5]:
from lours.dataset.io.schema_util import (
    fill_with_dtypes_and_default_value,
    get_enums,
    get_remapping_dict_from_schema,
)

image_schema = default_caipy_schema["properties"]["image"]
annotations_schema = default_caipy_schema["properties"]["annotations"]["items"]

To better understand how the data is flattened, we can look at some of the schema utility functions.

The get_remapping_dict_from_schema function constructs a nested dict whose values are the destination column names.
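
A remapping dict of this kind could be built as in the sketch below. `build_remapping` is a hypothetical helper written for illustration, not the implementation of get_remapping_dict_from_schema, and the exact output of the library function may differ. The toy schema is also made up.

```python
def build_remapping(schema: dict, prefix: str = "") -> dict:
    """Build a nested dict mirroring the schema, with column names as leaves.

    Hypothetical helper for illustration; not a lours function.
    """
    remapping = {}
    for name, sub in schema.get("properties", {}).items():
        column = f"{prefix}{name}"
        if sub.get("type") == "object":
            # Keep the nesting, and prefix the children with the parent name
            remapping[name] = build_remapping(sub, prefix=f"{column}.")
        else:
            remapping[name] = column
    return remapping


# Toy image schema, made up for the example
toy_image_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "tags": {
            "type": "object",
            "properties": {"weather": {"type": "string"}},
        },
    },
}

remapping = build_remapping(toy_image_schema)
print(remapping)
# {'id': 'id', 'tags': {'weather': 'tags.weather'}}
```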

get_enums searches for arrays with unique items and retrieves all their possible values. These are used to construct boolean columns that tell us, for each possible value, whether it was in the original list for a given row.
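
The boolean columns described here can be sketched as follows. The helper `booleanize`, the `attributes.colors.<value>` column naming, and the reduced set of possible colors are assumptions made for this example; the actual lours output may be named and ordered differently.

```python
def booleanize(values: list, possible_values: set) -> dict:
    """Return one boolean per possible value: True if it was in the list."""
    present = set(values)
    # Sort so that the column order is deterministic
    return {value: value in present for value in sorted(possible_values)}


# Hypothetical reduced set of possible values for the colors attribute
possible_colors = {"red", "white", "grey"}

# Original lists, indexed by annotation id
rows = {269791: ["red", "white"], 1161234: ["grey"]}

boolean_columns = {
    row_id: {
        f"attributes.colors.{value}": flag  # assumed column naming
        for value, flag in booleanize(colors, possible_colors).items()
    }
    for row_id, colors in rows.items()
}

print(boolean_columns[269791])
# {'attributes.colors.grey': False, 'attributes.colors.red': True, 'attributes.colors.white': True}
```

With a schema that allows fewer values, this produces fewer boolean columns, since there is one column per possible value.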

[6]:
mr.JSON(get_remapping_dict_from_schema(image_schema))
[7]:
enums = get_enums(annotations_schema)
# convert sets to list for json serialization
mr.JSON({k: list(v) for k, v in enums.items()})

In the following cells, we use the caipy loader with and without the default schema. Notice that in the annotations data of image1, there is no “position” key in the dictionary.

If we load the caipy dataset without a schema, we can still flatten the JSON data thanks to pandas.json_normalize, but the missing data will be set to None, whereas it should be an empty list.

Using the schema helps set the right default value.
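
The problem can be reproduced with a small standard-library sketch that mimics the idea of flattening, without using pandas.json_normalize itself. The `flatten` helper is hypothetical, and the two annotations are trimmed-down versions of the ones used in this section.

```python
def flatten(nested: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Trimmed-down annotations: the first has no "position", the second no "occluded"
annotations = [
    {"id": 269791, "attributes": {"colors": ["red", "white"], "occluded": True}},
    {"id": 1161234, "attributes": {"colors": ["grey"], "position": ["front"]}},
]

flat_rows = [flatten(annotation) for annotation in annotations]

# Without a schema, the columns are the union of the keys seen in the data,
# and a key missing from a row can only be filled with None
columns = sorted({key for row in flat_rows for key in row})
table = [{column: row.get(column) for column in columns} for row in flat_rows]

print(table[0]["attributes.position"])
# None
```

Nothing in the data itself says that attributes.position is a list, which is why the empty-list default has to come from the schema.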

[8]:
from lours.dataset import from_caipy

with open(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/Annotations/image1.json"
) as f:
    caipy_json = json.load(f)
mr.JSON(caipy_json["annotations"][0])
with open(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/Annotations/image2.json"
) as f:
    caipy_json = json.load(f)
mr.JSON(caipy_json["annotations"][0])
[9]:
no_schema_dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=False,
)

schema_dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=False,
)
[10]:
no_schema_dataset.annotations
[10]:
         image_id  category_str  category_id  box_x_min  box_y_min  box_width  box_height        area  attributes.colors  attributes.occluded  attributes.position
id
269791       6091     stop sign           13     100.22     117.54     253.43      274.90  46726.4303       [red, white]                 True                 None
1161234     10395    teddy bear           88      49.19      66.20     378.97      379.68  84587.4391             [grey]                 None              [front]
[11]:
schema_dataset.annotations
[11]:
         image_id  category_str  category_id  box_x_min  box_y_min  box_width  box_height        area  attributes.colors  attributes.occluded  attributes.position
id
269791       6091     stop sign           13     100.22     117.54     253.43      274.90  46726.4303       [red, white]                 True                   []
1161234     10395    teddy bear           88      49.19      66.20     378.97      379.68  84587.4391             [grey]                 None              [front]

Note that with the schema utility fill_with_dtypes_and_default_value, we can also fill in the default values afterward.

[12]:
fill_with_dtypes_and_default_value(annotations_schema, schema_dataset.annotations)
[12]:
         image_id  category_str  category_id  box_x_min  box_y_min  box_width  box_height        area  attributes.colors  attributes.occluded  attributes.position
id
269791       6091     stop sign           13     100.22     117.54     253.43      274.90  46726.4303       [red, white]                 True                   []
1161234     10395    teddy bear           88      49.19      66.20     378.97      379.68  84587.4391             [grey]                 None              [front]

From dataframe to nested json#

Once you have manipulated your dataset, you can save it again according to the schema.

[13]:
from lours.dataset.io.schema_util import remap_dict

flat_dict = schema_dataset.annotations.iloc[0].to_dict()
mr.JSON(flat_dict)

nested_dict = remap_dict(flat_dict, get_remapping_dict_from_schema(annotations_schema))
mr.JSON(nested_dict)
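
The reverse operation can be illustrated with a simpler technique than the one used above: instead of following a remapping dict derived from the schema, the sketch below just splits column names on dots. `unflatten` is a hypothetical helper, not remap_dict, and it would misbehave on keys that contain a dot for another reason.

```python
def unflatten(flat: dict, sep: str = ".") -> dict:
    """Rebuild a nested dict from dotted keys (hypothetical helper)."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(sep)
        node = nested
        for parent in parents:
            # Create the intermediate dict the first time a parent is seen
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


# A flat row, trimmed down from the annotations dataframe shown earlier
flat = {
    "category_str": "stop sign",
    "attributes.colors": ["red", "white"],
    "attributes.occluded": True,
}

nested = unflatten(flat)
print(nested)
# {'category_str': 'stop sign', 'attributes': {'colors': ['red', 'white'], 'occluded': True}}
```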

Using a custom schema#

If you want to work with custom data, you can provide your own JSON schema instead of the ones shipped with the official package.

The schema given to from_caipy and from_caipy_generic can be either a path to a JSON file or a dictionary.
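
Accepting both forms usually comes down to a small dispatch on the argument type. The sketch below shows one possible way to do it; `resolve_schema` is a hypothetical helper and not necessarily how lours handles its schema argument.

```python
import json
import tempfile
from pathlib import Path


def resolve_schema(schema) -> dict:
    """Return a schema dict from either a dict or a path to a JSON file.

    Hypothetical helper for illustration; not a lours function.
    """
    if isinstance(schema, dict):
        # Already a dictionary: use it as is
        return schema
    with open(Path(schema)) as f:
        return json.load(f)


# Toy schema, made up for the example
toy_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}

# Same schema given as a path, using a temporary file for the demonstration
with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "toy_schema.json"
    schema_path.write_text(json.dumps(toy_schema))
    from_path = resolve_schema(schema_path)

from_dict = resolve_schema(toy_schema)
print(from_path == from_dict)
# True
```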

In the following schema, the value “turquoise” is now considered a valid value in the image’s spectrum. Also, the possible values for list items in annotation colors (annotation["attributes"]["colors"]) and annotation actions (annotation["attributes"]["actions"]) are reduced to only two possible items each: “blue” and “white” for colors, and “sitting” and “laying” for actions.
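
A restriction of this kind can be sketched by copying a schema and overwriting one enum. The schema below is a toy with a made-up list of colors that only reproduces the nesting used in this section; the custom schema of this tutorial is loaded from a file, not built this way.

```python
import copy

# Toy schema reproducing only the nesting down to the colors enum
toy_default_schema = {
    "properties": {
        "annotations": {
            "type": "array",
            "items": {
                "properties": {
                    "attributes": {
                        "type": "object",
                        "properties": {
                            "colors": {
                                "type": "array",
                                "uniqueItems": True,
                                "items": {"enum": ["red", "green", "blue", "white"]},
                            }
                        },
                    }
                }
            },
        }
    }
}

# Deep copy so that the original schema is left untouched
toy_custom_schema = copy.deepcopy(toy_default_schema)
colors = toy_custom_schema["properties"]["annotations"]["items"]["properties"][
    "attributes"
]["properties"]["colors"]
colors["items"]["enum"] = ["blue", "white"]

print(colors["items"]["enum"])
# ['blue', 'white']
```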

[14]:
with open("../../test_lours/test_data/caipy_dataset/tags/custom_schema.json") as f:
    custom_schema = json.load(f)

mr.JSON(custom_schema)
[15]:
from lours.dataset import from_caipy_generic

As mentioned above, this cell will fail because by default caipy expects the CA-V5.b schema.

[16]:
dataset = from_caipy_generic(
    images_folder=None,
    annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
    use_schema=True,
)
specifying a fictive path for images : ../../test_lours/test_data/caipy_dataset/tags/Images
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[16], line 1
----> 1 dataset = from_caipy_generic(
      2     images_folder=None,
      3     annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
      4     use_schema=True,
      5 )

File ~/workspace/Bamboo/lours/dataset/io/caipy.py:326, in from_caipy_generic(images_folder, annotations_folder, dataset_name, split, splits_to_read, use_schema, json_schema, booleanize)
    323     dataset += split_dataset
    325 if len(dataset) == 0 and splits_to_read is None:
--> 326     dataset = load_caipy_split(
    327         images_folder=images_folder,
    328         annotations_folder=annotations_folder,
    329         dataset_name=dataset_name,
    330         split_name=split,
    331         schema=schema,
    332     )
    334 if schema is not None:
    335     image_schema = schema["properties"]["image"]

File ~/workspace/Bamboo/lours/dataset/io/caipy.py:132, in load_caipy_split(images_folder, annotations_folder, dataset_name, split_name, schema)
    105 def load_caipy_split(
    106     images_folder: Path,
    107     annotations_folder: Path,
   (...)
    110     schema: dict | None = None,
    111 ) -> Dataset:
    112     """Load a particular caipy split folder and convert it to a lours Dataset
    113
    114     Args:
   (...)
    130         caipy splits
    131     """
--> 132     images, annotations = load_caipy_annot_folder(annotations_folder, schema)
    133     if images is not None:
    134         if not images.index.is_unique:

File ~/workspace/Bamboo/lours/dataset/io/caipy.py:55, in load_caipy_annot_folder(folder_path, schema)
     53     frame_data = json.load(f)
     54 if validator is not None:
---> 55     validator.validate(frame_data)
     56 if "type" in frame_data.keys():
     57     assert (
     58         frame_data["type"] == "instances"
     59     ), "Only instance type supported for now"

ValidationError: "turquoise" is not one of ["red","green","yellow","blue","white","black","orange","purple","grey","brown","pink","beige","cyan"]

Failed validating "enum" in schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]

On instance["annotations"][0]["attributes"]["colors"][1]:
    "turquoise"

This one will succeed.

Also note that there are fewer booleanized columns for attributes.actions and attributes.colors.

[17]:
dataset = from_caipy_generic(
    images_folder=None,
    annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
    use_schema=True,
    json_schema=custom_schema,
)
specifying a fictive path for images : ../../test_lours/test_data/caipy_dataset/tags/Images
[18]:
dataset