Schema mechanics#
The caipy loader can parse caipy JSON files and efficiently transform them into column-wise data, which is more compatible with pandas.
Schemas help check the caipy JSON structure, automatically normalize the data so that it is no longer nested, and booleanize the sets. See the related tutorial about booleanization: Demo booleanization
What is a schema?#
A schema is a way to specify a data structure. For each key of an object, it gives the type of the value. Types range from simple ones, like a string or a float, to more complex ones, like lists and objects themselves. That way, you can specify how the data should be nested.
See more info about JSON schemas in the official documentation.
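As an illustration, here is a minimal sketch of a JSON schema written as a Python dictionary. The field names (file_name, width, tags, weather, colors) are made up for the example and are not taken from the default caipy schema. The helper declared_type is hypothetical as well; it only follows nested "properties" entries to show how a schema describes nesting.

```python
# Toy JSON schema: field names are illustrative, not the caipy schema.
toy_schema = {
    "type": "object",
    "properties": {
        "file_name": {"type": "string"},
        "width": {"type": "number"},
        # A nested object: its keys are described under its own "properties"
        "tags": {
            "type": "object",
            "properties": {
                "weather": {"type": "string", "enum": ["sunny", "rainy"]},
            },
        },
        # A list whose items must be unique and taken from a fixed set
        "colors": {
            "type": "array",
            "uniqueItems": True,
            "items": {"type": "string", "enum": ["red", "blue"]},
        },
    },
}


def declared_type(schema, path):
    """Follow a list of keys through nested "properties" and return the type."""
    node = schema
    for key in path:
        node = node["properties"][key]
    return node["type"]


weather_type = declared_type(toy_schema, ["tags", "weather"])
colors_type = declared_type(toy_schema, ["colors"])
print(weather_type, colors_type)
```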
This library provides a default schema, but you can also provide your own, as a path or a URL.
For better readability, we use mercury’s JSON displayer, but you can simply replace mr.JSON with display for every dictionary shown.
[1]:
%%capture
%load_ext autoreload
%autoreload 2
import json
import mercury as mr
from lours.dataset.io.schema_util import load_json_schema
app = mr.App(title="Display notebook", static_notebook=True)
[2]:
default_caipy_schema = load_json_schema("default")
# Show the json with mercury, for better readability
mr.JSON(default_caipy_schema)
The interesting types for us are:
- enum, which can be converted to a pandas categorical
- array, with uniqueItems set to True: this can be seen as an unordered set and thus can be booleanized
- object: this tells us that the data is nested and thus needs to be normalized. For example, the weather tag for images is inside the tags object. In the images dataframe, this will go in the tags.weather column.
In the future, we might have to deal with arrays that are not sets of unique items. They could be converted to categorical data in the columns array.1, array.2, etc., but this is not supported for the moment.
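The two transformations described above can be sketched on plain dictionaries. This is only an illustration under simple assumptions: flatten and booleanize are hypothetical helpers, not lours functions, and the names of the boolean columns actually produced by the loader may differ.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted path
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def booleanize(values, possible_values):
    """Turn a list seen as a set into one boolean per possible value."""
    present = set(values)
    return {value: value in present for value in possible_values}


# Illustrative data, not taken from the test dataset
image = {"file_name": "a.jpg", "tags": {"weather": "sunny"}}
flat_image = flatten(image)
color_flags = booleanize(["red", "white"], ["red", "green", "white"])
print(flat_image)
print(color_flags)
```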
Data checking#
The first obvious use of schemas is for validation.
For example, the following data structure is rejected because custom_dict["annotations"][0]["attributes"]["colors"] contains the value turquoise, while each item must be one of the values specified in the schema at default_caipy_schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]["enum"]
[3]:
with open("../../test_lours/test_data/caipy_dataset/tags/custom_schema/785.json") as f:
    custom_dict = json.load(f)
mr.JSON(custom_dict)
mr.JSON(
    default_caipy_schema["properties"]["annotations"]["items"]["properties"][
        "attributes"
    ]["properties"]["colors"]["items"]["enum"]
)
[4]:
from jsonschema_rs import JSONSchema
validator = JSONSchema(default_caipy_schema)
validator.validate(custom_dict)
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
Cell In[4], line 5
1 from jsonschema_rs import JSONSchema
3 validator = JSONSchema(default_caipy_schema)
----> 5 validator.validate(custom_dict)
ValidationError: "turquoise" is not one of ["red","green","yellow","blue","white","black","orange","purple","grey","brown","pink","beige","cyan"]
Failed validating "enum" in schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]
On instance["annotations"][0]["attributes"]["colors"][1]:
"turquoise"
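To make the meaning of the "enum" rule concrete, the failing check can be sketched by hand with the standard library. This is not how jsonschema_rs works internally; the helper invalid_enum_items is hypothetical, the allowed values are copied from the error message above, and the colors list is an assumed example with the offending value at index 1.

```python
# Allowed values, copied from the validation error message
ALLOWED_COLORS = [
    "red", "green", "yellow", "blue", "white", "black", "orange",
    "purple", "grey", "brown", "pink", "beige", "cyan",
]


def invalid_enum_items(items, allowed):
    """Return (index, value) for each item that is not an allowed value."""
    allowed_set = set(allowed)
    return [(i, v) for i, v in enumerate(items) if v not in allowed_set]


# Assumed example list: only the second item is outside the enum
colors = ["blue", "turquoise"]
errors = invalid_enum_items(colors, ALLOWED_COLORS)
print(errors)
```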
List and enum formatting#
As mentioned above, we can use schemas to construct dataframes with the right columns and dtypes, even if some values are not present.
A common pitfall is the mix of fields that are present in some annotations but not in others. Pandas represents missing values with NaN and None, but we can replace them with the right default value. For example, if we know that a field is a list, we can give it the empty list as its default value.
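The idea of schema-driven defaults can be sketched on plain dictionaries. The helper fill_missing_lists and the toy schema fragment below are assumptions made for the example; they do not reproduce the lours implementation, which works on dataframes.

```python
# Toy schema fragment: only the declared types matter here
toy_properties = {
    "colors": {"type": "array"},
    "position": {"type": "array"},
    "occluded": {"type": "boolean"},
}


def fill_missing_lists(rows, properties):
    """Give every missing or None array field an empty list as default."""
    filled = []
    for row in rows:
        new_row = dict(row)  # do not modify the input rows
        for name, sub_schema in properties.items():
            if sub_schema.get("type") == "array" and new_row.get(name) is None:
                new_row[name] = []
        filled.append(new_row)
    return filled


rows = [
    {"colors": ["red", "white"], "occluded": True},  # no "position"
    {"colors": ["grey"], "position": ["front"]},  # no "occluded"
]
filled_rows = fill_missing_lists(rows, toy_properties)
print(filled_rows)
```

Only array fields get a default in this sketch; a missing boolean such as occluded is left untouched, since no value is obviously the right one.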
[5]:
from lours.dataset.io.schema_util import (
    fill_with_dtypes_and_default_value,
    get_enums,
    get_remapping_dict_from_schema,
)
image_schema = default_caipy_schema["properties"]["image"]
annotations_schema = default_caipy_schema["properties"]["annotations"]["items"]
To better understand how the data is flattened, we can look at some schema utility functions.
get_remapping_dict_from_schema constructs a nested dict whose values are the names of the destination columns.
get_enums searches for arrays with unique items and retrieves all their possible values. These are used to construct boolean columns that tell us, for each enum value, whether it was in the original list for a given row.
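To give an intuition of what these two utilities compute, here is a simplified sketch on a toy schema. The functions sketch_remapping and sketch_enums are hypothetical reimplementations written for this page; the actual lours functions may handle more cases and return a different format.

```python
def sketch_remapping(schema, prefix=""):
    """Nested dict mirroring the schema; leaves are dotted column names."""
    mapping = {}
    for key, sub_schema in schema.get("properties", {}).items():
        path = f"{prefix}{key}"
        if sub_schema.get("type") == "object":
            mapping[key] = sketch_remapping(sub_schema, prefix=f"{path}.")
        else:
            mapping[key] = path
    return mapping


def sketch_enums(schema, prefix=""):
    """Map each unique-items array column to the set of its enum values."""
    enums = {}
    for key, sub_schema in schema.get("properties", {}).items():
        path = f"{prefix}{key}"
        if sub_schema.get("type") == "object":
            enums.update(sketch_enums(sub_schema, prefix=f"{path}."))
        elif sub_schema.get("type") == "array" and sub_schema.get("uniqueItems"):
            values = sub_schema.get("items", {}).get("enum")
            if values is not None:
                enums[path] = set(values)
    return enums


# Toy schema, much smaller than the default caipy schema
toy_schema = {
    "type": "object",
    "properties": {
        "category_id": {"type": "integer"},
        "attributes": {
            "type": "object",
            "properties": {
                "occluded": {"type": "boolean"},
                "colors": {
                    "type": "array",
                    "uniqueItems": True,
                    "items": {"type": "string", "enum": ["red", "white"]},
                },
            },
        },
    },
}
remapping = sketch_remapping(toy_schema)
toy_enums = sketch_enums(toy_schema)
print(remapping)
print(toy_enums)
```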
[6]:
mr.JSON(get_remapping_dict_from_schema(image_schema))
[7]:
enums = get_enums(annotations_schema)
# convert sets to list for json serialization
mr.JSON({k: list(v) for k, v in enums.items()})
In the following cells, we use the caipy loader with and without the default schema. Notice how, in the annotations data of image1, there is no “position” in the dictionary.
If we load the caipy dataset without a schema, we can still flatten the JSON data thanks to pandas.json_normalize, but the missing data will be set to None, while it should be an empty list.
Using the schema helps set the right default value.
[8]:
from lours.dataset import from_caipy
with open(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/Annotations/image1.json"
) as f:
    caipy_json = json.load(f)
mr.JSON(caipy_json["annotations"][0])
with open(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/Annotations/image2.json"
) as f:
    caipy_json = json.load(f)
mr.JSON(caipy_json["annotations"][0])
[9]:
no_schema_dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=False,
)
schema_dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=False,
)
[10]:
no_schema_dataset.annotations
[10]:
| id | image_id | category_str | category_id | box_x_min | box_y_min | box_width | box_height | area | attributes.colors | attributes.occluded | attributes.position |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 269791 | 6091 | stop sign | 13 | 100.22 | 117.54 | 253.43 | 274.90 | 46726.4303 | [red, white] | True | None |
| 1161234 | 10395 | teddy bear | 88 | 49.19 | 66.20 | 378.97 | 379.68 | 84587.4391 | [grey] | None | [front] |
[11]:
schema_dataset.annotations
[11]:
| id | image_id | category_str | category_id | box_x_min | box_y_min | box_width | box_height | area | attributes.colors | attributes.occluded | attributes.position |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 269791 | 6091 | stop sign | 13 | 100.22 | 117.54 | 253.43 | 274.90 | 46726.4303 | [red, white] | True | [] |
| 1161234 | 10395 | teddy bear | 88 | 49.19 | 66.20 | 378.97 | 379.68 | 84587.4391 | [grey] | None | [front] |
Note that, thanks to the schema tool fill_with_dtypes_and_default_value, we can also set default values afterward.
[12]:
fill_with_dtypes_and_default_value(annotations_schema, schema_dataset.annotations)
[12]:
| id | image_id | category_str | category_id | box_x_min | box_y_min | box_width | box_height | area | attributes.colors | attributes.occluded | attributes.position |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 269791 | 6091 | stop sign | 13 | 100.22 | 117.54 | 253.43 | 274.90 | 46726.4303 | [red, white] | True | [] |
| 1161234 | 10395 | teddy bear | 88 | 49.19 | 66.20 | 378.97 | 379.68 | 84587.4391 | [grey] | None | [front] |
From dataframe to nested json#
Once you have manipulated your dataset you can then re-save it according to the schema.
[13]:
from lours.dataset.io.schema_util import remap_dict
flat_dict = schema_dataset.annotations.iloc[0].to_dict()
mr.JSON(flat_dict)
nested_dict = remap_dict(flat_dict, get_remapping_dict_from_schema(annotations_schema))
mr.JSON(nested_dict)
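The reverse operation, from dotted columns back to a nested dictionary, can be sketched as follows. The function sketch_remap is a hypothetical stand-in written for illustration, not the remap_dict function from lours; in particular, skipping columns that are absent from the flat dict is an assumption of this sketch.

```python
def sketch_remap(flat, remapping):
    """Rebuild a nested dict from a flat one, following a remapping dict.

    The remapping mirrors the target nesting; its leaves are the names of
    the flat columns to read from.
    """
    nested = {}
    for key, target in remapping.items():
        if isinstance(target, dict):
            child = sketch_remap(flat, target)
            if child:  # skip sub-objects with no value at all
                nested[key] = child
        elif target in flat:
            nested[key] = flat[target]
    return nested


# Toy remapping and row, in the spirit of the annotations dataframe
toy_remapping = {
    "category_id": "category_id",
    "attributes": {
        "occluded": "attributes.occluded",
        "colors": "attributes.colors",
    },
}
toy_flat = {
    "category_id": 13,
    "attributes.colors": ["red", "white"],
    "attributes.occluded": True,
}
toy_nested = sketch_remap(toy_flat, toy_remapping)
print(toy_nested)
```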
Using a custom schema#
If you have custom data that you want to work with, you can give your own JSON schema instead of the one provided by the official package.
The schema given to from_caipy and from_caipy_generic can be either a path to a JSON file or directly a dictionary.
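Accepting either form can be sketched with a small helper. The name resolve_schema is hypothetical and this is not the loader's actual code; it only illustrates the "path or dictionary" convention with the standard library.

```python
import json
import tempfile
from pathlib import Path


def resolve_schema(schema):
    """Return a schema dict from either a dict or a path to a JSON file."""
    if isinstance(schema, dict):
        return schema
    with open(Path(schema)) as f:
        return json.load(f)


# Toy schema written to a temporary file to exercise both forms
toy_schema = {"type": "object", "properties": {"name": {"type": "string"}}}
with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "toy_schema.json"
    schema_path.write_text(json.dumps(toy_schema))
    schema_from_path = resolve_schema(schema_path)
schema_from_dict = resolve_schema(toy_schema)
print(schema_from_path == schema_from_dict)
```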
In the following schema, the value “turquoise” is now considered a valid value in the image’s spectrum. Also, the possible list items for annotation colors (annotations["attributes"]["colors"]) and annotation actions (annotations["attributes"]["actions"]) are reduced to only two each: “blue” and “white” for colors, and “sitting” and “laying” for actions.
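One possible way to build such a custom schema is to copy an existing one and narrow its enums. The sketch below uses a toy base schema whose nesting mimics the path shown earlier on this page; it is an assumption for illustration, not the content of custom_schema.json.

```python
import copy

# Toy base schema: only the part needed to reach the colors enum
base_schema = {
    "type": "object",
    "properties": {
        "annotations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "attributes": {
                        "type": "object",
                        "properties": {
                            "colors": {
                                "type": "array",
                                "uniqueItems": True,
                                "items": {"enum": ["red", "blue", "white"]},
                            },
                        },
                    },
                },
            },
        },
    },
}

# Deep copy so that the base schema is left untouched
narrowed_schema = copy.deepcopy(base_schema)
colors_items = narrowed_schema["properties"]["annotations"]["items"][
    "properties"
]["attributes"]["properties"]["colors"]["items"]
colors_items["enum"] = ["blue", "white"]

base_enum = base_schema["properties"]["annotations"]["items"]["properties"][
    "attributes"
]["properties"]["colors"]["items"]["enum"]
print(base_enum, colors_items["enum"])
```

With fewer enum values, booleanization produces fewer columns for that field.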
[14]:
with open("../../test_lours/test_data/caipy_dataset/tags/custom_schema.json") as f:
    custom_schema = json.load(f)
mr.JSON(custom_schema)
[15]:
from lours.dataset import from_caipy_generic
As mentioned above, this cell fails because, by default, the caipy loader expects the CA-V5.b schema.
[16]:
dataset = from_caipy_generic(
    images_folder=None,
    annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
    use_schema=True,
)
specifying a fictive path for images : ../../test_lours/test_data/caipy_dataset/tags/Images
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
Cell In[16], line 1
----> 1 dataset = from_caipy_generic(
2 images_folder=None,
3 annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
4 use_schema=True,
5 )
File ~/workspace/Bamboo/lours/dataset/io/caipy.py:326, in from_caipy_generic(images_folder, annotations_folder, dataset_name, split, splits_to_read, use_schema, json_schema, booleanize)
323 dataset += split_dataset
325 if len(dataset) == 0 and splits_to_read is None:
--> 326 dataset = load_caipy_split(
327 images_folder=images_folder,
328 annotations_folder=annotations_folder,
329 dataset_name=dataset_name,
330 split_name=split,
331 schema=schema,
332 )
334 if schema is not None:
335 image_schema = schema["properties"]["image"]
File ~/workspace/Bamboo/lours/dataset/io/caipy.py:132, in load_caipy_split(images_folder, annotations_folder, dataset_name, split_name, schema)
105 def load_caipy_split(
106 images_folder: Path,
107 annotations_folder: Path,
(...)
110 schema: dict | None = None,
111 ) -> Dataset:
112 """Load a particular caipy split folder and convert it to a lours Dataset
113
114 Args:
(...)
130 caipy splits
131 """
--> 132 images, annotations = load_caipy_annot_folder(annotations_folder, schema)
133 if images is not None:
134 if not images.index.is_unique:
File ~/workspace/Bamboo/lours/dataset/io/caipy.py:55, in load_caipy_annot_folder(folder_path, schema)
53 frame_data = json.load(f)
54 if validator is not None:
---> 55 validator.validate(frame_data)
56 if "type" in frame_data.keys():
57 assert (
58 frame_data["type"] == "instances"
59 ), "Only instance type supported for now"
ValidationError: "turquoise" is not one of ["red","green","yellow","blue","white","black","orange","purple","grey","brown","pink","beige","cyan"]
Failed validating "enum" in schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]
On instance["annotations"][0]["attributes"]["colors"][1]:
"turquoise"
This one succeeds.
Also note that there are fewer booleanized columns for attributes.actions and attributes.colors.
[17]:
dataset = from_caipy_generic(
    images_folder=None,
    annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
    use_schema=True,
    json_schema=custom_schema,
)
specifying a fictive path for images : ../../test_lours/test_data/caipy_dataset/tags/Images
[18]:
dataset