schema_util

fill_with_dtypes_and_default_value(schema: dict, input_dataframe: DataFrame, separator: str = '.') -> DataFrame

Given a schema and a DataFrame constructed from a list of corresponding dicts, avoid NaN values by setting the default value when possible.

The DataFrame is expected to have been constructed with pandas.json_normalize().

Parameters:
  • schema – JSON schema describing expected input format of input dicts.

  • input_dataframe – Input DataFrame with possibly missing values (which are thus set to NaN).

  • separator – Character used to separate name in flattened key. Defaults to “.”.

Returns:

DataFrame similar to input_dataframe but with NaN replaced with default values when possible.
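The idea can be illustrated with plain pandas. This is a minimal sketch, not the function's actual implementation: the records, the `dtypes` mapping and the `defaults` mapping are hypothetical and written by hand here, whereas the function derives them from the schema.

```python
import pandas as pd

# Hypothetical records; the second and third lack some optional fields.
records = [
    {"id": 1, "attributes": {"occluded": True, "score": 2}},
    {"id": 2, "attributes": {"score": 5}},
    {"id": 3},
]

# Flattening yields the columns "id", "attributes.occluded" and
# "attributes.score"; absent values show up as NaN.
frame = pd.json_normalize(records)

# Hypothetical dtypes and defaults, written by hand for this example.
dtypes = {"attributes.occluded": "boolean", "attributes.score": "UInt64"}
defaults = {"attributes.occluded": False, "attributes.score": 0}

# Cast to nullable dtypes first (NaN becomes pd.NA), then fill the gaps.
filled = frame.astype(dtypes).fillna(value=defaults)
```

After these steps, `filled` has no missing value left in the two optional columns, and the score column keeps an integer dtype.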

flatten_schema(schema: dict, separator: str = '.', prefix: str | None = None) -> list[str]

From a particular schema, get the list of keys expected if the schema were flattened, e.g. by pandas.json_normalize().

Note

This function is meant to be called recursively, hence the prefix option.

Parameters:
  • schema – JSON schema describing the expected output format.

  • separator – Character used to separate name in flattened key. Defaults to “.”.

  • prefix – Prefix to apply to column names in output dictionary values. Defaults to None.

Returns:

List of flattened column names.
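The recursion over the prefix can be sketched as follows. `flatten_schema_sketch` is a hypothetical stand-in, not the library's code, and it assumes that nested objects are declared with `"type": "object"` and a `"properties"` mapping.

```python
def flatten_schema_sketch(schema, separator=".", prefix=None):
    """Collect flattened key names from a JSON-schema-like dict (illustrative)."""
    names = []
    for key, sub_schema in schema.get("properties", {}).items():
        # Prepend the parent path, if any, to build the flattened name.
        full_name = key if prefix is None else f"{prefix}{separator}{key}"
        if sub_schema.get("type") == "object" and "properties" in sub_schema:
            # Nested object: recurse with the current path as prefix.
            names.extend(flatten_schema_sketch(sub_schema, separator, full_name))
        else:
            names.append(full_name)
    return names


# Hypothetical schema used only for this example.
example_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "bbox": {
            "type": "object",
            "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
        },
    },
}
flat_names = flatten_schema_sketch(example_schema)
```

With this schema, `flat_names` is `["id", "bbox.x", "bbox.y"]`, which matches the column names pandas.json_normalize() would produce for conforming records.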

get_dtypes_and_default_values(schema: dict, separator: str = '.') -> tuple[dict, dict]

Given a schema, find the default values and dtypes to apply to a flattened version of a dict corresponding to the schema.

For optional integers and booleans, pandas’ nullable dtypes are used, with np.nan replaced by pd.NA. Otherwise, these columns would be cast to float as soon as a value is missing. See pandas.BooleanDtype and pandas.UInt64Dtype.

Parameters:
  • schema – JSON schema describing expected input format of input dicts.

  • separator – Character used to separate name in flattened key. Defaults to “.”.

Returns:

Dictionary with same keys as the flattened dictionary, and with the default values as values. If no default could be found (ambiguous type), the key is not present.
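The effect of the nullable dtypes can be checked directly in pandas. The snippet below is a small illustration with made-up values and does not call this module.

```python
import pandas as pd

# Without a nullable dtype, a missing integer forces the column to float64.
plain_ints = pd.Series([1, None])

# With pandas' nullable dtypes, the column keeps its type and holds pd.NA.
nullable_ints = pd.Series([1, None], dtype="UInt64")
nullable_bools = pd.Series([True, None], dtype="boolean")
```

Here `plain_ints` has dtype float64, while `nullable_ints` and `nullable_bools` keep the UInt64 and boolean dtypes and store pd.NA for the missing entry.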

get_enums(schema: dict, separator: str = '.', ignore_pattern: str = 'a^') -> dict[str, set]

From a schema, get column names that can be converted to sets of boolean columns.

In the output dictionary, each column name is associated with the set of its possible values.

Parameters:
  • schema – JSON schema dict describing the expected format of input data

  • separator – Separator to apply for path to get flattened paths in the dataset’s DataFrames. Defaults to “.”.

  • ignore_pattern – Columns matching this regex pattern are ignored. Defaults to “a^”.

Returns:

Dictionary describing enum columns and possible values (and thus created columns)
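A minimal sketch of how enum columns could be collected is given below. `get_enums_sketch` is hypothetical: it assumes enums are declared with the JSON schema `"enum"` keyword, that the pattern is applied with `re.search` on the flattened path, and it adds a `prefix` argument for the recursion. Note that the default pattern “a^” can never match any string, so nothing is ignored by default.

```python
import re


def get_enums_sketch(schema, separator=".", ignore_pattern="a^", prefix=None):
    """Map flattened enum column names to their possible values (illustrative)."""
    enums = {}
    for key, sub_schema in schema.get("properties", {}).items():
        path = key if prefix is None else f"{prefix}{separator}{key}"
        if "enum" in sub_schema:
            # Keep the column unless its flattened path matches the pattern.
            if re.search(ignore_pattern, path) is None:
                enums[path] = set(sub_schema["enum"])
        elif sub_schema.get("type") == "object":
            # Recurse into nested objects, carrying the current path.
            enums.update(
                get_enums_sketch(sub_schema, separator, ignore_pattern, path)
            )
    return enums


# Hypothetical schema used only for this example.
enum_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["cat", "dog"]},
        "attributes": {
            "type": "object",
            "properties": {"color": {"type": "string", "enum": ["red", "blue"]}},
        },
    },
}
all_enums = get_enums_sketch(enum_schema)
without_color = get_enums_sketch(enum_schema, ignore_pattern="color")
```

Each entry of the result, such as `"category": {"cat", "dog"}`, describes one column that could be expanded into one boolean column per possible value.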

get_remapping_dict_from_names(names: frozenset[str] | tuple[str, ...], separator: str = '.') -> dict[str, list[str]]

From a set of names, get the expected nested dictionary shape, assuming that a key with two names separated with the given separator means a nested dictionary shape.

For example, “a.b” means the output shape is of the form {a: {b: value}}.

Note

For the LRU cache to be used, the given names must be hashable, i.e. either a tuple or a frozenset.

Parameters:
  • names – Set of names to parse the underlying structure from.

  • separator – Character used to separate name in flattened key. Defaults to “.”.

Returns:

Nested remapping dictionary with values set to flattened dictionary key to take values from.
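The parsing of names into a nested shape, together with the hashability requirement, can be sketched like this. `remapping_from_names_sketch` is a hypothetical re-implementation that assumes no name is both a leaf and a parent of another name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def remapping_from_names_sketch(names, separator="."):
    """Build a nested dict whose leaves are the flattened names (illustrative)."""
    tree = {}
    for name in sorted(names):
        *parents, leaf = name.split(separator)
        node = tree
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        # Assumes `leaf` is not also used as a parent level elsewhere.
        node[leaf] = name
    return tree


# A tuple is hashable, so the cached function accepts it.
name_tree = remapping_from_names_sketch(("a.b", "a.c", "d"))

# A list is not hashable, so the cache lookup fails before the body runs.
try:
    remapping_from_names_sketch(["a.b", "d"])
    list_rejected = False
except TypeError:
    list_rejected = True
```

Because the cached function returns the same dictionary object on repeated calls, callers of such a sketch should treat the result as read-only.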

get_remapping_dict_from_schema(schema: dict, separator: str = '.', prefix: str | None = None) -> dict

From a particular schema, get a nested dictionary that mirrors the format expected by the schema.

Each value of that dictionary is the name of the column to take the value from in the flattened DataFrame.

Note

This function is meant to be called recursively, hence the prefix option.

Parameters:
  • schema – JSON schema describing the expected output format.

  • separator – Character used to separate name in flattened key. Defaults to “.”.

  • prefix – Prefix to apply to column names in output dictionary values. Defaults to None.

Returns:

Nested dictionary following format described in schema, and providing mapping for nested DataFrames with flattened column names.
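The relation between the nested schema and the flattened column names can be sketched as below. `remapping_from_schema_sketch` is a hypothetical stand-in that assumes nested objects are declared with `"type": "object"` and a `"properties"` mapping.

```python
def remapping_from_schema_sketch(schema, separator=".", prefix=None):
    """Mirror the schema's nesting, with flattened column names as leaves."""
    mapping = {}
    for key, sub_schema in schema.get("properties", {}).items():
        column = key if prefix is None else f"{prefix}{separator}{key}"
        if sub_schema.get("type") == "object" and "properties" in sub_schema:
            # Keep the nesting, and pass the path down as the new prefix.
            mapping[key] = remapping_from_schema_sketch(sub_schema, separator, column)
        else:
            mapping[key] = column
    return mapping


# Hypothetical schema used only for this example.
nested_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "bbox": {
            "type": "object",
            "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
        },
    },
}
schema_tree = remapping_from_schema_sketch(nested_schema)
```

For this schema, `schema_tree` is `{"id": "id", "bbox": {"x": "bbox.x", "y": "bbox.y"}}`: the keys follow the schema, and the leaves name the flattened columns.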

load_json_schema(schema_path: str | Path) -> dict

Load a JSON schema file, either from a URL or from a file path.

If no schema path or URL is given, an example following coco is loaded.

Parameters:

schema_path – Name of internal schema, or path to custom schema.

Raises:

KeyError – Raised when a string is given but no corresponding JSON file is found in the schemas folder.

Returns:

Loaded schema dictionary

remap_dict(flattened_dict: dict, mapping_tree: dict | None = None) -> dict

Using a mapping tree, convert a flattened dict, possibly taken from a DataFrame, into a nested dictionary.

Parameters:
  • flattened_dict – Dictionary without sub-dictionaries, easily readable by pandas.

  • mapping_tree – Nested dictionary following the expected output shape. Each value is the key name in the flattened dictionary to take the value from. If set to None, the tree is deduced from the key names and the separator character “.”. Defaults to None.

Returns:

Remapped nested dictionary
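The remapping step itself can be sketched with a short recursive function. `remap_dict_sketch` is hypothetical and requires an explicit mapping tree; skipping keys that are absent from the flattened dict is an assumption of this sketch, not documented behaviour.

```python
def remap_dict_sketch(flattened_dict, mapping_tree):
    """Rebuild a nested dict by following the mapping tree (illustrative)."""
    nested = {}
    for key, target in mapping_tree.items():
        if isinstance(target, dict):
            # Inner node: rebuild the sub-dictionary recursively.
            nested[key] = remap_dict_sketch(flattened_dict, target)
        elif target in flattened_dict:
            # Leaf: `target` is the flattened key holding the value.
            nested[key] = flattened_dict[target]
    return nested


# Hypothetical row, e.g. one record taken from a flattened DataFrame.
flat_row = {"id": 7, "bbox.x": 1.5, "bbox.y": 2.0}
tree = {"id": "id", "bbox": {"x": "bbox.x", "y": "bbox.y"}}
nested_row = remap_dict_sketch(flat_row, tree)
```

Here `nested_row` is `{"id": 7, "bbox": {"x": 1.5, "y": 2.0}}`, i.e. the inverse of the flattening performed by pandas.json_normalize() for this record.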

Modules

schema_util_functions

Set of utility functions to use JSON schemas for loading caipy JSON files.