doc_utils
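Module Attributes
random_attribute_column_type – The random attribute columns type is a way to design a column with random attributes.
Functions
construct_attribute_column – Generate a column with lists of elements taken from a finite pool. The generated sequence of lists will be in the form of a numpy array, which will become a column in a DataFrame.
dummy_dataset – Generate a dummy dataset for demonstration purposes.
set_attribute_columns_labels – From a specification given according to the random_attribute_column_type type, add attribute columns to the given dataframe and return the names of the added columns.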
- construct_attribute_column(numpy_generator: Generator, n_rows: int, labels: Sequence[str], probs: Sequence[float] | None = None, is_list_column: bool = True) → Categorical | list[list[str]]
Generate a column with lists of elements taken from a finite pool. The generated sequence of lists will be in the form of a numpy array, which will become a column in a DataFrame.
- Parameters:
numpy_generator – numpy random Generator object used to generate random integers
n_rows – number of rows of generated numpy array
labels – label strings to use for the attributes
probs – sequence of probabilities used to construct each row. If set to None, default probabilities are used: for attribute lists, each probability will be 0.5, and for simple attributes, probabilities will be evenly distributed. Defaults to None.
is_list_column – if set to True, will construct a column with lists of attributes, each list being a subset of the set of labels. Otherwise, will construct a simple attribute column, where each row is a single label taken from labels according to the probability distribution given by probs. Defaults to True.
- Returns:
list of lists (or, for a simple attribute column, a Categorical) that will be incorporated in a dataframe.
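The two modes described above can be illustrated with a minimal sketch. This is not the library's implementation: `sketch_attribute_column` is a hypothetical stand-in, and it assumes that in list mode each label is kept independently with its own probability, while in simple mode one label is drawn per row.

```python
import numpy as np
import pandas as pd


def sketch_attribute_column(rng, n_rows, labels, probs=None, is_list_column=True):
    """Hypothetical stand-in illustrating the documented behaviour."""
    labels = list(labels)
    if is_list_column:
        # Assumption: each label is kept independently, with probability 0.5 by default.
        p = np.full(len(labels), 0.5) if probs is None else np.asarray(probs, dtype=float)
        mask = rng.random((n_rows, len(labels))) < p
        return [[label for label, keep in zip(labels, row) if keep] for row in mask]
    # Simple attribute: one label per row, uniformly distributed by default.
    p = None if probs is None else np.asarray(probs, dtype=float)
    codes = rng.choice(len(labels), size=n_rows, p=p)
    return pd.Categorical.from_codes(codes, categories=labels)


rng = np.random.default_rng(0)
list_col = sketch_attribute_column(rng, 4, ["a", "b", "c"])
simple_col = sketch_attribute_column(
    rng, 4, ["a", "b", "c"], probs=[0.1, 0.1, 0.8], is_list_column=False
)
```

In list mode the probabilities do not need to add up to 1, since each one is an independent inclusion probability; in simple mode they must, because exactly one label is drawn per row.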
- dummy_dataset(n_imgs: int = 2, n_annot: int = 2, n_labels: int = 3, split_names: None | str | Sequence[str] = ('train', 'val', 'eval'), split_shares: Sequence[float] = (0.8, 0.1, 0.1), n_list_columns_images: int | Sequence[str] | Sequence[int | Sequence[float] | Sequence[str] | dict[str, float]] | dict[str, int | Sequence[float] | Sequence[str] | dict[str, float]] = 0, n_list_columns_annotations: int | Sequence[str] | Sequence[int | Sequence[float] | Sequence[str] | dict[str, float]] | dict[str, int | Sequence[float] | Sequence[str] | dict[str, float]] = 0, n_attribute_columns_images: int | Sequence[str] | Sequence[int | Sequence[float] | Sequence[str] | dict[str, float]] | dict[str, int | Sequence[float] | Sequence[str] | dict[str, float]] = 0, n_attributes_columns_annotations: int | Sequence[str] | Sequence[int | Sequence[float] | Sequence[str] | dict[str, float]] | dict[str, int | Sequence[float] | Sequence[str] | dict[str, float]] = 0, booleanize: Literal['all', 'random', 'none'] = 'none', keypoints_share: float = 0, add_confidence: bool = False, generate_real_images: bool = False, seed: int = 0, **existing_elements) → Dataset
Generate a dummy dataset for demonstration purposes.
It can also be used for tests.
- Parameters:
n_imgs – number of images in the fake dataset
n_annot – number of annotations
n_labels – length of the label map
split_names – sequence containing the names of the splits to apply to the dataset, as a column of the images dataframe. If set to None, no “split” column will be added to the images dataframe. If empty, will assume all splits are None. If not empty, and with 2 elements or more, must be the same size as split_shares. Defaults to ("train", "val", "eval").
split_shares – sequence containing the share of each split whose name was given in split_names. The ith element in split_shares represents the share (written as a float between 0 and 1) of the dataset that will be assigned to the ith split. If split_names is empty or has a length of 1, it will be ignored. Otherwise, its length must match the length of split_names, and its values must add up to 1. Defaults to (0.8, 0.1, 0.1).
n_list_columns_images – definition of the attribute list columns for images. A list column cell contains a subset of a larger set of possible attributes, fixed for the whole column, in the form of a list or a set. These columns are designed to be booleanized and are created with the function construct_list_column(). See random_attribute_column_type for an in-depth explanation of the syntax. Defaults to 0.
n_list_columns_annotations – number of list columns to add to the annotations dataframe. A list column cell contains a subset of a larger set of possible attributes, fixed for the whole column, in the form of a list or a set. These columns are designed to be booleanized and are created with the function construct_list_column(). See random_attribute_column_type for an in-depth explanation of the syntax. Defaults to 0.
n_attribute_columns_images – number of attribute columns to add to the images dataframe. An attribute column cell contains one element from a set fixed for the whole column. These columns are created with the function construct_list_column(). See random_attribute_column_type for an in-depth explanation of the syntax. Defaults to 0.
n_attributes_columns_annotations – number of attribute columns to add to the annotations dataframe. An attribute column cell contains one element from a set fixed for the whole column. These columns are created with the function construct_list_column(). See random_attribute_column_type for an in-depth explanation of the syntax. Defaults to 0.
booleanize –
how to booleanize the list columns. Can be “all”, “random” or “none”. Defaults to “none”.
“all” means all the list columns will be converted to multiple boolean columns.
“none” means the list columns will be left unchanged.
“random” means a random number of list columns will be booleanized. Both the number of booleanized columns and the choice of which columns to booleanize are random.
keypoints_share – Share of bounding boxes that are keypoints, i.e. with a height and width of 0. Set it to 1 to only have keypoints, and to 0 to have no keypoints. Defaults to 0.
add_confidence – If set to True, will add a “confidence” column to annotations with random values between 0 and 1. Use this option to generate random predictions, to be used in e.g. an evaluator. Defaults to False.
generate_real_images – if set to True, will generate random images and save them in the /tmp/ folder under a random file name. Otherwise, will just generate random file paths to images without creating any files. Defaults to False.
seed – seed number for the generation. For a given seed number, the same dataset will always be created. Defaults to 0.
**existing_elements – optional existing dataset elements that you do not want to be randomly generated.
- Returns:
Dummy generated dataset
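The constraints on split_names and split_shares can be sketched as follows. This is only an illustration under assumptions: `sketch_assign_splits` is a hypothetical helper, and drawing one split name per image with `Generator.choice` is an assumed mechanism, not the confirmed one.

```python
import numpy as np


def sketch_assign_splits(rng, n_imgs, split_names, split_shares):
    """Hypothetical helper: draw one split name per image."""
    if len(split_names) != len(split_shares):
        raise ValueError("split_names and split_shares must have the same length")
    if not np.isclose(sum(split_shares), 1.0):
        raise ValueError("split_shares must add up to 1")
    # Assumption: each image is assigned independently, so the realized
    # shares only approximate the requested ones on small datasets.
    return rng.choice(list(split_names), size=n_imgs, p=list(split_shares))


rng = np.random.default_rng(0)
splits = sketch_assign_splits(rng, 10, ("foo", "bar"), (0.5, 0.5))
```

With a random draw per image, a small dataset will rarely match the requested shares exactly.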
Example
>>> dummy_dataset()
Dataset object containing 2 images and 2 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height      relative_path   type  split
id
0     342     136       help/me.jpeg  .jpeg  train
1     377     167  whatever/wait.png   .png  train
Annotations :
    image_id category_str  category_id  ...  box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...  73.932999   71.552480   42.673983
1          0          why           19  ...   4.567638  248.551257  122.602211

[2 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
Change the seed option to get another random dataset following the same rules
>>> dummy_dataset(seed=1)
Dataset object containing 2 images and 2 objects
Name :
    shake_effort_many
Images root :
    care/suggest
Images :
    width  height        relative_path  type  split
id
0     955     229  determine/story.jpg  .jpg  train
1     131     840       air/method.bmp  .bmp  train
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0          1       listen           14  ...  276.974642    9.718823  184.684056
1          0        reach           22  ...    6.311037  123.141689  174.239136

[2 rows x 8 columns]
Label map :
{14: 'listen', 15: 'marriage', 22: 'reach'}
Use the split_shares and split_names options to set split values. Use the keypoints_share option to set the share of bounding boxes with a size of 0.
>>> dataset = dummy_dataset(
...     10,
...     100,
...     split_shares=(0.5, 0.5),
...     split_names=("foo", "bar"),
...     keypoints_share=0.3,
...     add_confidence=True,
... )
>>> dataset
Dataset object containing 10 images and 100 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height           relative_path   type split
id
0     342     645            help/me.jpeg  .jpeg   foo
1     377     973       whatever/wait.png   .png   foo
2     136     756        chair/mother.gif   .gif   bar
3     167     669  someone/challenge.jpeg  .jpeg   foo
4     114     589  successful/present.bmp   .bmp   bar
5     257     603           no/where.jpeg  .jpeg   foo
6     831     941          play/take.tiff  .tiff   foo
7     684     349           bit/force.gif   .gif   bar
8     921     834           way/back.tiff  .tiff   bar
9     553     703      marriage/give.tiff  .tiff   foo
Annotations :
    image_id category_str  category_id  ...   box_width  box_height  confidence
id                                      ...
0          0    interview           25  ...   11.569934  591.860047    0.136767
1          3         step           15  ...   70.680613  101.235900    0.663684
2          8    interview           25  ...    0.000000    0.000000    0.749956
3          5          why           19  ...   99.047865  266.499060    0.163943
4          0          why           19  ...   69.419403   61.451991    0.689302
..       ...          ...          ...  ...         ...         ...         ...
95         7         step           15  ...  518.765436   55.277118    0.942361
96         0         step           15  ...    0.000000    0.000000    0.802246
97         5    interview           25  ...    0.000000    0.000000    0.122368
98         4          why           19  ...   89.054816  254.947600    0.124429
99         9          why           19  ...  181.630916   86.810354    0.616242

[100 rows x 9 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
>>> (dataset.annotations["box_width"] > 0).value_counts() / dataset.len_annot()
box_width
True     0.69
False    0.31
Name: count, dtype: float64
Add list columns, that can be booleanized later
>>> dummy_dataset(n_list_columns_images=1, n_list_columns_annotations=1)
Dataset object containing 2 images and 2 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height  ...  split                            discover
id                 ...
0     342     136  ...  train                  [chair, challenge]
1     377     167  ...  train  [someone, beyond, present, enough]

[2 rows x 6 columns]
Annotations :
    image_id category_str  ...  box_height                                  where
id                         ...
0          0         step  ...   42.673983         [take, play, week, force, bit]
1          0          why  ...  122.602211  [no, season, take, play, choice, bit]

[2 rows x 9 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
Or booleanize them right away
>>> dummy_dataset(
...     n_list_columns_images=1, n_list_columns_annotations=1, booleanize="all"
... )
Dataset object containing 2 images and 2 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height  ...  discover.present  discover.someone
id                 ...
0     342     136  ...             False             False
1     377     167  ...              True              True

[2 rows x 11 columns]
Annotations :
    image_id category_str  category_id  ...  where.season  where.take  where.week
id                                      ...
0          0         step           15  ...         False        True        True
1          0          why           19  ...          True        True       False

[2 rows x 16 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
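To make the booleanization step concrete, here is a minimal pandas sketch of turning one list column into several boolean columns named `column.label`, as in the output above. `sketch_booleanize` is a hypothetical helper written for illustration; it is not the function the library uses.

```python
import pandas as pd


def sketch_booleanize(df, column, sep="."):
    """Hypothetical helper: replace a list column by one boolean column per label."""
    # Collect every label that appears in at least one cell.
    labels = sorted({label for cell in df[column] for label in cell})
    flags = {
        f"{column}{sep}{label}": df[column].apply(lambda cell, label=label: label in cell)
        for label in labels
    }
    return pd.concat([df.drop(columns=column), pd.DataFrame(flags, index=df.index)], axis=1)


images = pd.DataFrame({"discover": [["chair", "challenge"], ["someone", "present"]]})
booleanized = sketch_booleanize(images, "discover")
```

Note that this sketch only creates columns for labels that actually occur in the data, whereas a generator that knows the full label pool could create a column for every possible label.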
Add attribute columns which then are transformed into categorical columns.
>>> example = dummy_dataset(
...     n_attribute_columns_images={"a": 2, "b": 3},
...     n_list_columns_annotations=2,
... )
>>> example
Dataset object containing 2 images and 2 objects
Name :
    inside_else_memory
Images root :
    such/serious
Images :
    width  height      relative_path   type  split     a      b
id
0     342     136       help/me.jpeg  .jpeg  train  play  force
1     377     167  whatever/wait.png   .png  train  take  force
Annotations :
    image_id  ...         where
id            ...
0          0  ...            []
1          0  ...  [no, season]

[2 rows x 10 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
>>> example.images["b"]
id
0    force
1    force
Name: b, dtype: category
Categories (3, object): ['week', 'choice', 'force']
Instead of integers, use lists of probabilities to steer the distribution of attributes.
>>> example = dummy_dataset(
...     200, n_attribute_columns_images=[[0.1, 0.1, 0.8]], seed=1
... )
>>> example
Dataset object containing 200 images and 2 objects
Name :
    shake_effort_many
Images root :
    care/suggest
Images :
     width  height         relative_path   type  split could
id
0      955     488   determine/story.jpg   .jpg  train  note
1      131     895        air/method.bmp   .bmp  train  firm
2      229     880    political/lead.jpg   .jpg  train  firm
3      840     384         like/safe.bmp   .bmp  train  note
4      953     668       suffer/set.jpeg  .jpeg  train  note
..     ...     ...                   ...    ...    ...   ...
195    122     437     state/almost.tiff  .tiff  train  firm
196    752     300      weight/tend.jpeg  .jpeg  train  note
197    554     228   remember/summer.png   .png  train  note
198    688     605        yet/though.png   .png   eval  note
199    243     227    describe/road.tiff  .tiff  train  note

[200 rows x 6 columns]
Annotations :
    image_id category_str  category_id  ...   box_y_min   box_width  box_height
id                                      ...
0         77        reach           22  ...   45.427512   40.116677  318.073851
1        137     marriage           15  ...  202.481384  435.389400  475.375279

[2 rows x 8 columns]
Label map :
{14: 'listen', 15: 'marriage', 22: 'reach'}
>>> example.images["could"].value_counts() / len(example)
could
note    0.82
firm    0.09
lead    0.09
Name: count, dtype: float64
Finally, you can generate fake images as well if you want to test the io functions that need images to be valid.
>>> dataset = dummy_dataset(generate_real_images=True)
>>> dataset
Dataset object containing 2 images and 2 objects
Name :
    inside_else_memory
Images root :
    /tmp/such/serious
Images :
    width  height      relative_path   type  split
id
0     342     136       help/me.jpeg  .jpeg  train
1     377     167  whatever/wait.png   .png  train
Annotations :
    image_id category_str  category_id  ...  box_y_min   box_width  box_height
id                                      ...
0          0         step           15  ...  73.932999   71.552480   42.673983
1          0          why           19  ...   4.567638  248.551257  122.602211

[2 rows x 8 columns]
Label map :
{15: 'step', 19: 'why', 25: 'interview'}
>>> dataset.check()
Checking Image and annotations Ids ...
Checking Bounding boxes ..
Checking label map ...
Checking images are valid ...
- random_attribute_column_type = int | collections.abc.Sequence[str] | collections.abc.Sequence[int | collections.abc.Sequence[float] | collections.abc.Sequence[str] | dict[str, float]] | dict[str, int | collections.abc.Sequence[float] | collections.abc.Sequence[str] | dict[str, float]]
The random attribute columns type is a way to design a column with random attributes.
It will create \(N\) columns, the \(i\) th column having \(M_i\) labels distributed according to the probabilities \((p_i)_j\) (a vector of length \(M_i\), with values \(p_{i,j}\) between 0 and 1).
If the column is a non-list attribute column, each vector \((p_i)_j\) must add up to 1. Otherwise, each probability \(p_{i,j}\) is the probability that the \(j\) th label of the \(i\) th column is in the attribute list of a given cell.
Depending on the input type, the values \(N\), \(M_i\), \((p_i)_j\) and the names will be constructed differently.
If not specified, column headers and labels are randomly generated with Faker.unique.word().
If not specified, the probabilities \(p_{i,j}\) will be either uniform for non-list attribute columns, or all set to 0.5 for attribute list columns.
The input can be one of the following:
An integer: \(N\) is the given integer, \(M_i\) are random integers between 2 and 10
A sequence of integers: \(N\) is the length of the sequence, \(M_i\) are the integers of that sequence.
A sequence of str: \(N\) is the length of the sequence, the column headers are the sequence elements, and \(M_i\) are random integers between 2 and 10.
A sequence of sequences of float: \(N\) is the length of the sequence, \(M_i\) is the length of each \(i\) th sequence, and \((p_i)_j\) is the \(i\) th sequence of floats.
A dictionary of integers: \(N\) is the length of the dictionary. The column headers are the dictionary keys, and \(M_i\) are the integer values.
A dictionary of float sequences: \(N\) is the length of the dictionary. The column headers are the dictionary keys, \(M_i\) is the length of the \(i\) th float sequence, and \((p_i)_j\) is the \(i\) th float sequence.
A dictionary of string sequences: \(N\) is the length of the dictionary. The column headers are the dictionary keys, \(M_i\) is the length of the \(i\) th string sequence, and the \(j\) th label of the \(i\) th column is the \(j\) th element of the \(i\) th sequence.
A dictionary of float dictionaries: \(N\) is the length of the root dictionary. The column headers are the dictionary keys, \(M_i\) is the length of the \(i\) th sub-dictionary, the \(j\) th label of the \(i\) th column is the \(j\) th key of the \(i\) th sub-dictionary, and the probability \(p_{i,j}\) is the corresponding sub-dictionary value.
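The input forms listed above can be summarized with a small sketch that reports, for each column, which of the header, the number of labels \(M_i\), the labels and the probabilities are given by the specification. `sketch_describe_spec` and `_describe_one` are hypothetical helpers written for illustration only; `None` stands for "not specified, to be generated randomly or set to the default".

```python
from collections.abc import Mapping, Sequence


def _describe_one(value):
    """Return (n_labels, labels, probs) for a single column specification."""
    if isinstance(value, int):  # only the number of labels is given
        return value, None, None
    if isinstance(value, Mapping):  # {label: probability}
        return len(value), list(value.keys()), list(value.values())
    if isinstance(value, Sequence) and not isinstance(value, str):
        if all(isinstance(v, str) for v in value):  # explicit labels
            return len(value), list(value), None
        return len(value), None, [float(v) for v in value]  # probabilities
    raise TypeError(f"unsupported column specification: {value!r}")


def sketch_describe_spec(spec):
    """Hypothetical helper: list what each of the N columns specifies."""
    if isinstance(spec, int):  # N columns, nothing else specified
        items = [(None, None)] * spec
    elif isinstance(spec, Mapping):  # keys are the column headers
        items = list(spec.items())
    elif all(isinstance(v, str) for v in spec):  # sequence of column headers
        items = [(name, None) for name in spec]
    else:  # sequence of per-column specifications
        items = [(None, value) for value in spec]
    described = []
    for name, value in items:
        n_labels, labels, probs = (None, None, None) if value is None else _describe_one(value)
        described.append({"name": name, "n_labels": n_labels, "labels": labels, "probs": probs})
    return described


by_count = sketch_describe_spec({"a": 2, "b": 3})
by_probs = sketch_describe_spec([[0.1, 0.1, 0.8]])
by_labels = sketch_describe_spec({"c": {"x": 0.3, "y": 0.7}})
```

Here `by_count` describes two named columns with 2 and 3 labels, `by_probs` one unnamed column with three probabilities, and `by_labels` one named column whose labels and probabilities are both fixed.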
- set_attribute_columns_labels(input_dataframe: DataFrame, columns_specs: int | Sequence[str] | Sequence[int | Sequence[float] | Sequence[str] | dict[str, float]] | dict[str, int | Sequence[float] | Sequence[str] | dict[str, float]], numpy_generator: Generator, fake_generator: Faker, is_list: bool = False, min_labels: int = 2, max_labels: int = 10) → list[str]
From a specification given according to the random_attribute_column_type type, add attribute columns to the given dataframe and return the names of the added columns.
Depending on is_list, each column will be either an attribute column, where each row has a single value taken from a fixed set of possible string labels, or an attribute list column, where each row has a subset of values from a fixed superset of possible string labels.
- Parameters:
input_dataframe – DataFrame which will be assigned new columns
columns_specs – specification of columns, according to the aforementioned syntax
numpy_generator – random generator for numpy arrays
fake_generator – random generator for random unique words
is_list – if set to True, will construct list attribute columns. Otherwise, will construct simple attribute columns. Defaults to False.
min_labels – When the number of labels is not specified, minimum random number of labels to generate for the current column. Defaults to 2.
max_labels – When the number of labels is not specified, maximum random number of labels to generate for the current column. Defaults to 10.
- Returns:
The headers of the added columns. Useful to keep track of list attribute columns in order to booleanize them.