Pipeline Submodule

This module contains the pipeline definition class, as well as classes used to structure and manage the data inside the pipeline and the output of the pipeline.

class accvlab.dali_pipeline_framework.pipeline.PipelineDefinition(data_loading_callable_iterable, preprocess_functors=None, check_data_format=True, use_parallel_external_source=True, prefetch_queue_depth=2, print_sample_data_group_format=False)[source]

Bases: object

Definition for the data loading and pre-processing pipeline.

Configure with a data-loading functor and an ordered list of processing steps. Exposes utilities to retrieve the input data format (blueprint), infer the output data format by applying each step’s format-checking logic, and build a DALI pipeline that combines the data loading functor and the processing steps.

Parameters:
  • data_loading_callable_iterable (Union[CallableBase, IterableBase]) – Callable or iterable performing the loading of the data.

  • preprocess_functors (Optional[Sequence[Optional[PipelineStepBase]]], default: None) – Functors for the individual processing steps, executed in sequence on the input data. May contain None elements, which are ignored. Optional; if not set, the loaded data is returned as is.

  • use_parallel_external_source (bool, default: True) – Whether to use the parallel external source.

  • prefetch_queue_depth (int, default: 2) – The depth of the prefetch queue. Only used if use_parallel_external_source is True.

  • print_sample_data_group_format (bool, default: False) – Whether to print the sample data group formats after each processing step during the setup of the pipeline (e.g. for debugging purposes).

property input_data_structure: SampleDataGroup

Get the input data format (blueprint).

The input blueprint is provided by the data-loading functor passed at construction time.

Returns:

SampleDataGroup blueprint object describing the input data format (no actual data).

check_and_get_output_data_structure()[source]

Infer and return the output data format (blueprint).

Starting from the input blueprint provided by the loading functor, each processing step validates compatibility and transforms the blueprint (e.g., adding fields or changing types). Steps are applied in sequence to obtain the final output blueprint. If an incompatibility is detected, an exception is raised.

Returns:

SampleDataGroup blueprint object describing the output data format (no actual data).

Raises:

ValueError – If the data loading functor is not compatible with the first processing step.
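The format-inference behavior described above can be pictured with a small stand-alone sketch. Note this is an illustrative mimic only: plain dicts stand in for SampleDataGroup blueprints, and the step functions shown here are hypothetical, not part of the framework.

```python
# Illustrative mimic of check_and_get_output_data_structure(): the input
# blueprint is folded through each step's (compatibility check, transform)
# pair; an incompatibility raises ValueError. Plain dicts stand in for
# SampleDataGroup blueprints.

def infer_output_blueprint(input_blueprint, steps):
    """Apply each step's format check and blueprint transform in sequence."""
    blueprint = dict(input_blueprint)
    for check, transform in steps:
        if not check(blueprint):
            raise ValueError(f"step is not compatible with blueprint {blueprint}")
        blueprint = transform(blueprint)
    return blueprint

# Hypothetical step: requires an "image" field and adds a "resized_image" field.
resize_step = (
    lambda bp: "image" in bp,
    lambda bp: {**bp, "resized_image": "float32"},
)

out = infer_output_blueprint({"image": "uint8", "label": "int64"}, [resize_step])
# out now also contains the "resized_image" field added by the step.
```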

get_dali_pipeline(*args, **kwargs)[source]

Get the DALI pipeline as configured.

Note

This calls a function decorated with @pipeline_def used by DALI to create a pipeline object. The resulting pipeline object is returned. For more information on the possible arguments (i.e. *args and **kwargs in this function), see the documentation of the nvidia.dali.pipeline.experimental.pipeline_def() decorator.

Parameters:
  • *args – Arguments for the DALI pipeline.

  • **kwargs – Keyword arguments for the DALI pipeline.

Returns:

Pipeline – The DALI pipeline as configured.

class accvlab.dali_pipeline_framework.pipeline.DALIStructuredOutputIterator(num_batches_in_epoch, pipeline, sample_data_structure_blueprint, contained_dataset=None, dali_generic_iterator_class=<class 'nvidia.dali.plugin.pytorch.DALIGenericIterator'>, convert_sample_data_group_to_dict=True, post_process_func=None)[source]

Bases: object

Structured access to DALI pipeline output (as a nested dict or SampleDataGroup).

Designed as a drop-in replacement for a torch.utils.data.DataLoader. Optionally applies a user-defined lightweight post-processing function (e.g., conversions to types not supported by DALI).

Parameters:
  • num_batches_in_epoch (int) – Number of batches in an epoch. Note that this value is only used as the output of len(obj). It is not used internally and is added here to ensure drop-in compatibility with torch.utils.data.DataLoader.

  • pipeline (Pipeline) – DALI pipeline object.

  • sample_data_structure_blueprint (SampleDataGroup) – Blueprint for the output data structure.

  • contained_dataset (Optional[Any], default: None) – Dataset object which will be exposed via dataset (mirrors PyTorch DataLoader behavior). Can be a PyTorch Dataset or any other compatible object. Note that this object is not used internally. Also see dataset.

  • dali_generic_iterator_class (Union[Type[DALIGenericIterator], Any], default: <class 'nvidia.dali.plugin.pytorch.DALIGenericIterator'>) – Class for the internal DALI generic iterator. Follows the interface of nvidia.dali.plugin.pytorch.DALIGenericIterator but may emit tensors for other frameworks. Defaults to the PyTorch DALIGenericIterator.

  • convert_sample_data_group_to_dict (bool, default: True) – If True, convert output SampleDataGroup to a nested dict. Ensures drop-in compatibility with DataLoader when no post-processing function is provided.

  • post_process_func (Optional[Callable[[Union[SampleDataGroup, dict]], Union[SampleDataGroup, dict]]], default: None) – Optional post-processing function for the output. This can e.g. be used to convert data to types not supported by DALI or to perform other lightweight steps. The input is a SampleDataGroup object if convert_sample_data_group_to_dict == False and a dict otherwise. Note that this function is executed in the thread accessing the data (typically the thread performing the training); therefore, it should be kept lightweight to avoid performance penalties.

class SimpleIterator(obj)[source]

Bases: Iterator

Iterator, which can e.g. be used as a drop-in replacement for a PyTorch DataLoader iterator.

Note that only a single iterator should be used at any point in time. If multiple iterators are used, they share state: getting a new iterator will reset all other iterators, and calling next for one iterator will advance all iterators by one element.

__next__()[source]

Get the next element.

__iter__()[source]

Get the iterator (i.e. return self without re-starting the iteration).

reset()[source]

Reset the iterator.

Will call DALIStructuredOutputIterator.reset() for the parent object.

__len__()[source]

Get the number of elements (from parent object).

__iter__()[source]

Get an iterator.

Note that only a single iterator should be used at any point in time. If multiple iterators are used, they share state: getting a new iterator will reset all other iterators, and calling next for one iterator will advance all iterators by one element.

Return type:

SimpleIterator
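The shared-state caveat can be demonstrated with a minimal stand-alone mimic. The class names here are hypothetical stand-ins for DALIStructuredOutputIterator and its SimpleIterator; the real classes wrap a DALI pipeline instead of a list.

```python
# Illustrative mimic: all iterator objects delegate to one shared position
# held by the parent, so advancing one iterator advances them all.

class SharedStateParent:
    """Stand-in for the parent object that owns the iteration state."""
    def __init__(self, data):
        self._data = data
        self._pos = 0

    def next_item(self):
        if self._pos >= len(self._data):
            raise StopIteration
        item = self._data[self._pos]
        self._pos += 1
        return item

    def reset(self):
        self._pos = 0

class SharedIterator:
    """Stand-in for SimpleIterator: holds no state of its own."""
    def __init__(self, parent):
        self._parent = parent
    def __iter__(self):
        return self
    def __next__(self):
        return self._parent.next_item()

parent = SharedStateParent(["a", "b", "c"])
it1, it2 = SharedIterator(parent), SharedIterator(parent)
first = next(it1)   # advances the shared state ...
second = next(it2)  # ... so it2 continues where it1 left off
```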

reset()[source]

Reset the current iteration progress (start over from the beginning).

Note that this will reset iterators of the object as well.

property sample_data_structure_blueprint: SampleDataGroup

Get the output data structure blueprint.

The blueprint is a SampleDataGroup representing the same nested data format as the output, without the actual data. See SampleDataGroup for details.

property internal_iterator: DALIGenericIterator | Any

Get the actual DALI iterator used to access the output data internally.

Note that by default, this is a nvidia.dali.plugin.pytorch.DALIGenericIterator. However, this can be changed in the constructor, in which case the returned object will be of the type specified there.

property dataset: Any

Get the dataset object.

This is the dataset object set in the constructor (if any). If not set, this will return the object for which it is called. This property is used for compatibility with torch.utils.data.DataLoader.

__len__()[source]

Number of available batches.

Important

This value is set manually in the constructor and is only reported here, as __len__ is part of the torch.utils.data.DataLoader interface. The value may not be the actual number of batches in the epoch, e.g. for non-epoch-based pipelines.

classmethod CreateAsDataLoaderObject(*args, **kwargs)[source]

class accvlab.dali_pipeline_framework.pipeline.SampleDataGroup[source]

Bases: object

Structured container for sample data. Can also be used as a blueprint to describe the data format.

Data is organized as a tree containing:

  • Data fields: Leaf nodes that hold the actual data.

  • Data group fields: Non-leaf nodes that group related items.

Example

An example for accessing the data field "bounding_boxes" inside nested data group fields "camera" and "annotations":

>>> bounding_boxes = data["camera"]["annotations"]["bounding_boxes"]

Note that accessing the data is done as for a nested dictionary. Here, the data group fields are analogous to dict objects and data fields correspond to the actual stored values at the leaves.

Capabilities (see individual method docs for details):

  • Enforce a predefined data format (field names, order, and types). Format changes need to be performed explicitly.

  • Inside the input callable/iterable and outside the DALI pipeline: Apply string-to-numeric mappings and automatic type conversions on assignment (both can be disabled).

  • Inside the pipeline: Apply automatic type checks on assignment.

  • Render the tree in a human-readable form via print(obj).

  • Flatten values to a sequence and reconstruct from a sequence (see get_data(), set_data(), and set_data_from_dali_generic_iterator_output()). This is useful when passing the data from the input callable/iterable to the pipeline, and when returning data from the pipeline, as nested data structures are not supported there. Also see DALIStructuredOutputIterator for an output iterator which re-assembles the data from the flattened output into a SampleDataGroup instance or nested dictionaries before returning it.

  • Compare formats of two instances (see type_matches()). This also ensures that the flattened data obtained from one instance can be used to fill the data of another instance.

  • Utilities that facilitate implementation of pipeline steps: find/remove all occurrences of fields with a given name, add/remove/change fields and types, etc. (e.g. see find_all_occurrences()). Note that the search is performed at DALI graph construction time, so there is no overhead during the pipeline execution.

  • Supports passing strings through the DALI pipeline and obtaining them as strings in the pipeline output. Note that strings are not supported inside the DALI pipeline. They can be accessed/assigned as strings in the input callable/iterable and outside the DALI pipeline, but appear as uint8 tensors inside the pipeline itself (alternative: use a mapping to numeric values as described above).

Usage modes:

  • Blueprint: describes the data format (fields and types) but contains no values. This allows inferring downstream formats without running data processing (e.g., to initialize a DALI iterator). When only passing of flattened data is possible, a blueprint can be filled from flattened values (see get_data(), set_data()).

  • Container: holds actual values. When accessing the data, behaves similarly to a nested dictionary. When assigning data, additional checks/conversions are potentially performed.

Important

Assigning a Field Value

Assignment means using the indexed assignment operator obj[name] = value or the method obj.set_item_in_path(path, value).

When assigning data fields, the following holds:

  • Mappings and conversions will be performed on assignment (inside the input callable/iterable and outside the DALI pipeline; if not disabled). Inside the DALI pipeline itself, no mapping or conversion is applied.

  • Inside the DALI pipeline, type checks are performed instead on assignment and an error is raised if the type is not correct.

  • Assigning strings is only supported in the input callable/iterable and outside the DALI pipeline. String fields are handled as uint8 tensors inside the DALI pipeline.

When assigning to data group fields, the following holds:

  • The assignment succeeds only if the new value’s format matches the previous format, i.e. if obj[name].type_matches(value) holds. Otherwise, a KeyError is raised. This is done to prevent changing the data format implicitly by assigning a different type.

  • If the type needs to be changed, this needs to be done explicitly first (e.g., using change_type_of_data_and_remove_data()).

Important

Getting a Field Value

Getting a field value means using the indexed access operator obj[name] or the method obj.get_item_in_path(path).

Accessing strings inside the DALI pipeline (except for the input callable/iterable) will return the underlying uint8 tensor instead. Using strings directly is only supported in the input callable/iterable and outside the DALI pipeline.

Important

Changing the Data Format

Changing the data format is always explicit. For example, adding a field and assigning values is a two-step process: create the field first, then assign data. When defining a blueprint, fields are created but left empty.

Important

Type Checking

Type checking is performed on assignment to ensure that the data type is correct (inside the DALI pipeline). This is useful when developing the pipeline/processing step, but adds some overhead. Type checking is enabled by default (see set_do_check_type()).

Note

Additional information:

  • When converting a SampleDataGroup to a string (e.g., using print(obj)), the data format as well as some details (e.g., for which fields a mapping is defined, which fields are empty, and the data types of the fields) are printed. The actual stored values are not printed. For a simpler output, see get_string_no_details().

  • When obtaining the length of a SampleDataGroup (e.g., using len(obj)), the number of direct children (data fields and data group fields) is returned.

static create_data_field_array(type, num_fields, mapping=None)[source]

Create a SampleDataGroup containing multiple data fields of the same type.

The data fields have numerical (integer) names in the range [0; num_fields - 1]. This means that the returned SampleDataGroup behaves as an array of data fields.

Parameters:
  • type (DALIDataType) – Type of (the elements of) the data fields to create. If a mapping is used, this is the type after the mapping is applied.

  • num_fields (int) – Number of fields to create

  • mapping (Optional[Dict[Optional[str], Union[int, float, number, bool]]], default: None) – Optional string-to-numeric mapping applied on assignment (see add_data_field() for details).

Returns:

SampleDataGroup – Resulting array SampleDataGroup object

static create_data_group_field_array(sample_data_group, num_fields)[source]

Create a SampleDataGroup containing multiple data group fields (themselves SampleDataGroup instances).

Note that the created data group fields will be initialized as blueprints, i.e. they will not contain any actual data even if sample_data_group does. This is done to cleanly separate this step (defining the data format) from actually filling the data.

Parameters:
  • sample_data_group (SampleDataGroup) – Blueprint representing the element format. Any actual data present in sample_data_group will be ignored; the resulting elements will be empty of data.

  • num_fields (int) – Number of fields to create

Returns:

SampleDataGroup – Resulting array SampleDataGroup object

set_apply_mapping(apply)[source]

Set whether to apply string to numeric mapping (for data fields where such a mapping is defined).

This setting will be propagated to descendants (data group fields) of the data group field for which it is called.

Note

The mapping is applied in the input callable/iterable and outside the DALI pipeline. Inside the DALI pipeline itself, the mapping is not applied. If apply mapping is set to True and an assignment is performed inside the pipeline, a warning will be issued, and the assignment will be performed without mapping (if it is already in the correct format; an error will be raised if the format is not correct).

Parameters:

apply (bool) – Whether to apply the mapping (for fields where a mapping is set).

set_do_convert(convert)[source]

Set whether to convert data in the data fields to the types set up when creating those fields.

This setting will be propagated to descendants (data group fields) of the data group field for which it is called.

Note

The conversion is applied in the input callable/iterable and outside the DALI pipeline. Inside the DALI pipeline itself, the conversion is not applied. Instead, type checks are performed (regardless of this setting).

Parameters:

convert (bool) – Whether to perform automatic type conversions (e.g., integers to floats) on assignment.

set_do_check_type(check_type)[source]

Set whether to perform type checking on assignment.

This setting will be propagated to descendants (data group fields) of the data group field for which it is called.

Note

The type checking is useful when developing the pipeline/processing step, but adds some overhead. Therefore, it is advisable to disable it in production.

Parameters:

check_type (bool) – Whether to perform type checking (in the DALI pipeline) on assignment.

get_empty_like_self()[source]

Get an object with the same structure (same nested data group fields and data fields), but no values.

Obtain a blueprint either from another blueprint or from a populated object (ignoring values and initializing all data fields as empty). This can be regarded as a deep-copy of the original object, but with the actual data removed.

Returns:

SampleDataGroup – Resulting blueprint SampleDataGroup object.

get_copy()[source]

Get a copy.

Create a copy: equivalent to get_empty_like_self() followed by filling the data from the original object. Note that for the actual data, references to the original data are used, i.e. the data itself is not deep-copied. However, the data group fields making up the data format are deep-copied.

This means that modifying the data in place will modify the data in the original. However, assigning new data to fields, adding or deleting fields, changing their type etc. will not affect the original.

Returns:

SampleDataGroup – Resulting copy

type_matches(other)[source]

Check whether two SampleDataGroup objects define the same data format (returns bool).

The following is not considered when checking for equality, as it is not part of the type described by the object:

  • The actual data stored in the data fields

  • Whether mapping and conversion should be performed

  • Whether mappings are available for the same fields and whether mappings themselves are the same

Important

Note that it is checked whether the fields appear in the same order in the two objects. This is the case if the objects are constructed from the same blueprint (or if they were constructed by adding the individual fields in the same order). This is important as it defines whether the flattened data, e.g. obtained by get_data() from one of the objects can be used to fill the data into the other one, e.g. using set_data().
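Why field order matters can be illustrated with plain dicts standing in for SampleDataGroup instances (an illustrative mimic, not the actual type_matches() implementation):

```python
# Illustrative mimic: two structures with identical fields but different
# insertion order flatten to differently ordered sequences, so values
# flattened from one cannot safely fill the other.

def flat_items(group):
    """Depth-first (name, value) pairs of all leaf fields, dotted paths."""
    out = []
    for name, value in group.items():
        if isinstance(value, dict):
            out.extend((f"{name}.{sub}", v) for sub, v in flat_items(value))
        else:
            out.append((name, value))
    return out

a = {"image": "u8", "meta": {"id": "i64", "score": "f32"}}
b = {"meta": {"id": "i64", "score": "f32"}, "image": "u8"}  # same fields, other order

same_fields = dict(flat_items(a)) == dict(flat_items(b))  # order-insensitive: equal
same_order = flat_items(a) == flat_items(b)               # order-sensitive: not equal
```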

set_item_in_path(path, value)[source]

Assign a field value at a (nested) path.

The path is a sequence of field names/keys. For example, if the path is path = ("name_1", "name_2", "name_3"), the following are equivalent:

  • obj.set_item_in_path(path, value_to_set)

  • obj["name_1"]["name_2"]["name_3"] = value_to_set

Important

See the class docstring for details on the assignment behavior.

Parameters:
  • path (Union[str, int, Tuple[Union[str, int]], List[Union[str, int]]]) – Path of the item to set.

  • value (Any) – Value to assign to the item at path.

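The path semantics can be mimicked on a plain nested dict (illustrative only; the real method operates on SampleDataGroup instances and performs additional checks and conversions on assignment):

```python
# Illustrative mimic of set_item_in_path(): walk to the parent of the last
# path element, then assign. A single name/key is also accepted as a path.

def set_item_in_path(group, path, value):
    if isinstance(path, (str, int)):  # a single name is also a valid path
        path = (path,)
    for name in path[:-1]:
        group = group[name]
    group[path[-1]] = value

data = {"camera": {"annotations": {"bounding_boxes": None}}}
set_item_in_path(data, ("camera", "annotations", "bounding_boxes"), [1, 2, 3, 4])
# Equivalent to: data["camera"]["annotations"]["bounding_boxes"] = [1, 2, 3, 4]
```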
get_item_in_path(path)[source]

Get a field value at a nested path.

The path is a sequence of field names/keys. For example, if path = ("name_1", "name_2", "name_3"), the following are equivalent:

  • value = obj.get_item_in_path(path)

  • value = obj["name_1"]["name_2"]["name_3"]

Note

Accessing strings inside the DALI pipeline (except for the input callable/iterable) will return the underlying uint8 tensor instead. Using strings directly is only supported in the input callable/iterable and outside the DALI pipeline.

Parameters:

path (Union[str, int, Tuple[Union[str, int]], List[Union[str, int]]]) – Path of the item to get.

Returns:

Any – Item at path.

get_parent_of_path(path)[source]

Get the parent of an element described in path.

The following are equivalent:
  • obj.get_parent_of_path(path)

  • obj.get_item_in_path(path[:-1])

Note

As a parent node cannot be a data field (i.e. a leaf node), the returned value is always a SampleDataGroup instance.

Parameters:

path (Union[int, str, Tuple[Union[str, int]], List[Union[str, int]]]) – Path for which to get the parent.

Returns:

SampleDataGroup – Parent of the path.

get_type_of_item_in_path(path)[source]

Get the type of the item at a nested path.

Parameters:

path (Union[Tuple[Union[str, int]], List[Union[str, int]]]) – Path to the item.

Returns:

Union[DALIDataType, type] – Data type of the field. For data group fields, SampleDataGroup. For data fields, the corresponding nvidia.dali.types.DALIDataType. If path is empty, returns self.

static path_is_single_name(path)[source]

Check if the path given is a single name.

Parameters:

path (Union[str, int, Tuple[Union[str, int]], List[Union[str, int]]]) – Path to check. Can be a single name/key or a sequence of names.

Returns:

bool – True if path is a single name/key (i.e., a string or integer, not a sequence), False otherwise.

path_exists(path)[source]

Check if a field with the given path exists.

Parameters:

path (Union[str, int, Tuple[Union[str, int]], List[Union[str, int]]]) – Path to check.

Returns:

bool – Whether field with given path exists.

path_exists_and_is_data_group_field(path)[source]

Check if a field with the given path exists and is a data group field.

Parameters:

path (Union[str, int, Tuple[Union[str, int]], List[Union[str, int]]]) – Path to check

Returns:

bool – True if the field at path exists and is a data group field, False otherwise

get_type_of_field(name)[source]

Get type of a field.

The type is either expressed as a nvidia.dali.types.DALIDataType (data fields) or SampleDataGroup (data group fields).

Parameters:

name (Union[str, int]) – Name of the field.

Returns:

Union[DALIDataType, type] – Type of the field. For string fields this returns nvidia.dali.types.DALIDataType.STRING. Note that this differs from flattened contexts (e.g., field_types_flat), where strings are represented as nvidia.dali.types.DALIDataType.UINT8. This is because the flattened data is used internally to pass data between SampleDataGroup objects in contexts where the objects themselves cannot be passed; consequently, the string data is passed as stored internally (i.e., as the underlying uint8 tensors). Here, the actual type as configured (e.g., by add_data_field()) is returned.

get_string_no_details()[source]

Get string representing the SampleDataGroup instance, omitting details.

Omits per-field details such as whether a value is set and whether a mapping is available.

Return type:

str

is_array(field=None)[source]

Check whether (self or child) object can be regarded as an array.

This is the case if all of the following hold:
  • The fields have integer (numeric) names.

  • Each element in the range [0; len(self) - 1] is present as a name.

  • The value order is such that for each element, the name increases by 1, i.e. self.contained_top_level_field_names == (0, 1, 2, 3, ...).

Parameters:

field (Union[str, int, None], default: None) – If set, perform the check for the named child. Otherwise, check self.

Returns:

bool – Whether the object can be considered an array.

is_data_field_array(field=None)[source]

Check whether (self or child) object is an array whose elements are all data fields (no data group fields).

See documentation of is_array() for conditions for a data group field to be regarded as an array.

Parameters:

field (Union[str, int, None], default: None) – If set, perform the check for the named child. Otherwise, check self.

Returns:

bool – Whether the object is an array of data fields.

is_data_group_field_array(field=None)[source]

Check whether (self or child) object is an array whose elements are all data group fields (no data fields).

See documentation of is_array() for conditions for a data group field to be regarded as an array.

Parameters:

field (Union[str, int, None], default: None) – If set, perform the check for the named child. Otherwise, check self.

Returns:

bool – Whether the object is an array of data group fields.

property contained_top_level_field_names: Tuple[str | int]

Get the names of the contained top-level fields.

The order of the fields corresponds to the order in which they were added.

Returns:

Names of contained fields.

property field_top_level_types: Tuple[DALIDataType | type]

Types of the top-level fields.

The order of the fields corresponds to the order in which they were added (and to the order of the elements returned by contained_top_level_field_names).

Field types are nvidia.dali.types.DALIDataType instances for data fields and SampleDataGroup blueprints for data group fields.

property field_names_flat: Tuple[str]

Names of contained data fields flattened (all leaf nodes, not only direct children).

Each element corresponds to a data field (leaf node). Original nesting is reflected in the names (concatenated with “.” between parent and child). Numerical names are converted to strings so that they can be used as names in other places (e.g. the DALI generic iterator); the numeric name 5, for example, becomes "[5]". Thus, for a data field at the path object["name_0"][1]["name_2"], the name in the flattened tuple of names would be "name_0.[1].name_2".

The order of the elements corresponds to the order used in get_data(), so that the names obtained here correspond to the values obtained there.

No names are added for data group fields themselves. If they contain descendants which are data fields, their name will appear in the name of the descendants (before “.”). However, if a data group field does not contain any data field descendants, it will not contribute a name to the output.

Note

The names themselves reflect the hierarchy of the data, so that the names are unique, even if there are multiple fields with the same name in the structure.
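The naming scheme can be reproduced with a small stand-alone sketch on plain nested dicts (illustrative only, not the library implementation):

```python
# Illustrative mimic of field_names_flat: nesting is joined with ".",
# and integer names are rendered as "[i]" so they remain valid string names.

def field_names_flat(group, prefix=""):
    names = []
    for name, value in group.items():
        text = f"[{name}]" if isinstance(name, int) else str(name)
        full = f"{prefix}.{text}" if prefix else text
        if isinstance(value, dict):
            names.extend(field_names_flat(value, full))  # descend into group
        else:
            names.append(full)  # leaf data field contributes one name
    return names

obj = {"name_0": {1: {"name_2": "data"}}}
names = field_names_flat(obj)  # the documented example path
```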

property field_types_flat: Tuple[DALIDataType]

Types of contained data fields flattened (all leaf nodes, not only direct children).

Each element corresponds to a leaf node.

The order of the elements corresponds to the order used in get_data(), so that the types obtained here correspond to the values obtained there.

No types are added for data group fields themselves. If they contain descendants which are data fields, the types of these descendants will be added. However, if a data group field does not contain any data field descendants, it will not contribute a type to the output.

Note

As only the leaf nodes containing data are considered, no entries directly corresponding to data group fields will be added.

String fields are represented as nvidia.dali.types.DALIDataType.UINT8, matching their in-pipeline representation. Note that this is different from e.g. get_type_of_field(), but consistent with get_data() (see get_data() for details on the rationale).

get_data(as_list_type=False)[source]

Get values of all data fields as a flattened sequence (all leaf nodes, not only direct children).

The order of the elements is the order of a depth-first traversal with the order of the children at each node corresponding to the order in which the elements were added (consistent with, e.g., contained_top_level_field_names). The order is the same as in field_names_flat and field_types_flat, so that these can be used to obtain information about the individual elements of the obtained sequence of values. Only data fields (leaf nodes that are not SampleDataGroup) contribute values. Data group fields are not included directly, but their data field descendants contribute values.

Note

The tuple returned by this function can be used directly to
  • Pass parameters from an input callable/iterable to the DALI pipeline.

  • Return the final output of the DALI pipeline.

In these cases, the returned sequence can be used to fill a SampleDataGroup blueprint object with the same format as self (using set_data() or set_data_from_dali_generic_iterator_output()).

Important

For string data fields, the values are the underlying uint8 arrays/tensors (or DataNodes), not Python str objects (both inside and outside the DALI pipeline). This method is designed to exchange data between SampleDataGroup objects and directly returns the underlying data, including encoded strings. The conversion to Python str objects is performed when the data is accessed, e.g. using the indexed access operator [] or get_item_in_path().

Parameters:

as_list_type (bool, default: False) – If True, return a list (tuple otherwise).

Returns:

Union[tuple, list] – Sequence of values of all data fields.

set_data(data)[source]

Set values of all descendant data fields from a flattened sequence.

The sequence needs to contain the data in the same order as indicated by field_names_flat. If the flat data was obtained by get_data() from a SampleDataGroup object with the same data format as self, this will always be the case. The compatibility between the object from which the flattened data was obtained and this instance can be checked with type_matches().

Important

When setting data in this way, no conversions or mappings are applied (both inside and outside the DALI pipeline). This method is designed to exchange data between SampleDataGroup objects and expects the data as stored in the SampleDataGroup object (i.e., already converted and with mappings applied) as input.

Parameters:

data (Union[tuple, list]) – Flat sequence of values to use.
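The round trip between get_data() and set_data() can be sketched on plain nested dicts (an illustrative mimic of the depth-first flattening order, not the actual implementation):

```python
# Illustrative mimic: values are flattened depth-first in insertion order,
# and a blueprint with the same structure can be refilled from that sequence.

def get_data(group):
    values = []
    for value in group.values():
        if isinstance(value, dict):
            values.extend(get_data(value))  # descend into data group field
        else:
            values.append(value)  # leaf data field contributes its value
    return tuple(values)

def set_data(group, flat):
    flat = list(flat)
    def fill(node):
        for name, value in node.items():
            if isinstance(value, dict):
                fill(value)
            else:
                node[name] = flat.pop(0)  # consume values in the same order
    fill(group)

source = {"image": "IMG", "meta": {"id": 7, "score": 0.5}}
blueprint = {"image": None, "meta": {"id": None, "score": None}}
set_data(blueprint, get_data(source))  # blueprint now holds source's values
```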

set_data_from_dali_generic_iterator_output(data, index)[source]

Set values from the output of a DALI generic iterator.

The DALI generic iterator refers to nvidia.dali.plugin.pytorch.DALIGenericIterator or any other iterator which follows the same interface (tensor types may be from a different framework).

The iterator (and therefore, the underlying DALI pipeline) must output the flattened data in the same format as this instance (as produced by get_data()), with the names assigned to the individual fields in the iterator matching field_names_flat of this object. The compatibility between the object from which the flattened data was obtained and this instance can be checked with type_matches().

See also

get_like_self_filled_from_iterator_output()

Note

Values for string fields are uint8 arrays/tensors (not Python strings). For details, see get_data().

Parameters:
  • data (List[Dict[str, Any]]) – Output of the DALI generic iterator.

  • index (int) – Index inside data from which to fill the data.

has_child(name)[source]

Check whether a direct child with the given name exists.

Parameters:

name (Union[str, int]) – Name of the child to check

Returns:

bool – Whether child exists.

add_data_field(name, type, mapping=None)[source]

Add a data field as a direct child.

Data field means that the field contains actual data, i.e. is not another data group field (SampleDataGroup instance).

Note

If a mapping is defined, it is applied both to strings and to (possibly nested, multi-dimensional) sequences of strings (lists/tuples/arrays). The mapping is a dictionary from original string values to numeric values. The special key None provides a default value for unmatched inputs.

The mapping is only applied when data is assigned inside the input callable/iterable or outside the DALI pipeline. The mapping is not performed for assignments inside the actual DALI pipeline (and setting data there is only supported directly using numerical values).

Note

Alternatively to using a mapping, strings can be directly assigned to data fields by setting the data type to nvidia.dali.types.DALIDataType.STRING. However,

  • String processing in this way is only supported inside the input callable/iterable and outside the DALI pipeline, and such strings appear as uint8 tensors inside the DALI pipeline.

  • Only single strings can be assigned, not sequences of strings (although outputting 1D sequences of strings is supported to enable output of batch-wise data).

  • Using a mapping is often advantageous for meaningfully processing the data in the pipeline, and may be needed anyway for other reasons (e.g. to convert class labels from strings to integers to be used in the loss computation).

This way of handling strings is e.g. useful to pass sample tags or other high-level descriptors through the pipeline.

Parameters:
  • name (Union[str, int]) – Name of the field to add

  • type (DALIDataType) – Type of (the elements of) the field to add. If a mapping is used, this is the type after mapping is applied.

  • mapping (Optional[Dict[Optional[str], Union[int, float, number, bool]]], default: None) – Mapping from input string values to numerical values. The conversion from string to numeric happens at data assignment (unless applying the mapping is disabled). The key None may be included; its value is used when the input string(s) do not match any of the other keys. The mapping is applied both when a single string is assigned and for (n-dimensional) sequences of strings. Note that if a mapping is set, numeric values can still be assigned directly to the data field as an alternative to strings.
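Example

To illustrate the mapping semantics described above, here is a minimal, self-contained sketch in plain Python (apply_mapping is a hypothetical helper for illustration only, not part of the API; the actual conversion happens inside SampleDataGroup at assignment time):

```python
def apply_mapping(value, mapping):
    """Recursively map strings (and nested sequences of strings) to numbers.

    The key None, if present, supplies a default for unmatched strings.
    Non-string values are passed through unchanged.
    """
    if isinstance(value, str):
        if value in mapping:
            return mapping[value]
        if None in mapping:
            return mapping[None]
        raise KeyError(f"No mapping for {value!r} and no default (None key) given")
    if isinstance(value, (list, tuple)):
        return type(value)(apply_mapping(v, mapping) for v in value)
    return value  # already numeric, leave as is


mapping = {"cat": 0, "dog": 1, None: -1}
print(apply_mapping("dog", mapping))                       # 1
print(apply_mapping([["cat", "dog"], ["bird"]], mapping))  # [[0, 1], [-1]]
```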

add_data_group_field(name, blueprint_sample_data_group)[source]

Add a data group field as a direct child.

Data group field means a child of the type SampleDataGroup, which itself can contain data fields and/or data group fields. Data group fields are used to group elements together logically.

blueprint_sample_data_group acts as a blueprint. A new empty instance with the same format is created and added as the child. Values can be assigned later directly (or via set_item_in_path()).

Parameters:
  • name (Union[str, int]) – Name of the data group field to add

  • blueprint_sample_data_group (SampleDataGroup) – SampleDataGroup acting as the blueprint for the new child (an empty instance with the same format is created).
add_data_field_array(name, type, num_fields, mapping=None)[source]

Add a data field array.

Add a child data group field (type SampleDataGroup) that contains num_fields elements, each with the type and mapping defined here. Elements are added with integer names from 0 to num_fields - 1, so the child behaves like an array.

Note

If the array already exists as a separate, independent blueprint, you can use add_data_group_field() to add it to this object.

Parameters:
  • name (str) – Name of the array data group field to add

  • type (DALIDataType) – Type of the fields to add to the array data group field

  • num_fields (int) – Number of fields to add to the array data group field

  • mapping (Optional[Dict[Optional[str], Union[int, float, number, bool]]], default: None) – Optional mapping for the fields (see add_data_field() for details on mappings).
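Example

The resulting structure can be pictured as a child group whose elements carry integer names; a plain-Python analogue (make_array_group is a hypothetical helper for illustration only, not part of the API):

```python
def make_array_group(num_fields, make_element):
    """Build a dict with integer keys 0 .. num_fields - 1, one fresh element each."""
    return {i: make_element() for i in range(num_fields)}


# Three fields of the same kind, addressable like an array:
# group[0], group[1], group[2]
group = make_array_group(3, lambda: {"type": "FLOAT", "data": None})
print(sorted(group))  # [0, 1, 2]
```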

add_data_group_field_array(name, blueprint_sample_data_group, num_fields)[source]

Add a data group field array.

Add a child data group field (type SampleDataGroup) that contains num_fields elements, each matching the provided blueprint. Elements are added with integer names from 0 to num_fields - 1 so the child behaves like an array.

Note

If the array already exists as a separate, independent blueprint, you can use add_data_group_field() to add it to this object.

Parameters:
  • name (str) – Name of the array data group field to add

  • blueprint_sample_data_group (SampleDataGroup) – SampleDataGroup describing the element format (each element is initialized from get_empty_like_self() of the blueprint).

  • num_fields (int) – Number of elements to add.

remove_field(name)[source]

Delete the direct child with the given name.

Parameters:

name (Union[str, int]) – Name of the child to remove.

remove_all_occurrences(name_to_remove)[source]

Remove all fields with a given name.

All fields with the given name are removed in the tree of which self is the root, i.e. from this node and its descendants.

See also

remove_field()

Parameters:

name_to_remove (Union[str, int]) – Name of the field(s) to remove

find_all_occurrences(name_to_find)[source]

Find all occurrences of fields with a given name.

The search is performed in the tree where self is the root, i.e. of this node and its descendants.

Parameters:

name_to_find (Union[str, int]) – Name of the field(s) to find

Returns:

Tuple[Tuple[Union[str, int]]] – Paths to the found fields. If none were found, an empty tuple is returned. The individual paths are themselves tuples. For example, the path ("name_1", "name_2", "name_3") would denote the element self["name_1"]["name_2"]["name_3"].
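Example

The path format of the return value can be illustrated with a self-contained analogue that searches a nested dict (find_paths is a hypothetical helper mimicking find_all_occurrences(), not part of the API):

```python
def find_paths(tree, name_to_find, prefix=()):
    """Return all paths (tuples of keys) at which name_to_find occurs in a nested dict."""
    paths = []
    for key, value in tree.items():
        if key == name_to_find:
            paths.append(prefix + (key,))
        if isinstance(value, dict):
            # Descend into nested groups, extending the path prefix.
            paths.extend(find_paths(value, name_to_find, prefix + (key,)))
    return tuple(paths)


tree = {"a": {"label": 1, "b": {"label": 2}}, "label": 3}
print(find_paths(tree, "label"))
# (('a', 'label'), ('a', 'b', 'label'), ('label',))
```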

get_num_occurrences(name_to_find)[source]

Get the number of occurrences of fields with a given name.

Returns the number of occurrences in the tree where self is the root, i.e. of this node and its descendants.

Parameters:

name_to_find (Union[str, int]) – Name to search for.

Returns:

int – Number of occurrences

change_type_of_data_and_remove_data(path, new_type, new_mapping=None)[source]

Change the type of a child field and remove its data.

The data is removed as it is incompatible with the new type. Note that removing the data means resetting the reference, not actively deleting the data.

Example

A typical use case would be:

  1. Get the data of which the type should be changed, e.g.: data = obj["name"]

  2. Change the data type

    1. Change the data type as stored in the structure, e.g.: obj.change_type_of_data_and_remove_data("name", dali.types.DALIDataType.FLOAT)

    2. Convert the actual data, e.g.: data = dali.fn.cast(data, dtype=types.DALIDataType.FLOAT)

  3. Write data back, e.g.: obj["name"] = data

Note that instead of "name", a nested path can be used.

Parameters:
  • path – Path to the child field whose type should be changed (a single name or a nested path).

  • new_type (DALIDataType) – New type of (the elements of) the field.

  • new_mapping (Optional[Dict[Optional[str], Union[int, float, number, bool]]], default: None) – New mapping for the field (see add_data_field() for details on mappings).
get_flat_index_first_discrepancy_to_other(other)[source]

Get the first flat index where two instances differ in field structure, name, or type.

Compares flattened field names and types (see field_names_flat, field_types_flat). The flattened names include full paths, making structural differences visible. Empty sample data group nodes (no data field descendants) are ignored.

Parameters:

other (SampleDataGroup) – Other SampleDataGroup instance to compare to.

Returns:

int – Index where the first difference is present, or -1 if there are no differences. Note that string fields are compared as nvidia.dali.types.DALIDataType.UINT8 in the flattened types, matching field_types_flat.
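Example

Conceptually, the comparison walks two flattened (name, type) sequences in parallel and reports the first index at which they disagree. A self-contained sketch (the real method operates on field_names_flat and field_types_flat):

```python
def first_discrepancy(flat_a, flat_b):
    """Return the first index where two flattened (name, type) lists differ, or -1."""
    for i, (a, b) in enumerate(zip(flat_a, flat_b)):
        if a != b:
            return i
    if len(flat_a) != len(flat_b):
        return min(len(flat_a), len(flat_b))  # one list ends early
    return -1


a = [("image", "FLOAT"), ("label", "INT64")]
b = [("image", "FLOAT"), ("label", "FLOAT")]
print(first_discrepancy(a, b))  # 1
print(first_discrepancy(a, a))  # -1
```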

ensure_uniform_size_in_batch(fill_value)[source]

For each data field, ensure uniform size in batch by padding with fill_value.

This is equivalent to calling dali.fn.pad(field_values) for all contained data fields (in this data group field, and its descendants).

Warning

  • This method must be called inside the DALI pipeline (but not inside the input callable/iterable).

  • Scalar (i.e. 0D) tensors are not supported. If such tensors are present, an error will be raised.

Parameters:

fill_value (Union[int, float]) – Fill value to be used for the padded region.
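Example

The padding behaviour corresponds to dali.fn.pad(); a self-contained analogue operating on plain Python lists (pad_batch is hypothetical, for illustration only):

```python
def pad_batch(batch, fill_value):
    """Pad 1D samples in a batch to the length of the longest sample."""
    max_len = max(len(sample) for sample in batch)
    return [sample + [fill_value] * (max_len - len(sample)) for sample in batch]


print(pad_batch([[1, 2, 3], [4], [5, 6]], 0))
# [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```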

ensure_uniform_size_in_batch_for_all_strings()[source]

Ensure uniform size in batch for all string data fields.

This is useful before outputting from the DALI pipeline in a format that expects uniform size. A padding with 0-values is performed for all string data fields. This is done for all contained string data fields (in this data group field, and its descendants).

Note

When obtaining the data as strings, the padding is removed and only the actual data is returned.
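Example

Conceptually, string fields travel through the pipeline as uint8 arrays; zero-padding and its later removal can be sketched as follows (a plain-Python analogue with hypothetical helper names, not the actual implementation):

```python
def encode_padded(strings):
    """Encode strings as lists of uint8 values, zero-padded to the longest length."""
    encoded = [list(s.encode("utf-8")) for s in strings]
    max_len = max(len(e) for e in encoded)
    return [e + [0] * (max_len - len(e)) for e in encoded]


def decode_stripped(padded):
    """Decode uint8 lists back to strings, dropping the trailing zero padding."""
    return [bytes(e).rstrip(b"\x00").decode("utf-8") for e in padded]


batch = encode_padded(["cat", "horse"])  # all samples padded to length 5
print(decode_stripped(batch))  # ['cat', 'horse']
```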

is_data_field(name)[source]

Check whether a child field is a data field.

Parameters:

name (Union[str, int]) – Name of the child field to check.

Returns:

bool – Whether the child field is a data field (contains values) as opposed to a data group field (field of type SampleDataGroup).

is_data_group_field(name)[source]

Check whether a child field is a data group field.

Parameters:

name (Union[str, int]) – Name of the child field to check.

Returns:

bool – Whether the child field is a data group field (field of type SampleDataGroup).

to_dictionary()[source]

Get a nested dictionary with the same (nested) data structure and contained values.

This object and all descendant SampleDataGroup objects are converted to dict objects. Contained strings are returned as Python strings.

Returns:

dict – Resulting dictionary.

static get_numpy_type_for_dali_type(dali_type)[source]

Get the numpy dtype corresponding to a DALI data type.

Note

Only numeric and boolean DALI types are supported. A ValueError is raised for unsupported types.

Returns:

type – The numpy type corresponding to dali_type.
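Example

Internally, such a conversion can be pictured as a simple lookup table. An illustrative, self-contained sketch using DALI type names as strings (the real method takes nvidia.dali.types.DALIDataType values and returns numpy types):

```python
# Illustrative subset of a DALI-type-name -> numpy-dtype-name lookup table.
_DALI_TO_NUMPY = {
    "FLOAT": "float32",
    "FLOAT64": "float64",
    "INT32": "int32",
    "INT64": "int64",
    "UINT8": "uint8",
    "BOOL": "bool",
}


def numpy_dtype_name_for_dali_type(dali_type_name):
    """Look up the numpy dtype name; raise ValueError for unsupported types."""
    try:
        return _DALI_TO_NUMPY[dali_type_name]
    except KeyError:
        raise ValueError(f"Unsupported DALI type: {dali_type_name}")


print(numpy_dtype_name_for_dali_type("FLOAT"))  # float32
```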

check_has_children(data_field_children=None, data_group_field_children=None, data_field_array_children=None, data_group_field_array_children=None, current_name=None)[source]

Check that required children are present; raise ValueError if not.

Convenience helper for validating presence and kinds of children.

Parameters:
  • data_field_children (Union[str, int, Sequence[Union[str, int]], None], default: None) – Required child names which must be data fields.

  • data_group_field_children (Union[str, int, Sequence[Union[str, int]], None], default: None) – Required child names which must be data group fields.

  • data_field_array_children (Union[str, int, Sequence[Union[str, int]], None], default: None) – Required child names which must be arrays of data fields.

  • data_group_field_array_children (Union[str, int, Sequence[Union[str, int]], None], default: None) – Required child names which must be arrays of data group fields.

  • current_name (Optional[str], default: None) – Name of the current element. Optional, only used to provide clearer error messages.

Raises:

ValueError – If a required child is not present or is not of the expected type.