Input – Passing Data to the Pipeline
Input Callables & Iterables
The input data for the pipeline is provided by a callable or iterable class, i.e. an object which implements
the IterableBase or CallableBase
interface. These classes are expected to provide the data format blueprint and the actual data as follows:
Providing the data format blueprint (also see Data Format: Sample Data Group):
- Described by a SampleDataGroup blueprint (i.e. an object with the data format set up, but without any actual data)
- Needs to be returned by the overridden CallableBase.used_sample_data_structure() or IterableBase.used_sample_data_structure()

Providing the actual data:
- Data is output from the input callable/iterable as a flat sequence of data fields, as can be obtained by calling get_data() on a SampleDataGroup object whose data format corresponds to the provided blueprint
- Note that while for both input callables and iterables the data structure is the same (as described by the blueprint), the actual data fields (returned as a flat sequence) differ:
  - For input callables, each data field corresponds to one sample
  - For input iterables, each data field corresponds to one batch
- The data needs to be returned by the overridden CallableBase.__call__() or IterableBase.__next__()
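For illustration, below is a minimal sketch of a custom input callable. The base class and method names (CallableBase, used_sample_data_structure(), __call__(), SampleDataGroup, get_data()) come from the description above; the import path, the way the blueprint fields are declared and filled (add_field()/set_data()), and the __call__() argument are assumptions for illustration only and may differ from the actual package API:

    import numpy as np

    # Import path, add_field() and set_data() are assumptions for illustration only.
    from dali_pipeline_framework import CallableBase, SampleDataGroup

    class MyInputCallable(CallableBase):
        """Toy per-sample input callable; a real implementation would read a dataset."""

        def used_sample_data_structure(self):
            # Data format blueprint: a SampleDataGroup with the format set up, but no data.
            blueprint = SampleDataGroup()
            blueprint.add_field("image", dtype=np.uint8)  # hypothetical field declaration
            blueprint.add_field("label", dtype=np.int32)
            return blueprint

        def __call__(self, sample_index):
            # Per-sample mode: each returned data field corresponds to one sample.
            sample = self.used_sample_data_structure()
            sample.set_data("image", np.zeros((480, 640, 3), dtype=np.uint8))  # placeholder data
            sample.set_data("label", np.int32(sample_index % 10))
            # Return the flat sequence of data fields, as produced by get_data().
            return sample.get_data()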
Note
The flattened data sequence returned by the input callable/iterable is converted back into the
structured format automatically by the pipeline using the provided blueprint, so that the user does not
need to worry about the conversion and can assume that
SampleDataGroup objects are used throughout the
pipeline.
The blueprint provided by the input callable/iterable is used by the pipeline to obtain the data format after each processing step and check for compatibility (see Pipeline & Processing Steps).
Note
The data format for the output of the pipeline is in turn needed to auto-convert the output
from the flat sequence back into the structured format, e.g. using the
DALIStructuredOutputIterator (see Output – DALI Structured Output Iterator).
See also
Note that there are pre-defined input callables/iterables which cover a wide range of use cases, so that there is often no need to implement a custom callable/iterable.
The pre-defined input callables/iterables are designed to be agnostic of the actual dataset, which is
interfaced through a DataProvider class (see section
below).
The input data provider is expected to provide the data in the structured format (i.e. as a
SampleDataGroup object) one sample at a time,
and the conversion to the flattened format (and batching of samples if needed) is performed
internally by the input callable/iterable. In this way, the conversion is transparent to the user of the
package when using the included input callables/iterables.
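To make the sample vs. batch distinction concrete, the snippet below shows in plain Python/NumPy (independent of the package) how per-sample flat sequences relate to a per-batch flat sequence; the actual batching logic inside the pre-defined callables/iterables may of course differ:

    import numpy as np

    def batch_flat_samples(flat_samples):
        # flat_samples: per-sample flat sequences, e.g. [(image_0, label_0), (image_1, label_1), ...]
        # Returns one per-batch flat sequence: each data field becomes the list of that field's
        # values over the batch, which is what an input iterable would hand to the pipeline.
        return tuple(list(field_values) for field_values in zip(*flat_samples))

    sample_0 = (np.zeros((4, 4, 3), dtype=np.uint8), np.int32(0))
    sample_1 = (np.ones((4, 4, 3), dtype=np.uint8), np.int32(1))
    images, labels = batch_flat_samples([sample_0, sample_1])
    assert len(images) == 2 and len(labels) == 2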
Input Data Provider
The pre-defined input callables/iterables are generic, and do not assume a specific dataset. Dataset- (and
use-case-) specific functionality is provided by a class implementing the
DataProvider interface. Such data providers can be used by
the input callable/iterable to read the actual data from the dataset. The task of data reading is split as
follows:
Input callable/iterable:
- Define which sample(s) to load
- Use the data provider to load the actual data
- Convert the data to the flat format which needs to be returned by the input callable/iterable
- When outputting the data format (CallableBase.used_sample_data_structure() or IterableBase.used_sample_data_structure()), this information is internally obtained from the data provider
- Similarly, the length of the dataset is internally derived from the data provider (but may be modified, e.g. by sharding, dropping of samples, converting to a number of batches, etc.)
Data provider:
- Given a sample index, return the corresponding sample data (see get_data())
- Define the data format for the sample data (see sample_data_structure())
- Provide the number of samples in the dataset (see get_number_of_samples())
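A minimal data provider sketch is shown below. The three method names (get_data(), sample_data_structure(), get_number_of_samples()) come from the interface description above; the import path and the blueprint field declaration/filling (add_field()/set_data()) are assumptions for illustration:

    import numpy as np

    # Import path, add_field() and set_data() are assumptions for illustration only.
    from dali_pipeline_framework import DataProvider, SampleDataGroup

    class RandomImageProvider(DataProvider):
        """Toy provider serving random images/labels; a real provider would read a dataset."""

        def __init__(self, num_samples=100):
            self._num_samples = num_samples

        def sample_data_structure(self):
            # Data format blueprint for one sample (format only, no actual data).
            blueprint = SampleDataGroup()
            blueprint.add_field("image", dtype=np.uint8)  # hypothetical field declaration
            blueprint.add_field("label", dtype=np.int32)
            return blueprint

        def get_data(self, sample_index):
            # Return the data for the given sample, in the structured format.
            sample = self.sample_data_structure()
            sample.set_data("image", np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
            sample.set_data("label", np.int32(sample_index % 10))
            return sample

        def get_number_of_samples(self):
            return self._num_samples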
This means that while the pre-defined input callables/iterables are a fixed and re-usable part of the package, the data provider is specific to the dataset and use case and needs to be implemented by the user of the package.
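A pre-defined input callable would then be combined with such a provider roughly as follows; the callable class name and its constructor arguments are placeholders, since the actual pre-defined classes ship with the package:

    # "SampleInputCallable" and its arguments are placeholders for one of the pre-defined
    # input callables; see the package documentation for the actual class names.
    provider = RandomImageProvider(num_samples=1000)
    input_callable = SampleInputCallable(data_provider=provider)

    # The blueprint and the dataset length are derived internally from the provider;
    # __call__() returns the flattened data for one sample.
    blueprint = input_callable.used_sample_data_structure()
    flat_sample = input_callable(0)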
See also
Note that there are data providers for the NuScenes dataset in the examples folder of the package (
packages/dali_pipeline_framework/examples/pipeline_setup/additional_impl/data_loading). They can be used
as a reference for implementing data providers for other datasets. They do not read all the data available
in NuScenes (e.g. lidar point clouds), but rather focus on the data which is needed for the use cases
at hand. Depending on your use-case, it may be possible to use them as is, or at least as a starting point
for a customized implementation.
Please also see the Data Loading for NuScenes page for more details on the design of the data loaders.
Also, note that some reusability between use cases is possible by implementing common functionality which can be shared by different data providers. This is also discussed on the Data Loading for NuScenes page.