API Reference

class accvlab.batching_helpers.RaggedBatch(tensor, mask=None, sample_sizes=None, non_uniform_dim=None)[source]

Class for representing batches with samples with variable size in one dimension.

The representation of the batch contains 3 tensors:
  • tensor:

    This is the actual data. It has the size of the largest sample in the non-uniform dimension, and the other samples are padded on the “right” (i.e. at the end containing larger indices) with filler values to match the size of the largest sample. While the padding is typically initialized with 0, no values should be assumed for the padded region as the values there may change after operations are performed on the data. If the non-uniform dimension is dim==num_batch_dims, the shape is (*batch_dims_shape, max_sample_size, *data_shape). More generally, the first dimensions are the batch dimensions (one or more). The non-uniform dimension can be any dimension after the batch dimensions and the size of the non-uniform dimension always corresponds to the maximum sample size in the batch. The remaining dimensions correspond to the shape of the data, which can have any number of dimensions, including 0 (per-object scalar data).

  • mask:

    This is the mask indicating which elements are valid (True) and which are not (False). It has dimensions: (*batch_dims_shape, max_sample_size). The dimension after the batch dimensions corresponds to the non-uniform dimension in the data tensor.

  • sample_sizes:

    Sizes of the individual samples, i.e. the actual sizes without padding along the non-uniform dimension. Shape: (*batch_dims_shape,)

Additional attributes describing the batch:
  • non_uniform_dim:

    Indicates which dimension is the non-uniform dimension

  • num_batch_dims:

    Number of batch dimensions at the beginning of the tensor

Note

The tensors described above correspond to the tensor, mask, and sample_sizes attributes, respectively. The non-uniform dimension can be accessed as non_uniform_dim and the number of batch dimensions as num_batch_dims.

Important

The mask and non_uniform_dim attributes may be shared between RaggedBatch instances with different data tensors, so they should be treated as constants and never changed in-place.

Example

Here, we show an example of a RaggedBatch instance.

In the image:
  • Letters indicate data entries that are valid (i.e. correspond to the actual data).

  • ‘*’ indicates padded filler entries (i.e. invalid entries) in the data.

Example of a RaggedBatch
Note that:
  • The example shows a single batch dimension of size 4. More batch and data dimensions are supported.

  • The maximum sample size (i.e. the size of the non-uniform dimension) is 3.

  • Each element in self.tensor may represent a single value (corresponding to scalar data and 0 data dimensions), or itself represent a non-scalar entry (in the case of one or more data dimensions).

  • Even if more data dimensions are present, the mask always has num_batch_dims + 1 dimensions, as the data dimensions are not needed in the mask.

  • The sample_sizes have the same shape as the batch dimensions (i.e. (4,) in this example), as they contain one value per sample.

  • The sample_sizes and mask contain the same information. However:

    • Depending on the use case, one of them may be more efficient & convenient to use

    • One can be efficiently computed from the other (as is done as needed in the RaggedBatch implementation).

Note

The number of batch dimensions is determined from the shape of the provided mask or sample_sizes tensor.

Warning

If both mask and sample_sizes are set, they need to be consistent with each other. This is not checked in the constructor. Inconsistent masks and sample sizes will lead to undefined behavior.

Parameters:
  • tensor (Tensor) – Data to be stored (corresponding to the tensor tensor of RaggedBatch, see description above)

  • mask (Optional[Tensor], default: None) – Mask indicating which entries are valid (corresponding to the mask tensor of RaggedBatch, see description above). If not set, sample_sizes is internally used to create a mask. Note that at least one of mask or sample_sizes needs to be set.

  • sample_sizes (Optional[Tensor], default: None) – Number of valid entries for all samples (corresponding to the sample_sizes tensor of RaggedBatch, see description above). If not set, mask is internally used to create a sample sizes tensor. Note that at least one of mask or sample_sizes needs to be set.

  • non_uniform_dim (Optional[int], default: None) – Dimension in which the batch is non-uniform, default: 1
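
The following is a minimal construction sketch (not taken from the library's own examples; it assumes the documented signature and uses illustrative PyTorch values):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch
>>> padded = torch.zeros(4, 3)                      # 4 samples, up to 3 entries each
>>> padded[0, :2] = torch.tensor([1.0, 2.0])        # sample 0: 2 valid entries
>>> padded[1, :3] = torch.tensor([3.0, 4.0, 5.0])   # sample 1: 3 valid entries
>>> padded[2, :1] = torch.tensor([6.0])             # sample 2: 1 valid entry
>>> padded[3, :2] = torch.tensor([7.0, 8.0])        # sample 3: 2 valid entries
>>> batch = RaggedBatch(padded, sample_sizes=torch.tensor([2, 3, 1, 2]))
>>> batch.max_sample_size
3
>>> batch.mask.shape                                # (*batch_dims_shape, max_sample_size)
torch.Size([4, 3])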

classmethod FromOversizeTensor(tensor, mask=None, sample_sizes=None, non_uniform_dim=None)[source]

Create a RaggedBatch instance from a tensor which is over-sized in the non-uniform dimension.

Over-sized means that the non-uniform dimension is larger than the maximum sample size in the batch.

Parameters:
  • tensor (Tensor) – Data to be stored (corresponding to the tensor tensor of RaggedBatch, see description above), except that the non-uniform dimension is larger than the maximum sample size in the batch. The tensor is truncated to the maximum sample size in the batch.

  • mask (Optional[Tensor], default: None) – Mask indicating which entries are valid (corresponding to the mask tensor of RaggedBatch, see description above). If not set, sample_sizes is internally used to create a mask. Note that at least one of mask or sample_sizes needs to be set. The mask is truncated to the maximum sample size in the batch.

  • sample_sizes (Optional[Tensor], default: None) – Number of valid entries for all samples (corresponding to the sample_sizes tensor of RaggedBatch, see description above). If not set, mask is internally used to create a sample sizes tensor. Note that at least one of mask or sample_sizes needs to be set.

  • non_uniform_dim (Optional[int], default: None) – Dimension in which the batch is non-uniform, default: 1

Return type:

RaggedBatch

Note

The number of batch dimensions is determined from the shape of the provided mask or sample_sizes tensor.

Warning

If both mask and sample_sizes are set, they need to be consistent with each other. This is not checked in the constructor. Inconsistent masks and sample sizes will lead to undefined behavior.
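
A minimal sketch of FromOversizeTensor (assuming the documented truncation behavior and default non-uniform dimension; shapes and values are illustrative):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch
>>> oversize = torch.zeros(4, 8, 2)                 # 8 slots allocated, but at most 3 are used
>>> sizes = torch.tensor([2, 3, 1, 2])
>>> batch = RaggedBatch.FromOversizeTensor(oversize, sample_sizes=sizes)
>>> batch.tensor.shape                              # truncated to the maximum sample size
torch.Size([4, 3, 2])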

classmethod Empty(num_dims, non_uniform_dim, device, num_batch_dims=None, batch_shape=None)[source]

Create an empty instance.

The resulting instance has a size of 0 along all dimensions.

Note

If neither num_batch_dims nor batch_shape is provided, the number of batch dimensions is 1 and the batch shape is (0,).

Parameters:
  • num_dims (int) – Total number of dimensions

  • non_uniform_dim (int) – The non-uniform dimension

  • device (Union[device, str]) – Device to use for the instance

  • num_batch_dims (Optional[int], default: None) – Number of batch dimensions. If provided, batch_shape cannot be set and size 0 is assumed for all batch dimensions.

  • batch_shape (Union[Sequence[int], int, None], default: None) – Shape of the batch (can be a sequence of ints or a single int in case of a single batch dimension). If not provided, the batch shape is (0,) * num_batch_dims. If provided, num_batch_dims cannot be set and the number of batch dimensions is inferred from the shape.

Returns:

RaggedBatch – The resulting empty RaggedBatch instance
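
A minimal sketch (assuming the documented defaults of a single batch dimension with batch shape (0,); the exact resulting shape is an assumption based on the description above):

>>> from accvlab.batching_helpers import RaggedBatch
>>> empty = RaggedBatch.Empty(num_dims=3, non_uniform_dim=1, device="cpu")
>>> empty.num_batch_dims                            # documented default: one batch dimension
1
>>> empty.tensor.shape                              # assumed: size 0 along all three dimensions
torch.Size([0, 0, 0])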

classmethod FromFullTensor(full_tensor, non_uniform_dim=1, num_batch_dims=1)[source]

Create a RaggedBatch instance from a tensor representing a uniform-sized batch.

Parameters:
  • full_tensor (Tensor) – Tensor to convert into a RaggedBatch instance

  • non_uniform_dim (int, default: 1) – Dimension to use as the non-uniform dimension. Note that while in this special case all dimensions are uniform, the non-uniform dimension has a special meaning (e.g. for get_non_uniform_dimension_transposed_to() and many other functions) and needs to be set.

  • num_batch_dims (int, default: 1) – Number of batch dimensions in the tensor. Default: 1

Returns:

RaggedBatch – The resulting RaggedBatch instance containing the input tensor
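
A minimal sketch for wrapping a uniform batch (assuming the documented signature; shapes are illustrative):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch
>>> full = torch.randn(4, 3, 2)                     # uniform batch: 4 samples, 3 entries, 2 features
>>> batch = RaggedBatch.FromFullTensor(full, non_uniform_dim=1, num_batch_dims=1)
>>> batch.sample_sizes.tolist()                     # every sample uses all 3 entries
[3, 3, 3, 3]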

property tensor: Tensor

Get the data tensor

See the description of RaggedBatch for more information on tensor.

For setting the data tensor, use set_tensor().

property mask: Tensor

Get the mask tensor

See the description of RaggedBatch for more information on mask.

The mask indicates which elements are valid (True) and which are not (False). It has dimensions: (*batch_dims_shape, max_sample_size).

property sample_sizes: Tensor

Get the sample sizes tensor

See the description of RaggedBatch for more information on sample_sizes.

The sample sizes tensor contains the actual sizes of each sample in the batch along the non-uniform dimension. Its shape is (*batch_dims_shape,).

property non_uniform_dim: int

Get the non-uniform dimension

property num_batch_dims: int

Get the number of batch dimensions

property batch_shape: Size

Get the batch shape

property total_num_samples_in_batch: int

Get the total number of samples in the batch

property total_num_entries: int

Get the total number of entries.

This is the accumulated number of valid entries along the non-uniform dimension over all samples in the batch. This information is computed from the sample_sizes tensor when it is first accessed and re-used on subsequent calls.

property max_sample_size: int

Get the maximum sample size in the batch

as_self_with_cloned_data()[source]

Create a copy, where the data tensor (i.e. tensor) is cloned (while mask and sample sizes are shared)

Return type:

RaggedBatch

create_with_sample_sizes_like_self(tensor, non_uniform_dim=None, device=None)[source]

Create a RaggedBatch instance with the same batch shape and sample sizes as this

Note that while the sample sizes are the same, the total number of dimensions, the non-uniform dimension, and the sizes of the dimensions other than the batch and non-uniform dimensions may differ.

Parameters:
  • tensor (Tensor) – Data to set for the new instance (padded tensor)

  • non_uniform_dim (Optional[int], default: None) – Non-uniform dimension (in tensor). Can be set to None to use the same dimension as this. Default: None

  • device (Union[device, str, None], default: None) – Device on which to create the resulting RaggedBatch instance. If not provided, the device of the input tensor is used.

Returns:

RaggedBatch – Resulting RaggedBatch instance with the same batch shape and sample sizes as this.

get_non_uniform_dimension_transposed_to(dim)[source]

Get with the non-uniform dimension transposed to a given dimension.

If the given dimension is already the non-uniform dimension, self is returned.

Info:

The non-uniform dimension cannot be set to a batch dimension (i.e., any dimension < num_batch_dims).

Parameters:

dim (int) – Dimension to transpose the current non-uniform dimension to

Returns:

RaggedBatch – Resulting RaggedBatch instance

get_existence_weights(dtype=torch.float32)[source]

Get the existence weights

The existence weights are 1.0 for the contained entries (i.e. entries corresponding to actual data as opposed to padded fillers) and 0.0 for filler entries.

In contrast to self.mask, the dimensionality and shape of the weights correspond to the dimensionality and shape of the data. This means that the weights can be directly applied (e.g. multiplied element-wise) to the data tensor (i.e. tensor), regardless of the number of dimensions or which dimension is the non-uniform dimension.

Parameters:

dtype (dtype, default: torch.float32) – Type for the existence weights. Default: torch.float32

Returns:

Tensor – The resulting weights tensor
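
A minimal usage sketch (an assumption of typical usage, not library-mandated; batch is a RaggedBatch instance, e.g. as in the constructor sketch above): because the weights match the data shape, a plain element-wise multiplication zeroes out the padded entries.

>>> weights = batch.get_existence_weights()         # same shape and dimensionality as batch.tensor
>>> masked_data = batch.tensor * weights            # padded entries are multiplied by 0.0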

with_padded_set_to(value_to_set)[source]

Set filler/padded entries in the data (i.e. tensor) to a fixed value.

Note

This operation is not performed in-place, i.e. self.tensor is not changed. For an in-place operation, use set_padded_to() instead.

Parameters:

value_to_set (float) – Value to set for padded entries.

Returns:

RaggedBatch – Like self, but with the padded values set

set_padded_to(value_to_set)[source]

Set filler/padded entries in the data tensor (i.e. tensor) to a fixed value in-place.

Note

Note that this operation is performed in-place, i.e. self.tensor is changed and no new RaggedBatch instance is created. If this is not desired, use with_padded_set_to() instead, which is not in-place and returns a new RaggedBatch instance.

Parameters:

value_to_set (float) – Value to set for padded entries.

Return type:

None
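
A short sketch contrasting the out-of-place and in-place variants (batch is a RaggedBatch instance, e.g. from the constructor sketch above):

>>> negpad = batch.with_padded_set_to(-1.0)         # new instance; batch.tensor is unchanged
>>> batch.set_padded_to(0.0)                        # in-place; batch.tensor is modified, returns None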

repeat_samples(num_repeats, batch_dim=None)[source]

Repeat along a single batch dimension

Parameters:
  • num_repeats (Union[int, Sequence[int]]) – Number of times to repeat. In case of a single value, the dimension in which to repeat is specified by batch_dim. In case of a sequence, the sequence needs to have the same length as the number of batch dimensions and batch_dim must not be set.

  • batch_dim (Optional[int], default: None) – Which batch dimension to repeat along. Can only be set if num_repeats is a single value. If not set (and num_repeats is a single value), 0 is used.

Returns:

RaggedBatch – Resulting RaggedBatch instance with the samples repeated
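
A minimal sketch (this assumes that num_repeats=2 doubles the size of the chosen batch dimension; batch has batch_shape (4,) as in the constructor sketch above):

>>> repeated = batch.repeat_samples(2, batch_dim=0)
>>> repeated.batch_shape                            # assumed: each of the 4 samples repeated twice
torch.Size([8])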

unsqueeze_batch_dim(dim)[source]

Unsqueeze a batch dimension

Important

The dimension to unsqueeze has to be among the batch dimensions (including adding a new batch dimension after the currently last batch dimension, i.e. dim=self.num_batch_dims).

For unsqueezing a data dimension, use unsqueeze_data_dim() instead.

Note

As the batch dimensions are always before the non-uniform dimension, the non-uniform dimension is shifted by 1 accordingly.

Example

>>> example_batch.num_batch_dims
2
>>> example_batch.non_uniform_dim
4
>>> example_batch_unsqueezed = example_batch.unsqueeze_batch_dim(1)
>>> example_batch_unsqueezed.non_uniform_dim
5
Parameters:

dim (int) – Batch dimension to add. Has to be in range [0, num_batch_dims].

Returns:

RaggedBatch – Resulting RaggedBatch instance with the batch dimension added

squeeze_batch_dim(batch_dim)[source]

Squeeze a batch dimension

Note

This operation is not performed in-place, i.e. self.tensor is not changed.

Parameters:

batch_dim (int) – Batch dimension to squeeze. Has to be in range [0, num_batch_dims).

Returns:

RaggedBatch – Resulting RaggedBatch instance with the batch dimension squeezed

reshape_batch_dims(new_batch_shape)[source]

Reshape the batch dimensions

Note

This operation is not performed in-place, i.e. self.tensor is not changed.

Important

The non-uniform dimension is adjusted to the new batch shape.

Parameters:

new_batch_shape (Union[int, Tuple[int, ...]]) – New batch shape

Returns:

RaggedBatch – Resulting RaggedBatch instance with the batch dimensions reshaped

flatten_batch_dims()[source]

Flatten the batch dimensions

Note

This operation is not performed in-place, i.e. self.tensor is not changed.

Return type:

RaggedBatch

broadcast_batch_dims_to_shape(new_batch_shape)[source]

Broadcast the batch dimensions to the given batch shape.

Return type:

RaggedBatch

static broadcast_batch_dims(data)[source]

Broadcast the batch dimensions of a sequence of RaggedBatch instances to common batch dimensions.

Parameters:

data (Sequence[RaggedBatch]) – Sequence of RaggedBatch instances

Returns:

Sequence[RaggedBatch] – Sequence of RaggedBatch instances with the batch dimensions broadcasted to the common batch dimensions

to_device(device)[source]

Get on device

Return type:

RaggedBatch

cpu()[source]

Get on the CPU

Return type:

RaggedBatch

to_dtype(dtype)[source]

Get with tensor converted to given data type

Return type:

RaggedBatch

detach()[source]

Get with detached tensor

Return type:

RaggedBatch

apply(proc_step)[source]

Apply a function to tensor and get results as new RaggedBatch instance(s).

See the proc_step parameter for requirements for the used function.

Important

It is important to make sure that the tensors returned by proc_step fulfill the output requirements regarding the non-uniform dimension and sample sizes, and that the valid entries are stored first (i.e. at lower indices), followed by filler values along the non-uniform dimension, so that the resulting RaggedBatch instances are correct. See the proc_step parameter for more details.

Parameters:

proc_step (Union[Callable[[Tensor], Union[Tensor, Tuple[Tensor, ...]]], Callable[[Tensor, Tensor], Union[Tensor, Tuple[Tensor, ...]]], Callable[[Tensor, Tensor, Tensor], Union[Tensor, Tuple[Tensor, ...]]]]) –

Function to process the data tensor. All the defined inputs (see below) are expected to be positional arguments.

param tensor:

Will contain tensor of this

param mask:

If part of the function signature, will contain mask of this

param sample_sizes:

As a positional argument, this can only be part of the function signature if mask is. If used, will contain sample_sizes of this

returns:

Either a tensor or a tuple of tensors. For each tensor, a RaggedBatch instance will be output from apply(). Note that for each returned tensor, the non-uniform dimension, as well as the number of entries along that dimension, must correspond to this. Also, for each sample, the valid entries must be located before any filler entries along the non-uniform dimension (as is in general the case for the data stored in a RaggedBatch, see documentation of the class). Note that the last point is generally fulfilled if no permutations are applied to the data tensor, as the input tensor contains valid entries first, followed by filler entries.

Returns:

Union[RaggedBatch, Tuple[RaggedBatch, ...]] – RaggedBatch instance or tuple of RaggedBatch instances (depending on the output of proc_step), with the function applied to the data (i.e. to tensor).
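
A minimal sketch of apply() (assumptions: batch holds scalar per-entry data so tensor and mask have the same shape, and the functions preserve the non-uniform dimension and sample sizes as required above):

>>> doubled = batch.apply(lambda tensor: tensor * 2.0)        # tensor-only signature
>>> zeroed = batch.apply(lambda tensor, mask: tensor * mask)  # tensor and mask as positional arguments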

set_tensor(tensor)[source]

Set tensor.

Important

The batch shape, the non-uniform dimension, and the number of entries along that dimension must correspond to this. Also, for each sample, the valid entries must be located before any filler entries (as is in general the case for the data stored in a RaggedBatch instance, see documentation of the class).

Parameters:

tensor (Tensor) – Data tensor to set

split()[source]

Split contained data (i.e. the data in tensor) into individual samples.

The batch dimensions are preserved in the nested list structure. For example, if the batch shape is (2, 3), the result will be a list of 2 lists, each containing 3 tensors.

The returned samples are cropped to not contain any filler entries. This means that the returned tensors correspond to the actual sample sizes.

Example

In the example below, the split() operation is applied to a RaggedBatch instance with a batch size of 4 (single batch dimension) and a maximum sample size of 3, resulting in a list of 4 tensors, and each tensor corresponding to a single sample without padded filler entries. Note that in the image below:

  • Letters indicate data entries that are valid (i.e. correspond to the actual data).

  • ‘*’ Indicates padded filler entries (i.e. invalid entries) in the data.

Illustration of the split operation

Each depicted element may represent a single value (corresponding to scalar data and 0 data dimensions), or itself represent a non-scalar entry (in the case of one or more data dimensions).

Returns:

Union[List[Tensor], List[List]] – The individual samples in a nested list structure that reflects the original batch shape. The individual tensors correspond to the actual sample sizes, and do not contain padded filler entries.

For a single batch dimension, returns a flat list of tensors. For multiple batch dimensions, returns a nested list structure mirroring the batch dimensions.
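
A minimal sketch of split() for the 4-sample batch from the constructor sketch above (a single batch dimension, scalar per-entry data):

>>> parts = batch.split()
>>> [p.shape for p in parts]                        # cropped to the actual sample sizes
[torch.Size([2]), torch.Size([3]), torch.Size([1]), torch.Size([2])]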

unsqueeze_data_dim(dim)[source]

Unsqueeze the data tensor (i.e. tensor) along a dimension.

Important

The dimension to unsqueeze has to be after the batch dimensions (including adding a new data dimension right after the batch dimensions, i.e. dim=self.num_batch_dims).

For unsqueezing a batch dimension, use unsqueeze_batch_dim() instead.

Note

If the new dimension is inserted before the current non-uniform dimension, the non-uniform dimension is shifted by 1 accordingly.

Example

>>> example_batch.num_batch_dims
1
>>> example_batch.non_uniform_dim
1
>>> example_batch_unsqueezed = example_batch.unsqueeze_data_dim(1)
>>> example_batch_unsqueezed.non_uniform_dim
2
Parameters:

dim (int) – Dimension index into which to insert the new dimension

Returns:

RaggedBatch – Like self, but with the new dimension added, and the non-uniform dimension shifted accordingly if needed

__getitem__(item)[source]

Item read access for tensor

This is a shorthand for: … = self.tensor[item].

Note that as such, this allows for access to filler entries and does not check whether the accessed elements correspond to valid or filler entries.

Return type:

Tensor

__setitem__(item, value)[source]

Item write access for tensor

This is a shorthand for: self.tensor[item] = ….

Note that as such, this allows for access to filler entries and does not check whether the accessed elements correspond to valid or filler entries.

Return type:

None

property device: device

Get the used device

property shape: Size

Get the shape of the data tensor (i.e. tensor)

The size of the non-uniform dimension is reported as the size of the underlying tensor, i.e. the maximum size among all samples.

property dtype: dtype

Type of the data elements (i.e. elements of tensor)

property requires_grad: bool

Get/set whether tensor requires gradients

retain_grad()[source]

Retain gradients for tensor

Return type:

None

property retains_grad: bool

Get whether gradients are retained for tensor

size(*args, **kwargs)[source]

Shorthand for self.tensor.size(*args, **kwargs)

dim()[source]

Get the number of dimensions (of tensor)

Return type:

int

int()[source]

Convert type of tensor elements

Return type:

RaggedBatch

long()[source]

Convert type of tensor elements

Return type:

RaggedBatch

bool()[source]

Convert type of tensor elements

Return type:

RaggedBatch

half()[source]

Convert type of tensor elements

Return type:

RaggedBatch

bfloat16()[source]

Convert type of tensor elements

Return type:

RaggedBatch

float()[source]

Convert type of tensor elements

Return type:

RaggedBatch

double()[source]

Convert type of tensor elements

Return type:

RaggedBatch

cfloat()[source]

Convert type of tensor elements

Return type:

RaggedBatch

cdouble()[source]

Convert type of tensor elements

Return type:

RaggedBatch

to(*args, **kwargs)[source]

Create a new RaggedBatch instance converted as specified.

This is a shorthand for: self.create_with_sample_sizes_like_self(self._tensor.to(*args, **kwargs)).

Note

The conversion is primarily performed on the tensor. The sample_sizes and mask are adjusted accordingly if needed (e.g. when converting to a different device, but not when converting to a different dtype, as the dtype is only relevant for the tensor).

Parameters:

*args, **kwargs – Forwarded to torch.Tensor.to() on the underlying data tensor (see the shorthand above)

Returns:

RaggedBatch – A new RaggedBatch instance with the tensor converted to the given type and the sample_sizes and mask adjusted accordingly if needed.

accvlab.batching_helpers.apply_mask_to_tensor(data, mask, value_to_set=0.0)[source]

Apply mask to tensor

Apply a mask to a tensor, setting any elements in the tensor corresponding to False entries in the mask to a fixed value. The mask may have fewer dimensions than the data. In this case, it is assumed to be constant in the remaining dimensions and the available dimensions correspond to the outer data dimensions (i.e. starting from dim==0 in the data).

Parameters:
  • data (Tensor) – Data to apply mask to

  • mask (Tensor) – Mask to apply

  • value_to_set (float, default: 0.0) – Value to set for padded entries. Default is 0.0

Returns:

Tensor – Like data, but with the mask applied
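
A minimal sketch (assuming the documented broadcasting of a lower-dimensional mask over the trailing data dimensions; values are illustrative):

>>> import torch
>>> from accvlab.batching_helpers import apply_mask_to_tensor
>>> data = torch.ones(2, 3, 4)                      # (batch, entries, features)
>>> mask = torch.tensor([[True, True, False],
...                      [True, False, False]])     # (batch, entries); constant over the feature dim
>>> out = apply_mask_to_tensor(data, mask, value_to_set=0.0)
>>> out[0, 2].tolist()                              # masked-out entry set to 0.0
[0.0, 0.0, 0.0, 0.0]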

accvlab.batching_helpers.average_over_targets(data, nans_to_zero=True)[source]

Average along the non-uniform dimension, considering only the valid entries.

The dimension to average over is data.non_uniform_dim.

Parameters:
  • data (RaggedBatch) – Data to average

  • nans_to_zero (bool, default: True) – Whether to replace NaNs with zeros after averaging. Default is True.

Returns:

Tensor – Tensor containing per-sample averages
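
A minimal sketch (batch as in the constructor sketch above; the non-uniform dimension is reduced, giving one average per sample computed over the valid entries only):

>>> from accvlab.batching_helpers import average_over_targets
>>> means = average_over_targets(batch)
>>> means.shape                                     # assumed: one value per sample for scalar per-entry data
torch.Size([4])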

accvlab.batching_helpers.batched_bool_indexing(input_data, input_mask)[source]

Batched boolean indexing.

This function performs batched boolean indexing on the input data using the input mask. Both the input data and the input mask can be either RaggedBatch or torch.Tensor instances.

The indexing is performed along the non-uniform dimension of the input data. For tensors, the non-uniform dimension is assumed to be dim==1.

In case that one input_data or input_mask is a torch.Tensor and the other is a RaggedBatch:

  • A single batch dimension must be used

  • The sample sizes of the RaggedBatch are assumed to also apply to the tensor

  • The non-uniform dimension of the tensor is assumed to be dim==1

If both the input data and the input mask are tensors:

  • All entries along dim==1 (the non-uniform dimension) are assumed to be valid (i.e. sample size for each sample corresponds to the size of this dimension)

  • The output will also be a RaggedBatch in this case (as in general, the number of True values in the mask is not the same for all samples)

    • A single batch dimension will be used (consistent to the assumption about the input data)

    • The non-uniform dimension will be at dim==1 (consistent to the assumption about the input data)

If both the input data and the input mask are RaggedBatch instances, multiple batch dimensions are supported.

Warning

If both the input data and mask are RaggedBatch instances, it is assumed that the sample sizes match. Only the maximum sample size is checked and if the individual sample sizes are not the same, the behavior is undefined.

Parameters:
  • input_data (Union[RaggedBatch, Tensor]) – The data to index into. Shape (in case of the non-uniform dimension being dim==1): (*batch_shape, max_sample_size, *data_shape), where max_sample_size is the maximum sample size of the input data. Note that the data_shape may contain 0 or more entries. If the non-uniform dimension is not dim==1, the max_sample_size is also not the size of the second dimension (dim==1), but of the corresponding dimension.

  • input_mask (Union[RaggedBatch, Tensor]) – The mask to use for indexing. Shape: (*batch_shape, max_sample_size), where max_sample_size is the maximum sample size of the input data. Note that data_shape is not present, as each data entry is treated as a single element in the indexing operation.

Returns:

RaggedBatch – RaggedBatch instance containing the indexed data for each sample.

Example

In the illustration below:
  • Letters indicate data entries that are indexed in the input (and therefore appear in the output)

  • ‘-’ indicates entries where the actual values are not relevant (in the input).

  • ‘*’ indicates filler values in RaggedBatch instances.

Illustration of the batched boolean indexing operation

Each depicted entry in input_data may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that input_data has more than 2 dimensions). The entries in input_mask are always scalar.
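
A minimal sketch of the tensor/tensor case described above (both inputs are plain tensors, so all entries are treated as valid and the result is a RaggedBatch; values are illustrative):

>>> import torch
>>> from accvlab.batching_helpers import batched_bool_indexing
>>> data = torch.arange(12.0).reshape(4, 3)
>>> mask = torch.tensor([[True, False, True],
...                      [False, True, False],
...                      [True, True, True],
...                      [False, False, False]])
>>> selected = batched_bool_indexing(data, mask)
>>> selected.sample_sizes.tolist()                  # number of True entries per sample
[2, 1, 3, 0]
>>> selected.max_sample_size
3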

accvlab.batching_helpers.batched_bool_indexing_write(to_write, output_mask, to_write_into)[source]

Batched boolean indexing write (inverse operation of batched_bool_indexing).

This function performs the inverse operation of batched_bool_indexing(). It writes data from a RaggedBatch into a target tensor or RaggedBatch using a boolean mask to specify where to write the data.

The writing is performed along the non-uniform dimension of to_write_into. For tensors, the non-uniform dimension is assumed to be dim==1.

In case that one output_mask or to_write_into is a torch.Tensor and the other is a RaggedBatch:

  • A single batch dimension must be used

  • The sample sizes of the RaggedBatch are assumed to also apply to the tensor (regardless of which of the two is which)

  • The non-uniform dimension of the tensor is assumed to be dim==1

If both output_mask and to_write_into are tensors, all entries along dim==1 (the non-uniform dimension) are assumed to be valid (i.e. sample size for each sample corresponds to the size of this dimension).

Multiple batch dimensions are only supported if both output_mask and to_write_into are RaggedBatch instances.

Warning

If both output_mask and to_write_into are RaggedBatch instances, it is assumed that the sample sizes match. Only the maximum sample size is checked and if the individual sample sizes are not the same, the behavior is undefined.

Parameters:
  • to_write (RaggedBatch) – The RaggedBatch containing the data to write. Shape (in case of the non-uniform dimension being dim==1): (*batch_shape, max_sample_size, *data_shape), where max_sample_size is the maximum sample size of the data to write. Note that the data_shape may contain 0 or more entries. If the non-uniform dimension is not dim==1, the max_sample_size is also not the size of the second dimension (dim==1), but of the corresponding dimension.

  • output_mask (Union[RaggedBatch, Tensor]) – The mask specifying where to write the data. Shape: (*batch_shape, max_sample_size). Note that data_shape is not present, as each data entry is treated as a single element in the writing operation.

  • to_write_into (Union[RaggedBatch, Tensor]) – The target tensor or RaggedBatch to write into. Shape: (*batch_shape, max_sample_size, *data_shape). The data_shape must match the data_shape of to_write. If the non-uniform dimension is not dim==1, the max_sample_size is also not the size of the second dimension (dim==1), but of the corresponding dimension.

Returns:

Union[RaggedBatch, Tensor] – RaggedBatch or torch.Tensor instance containing the target data with the selected elements from to_write written into the positions specified by output_mask.

Example

In the illustration below:
  • Letters indicate data entries that are indexed in the input (and therefore appear in the output)

  • ‘*’ indicates filler values in RaggedBatch instances.

  • ‘..’ indicates data which remains unchanged, i.e. is the same as in the to_write_into parameter and the output.

Illustration of the batched boolean indexing write operation

Each depicted entry in to_write and to_write_into may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that the data has more than 2 dimensions). The entries in output_mask are always scalar.
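
A minimal sketch of the write operation (the mask and target are plain tensors with a single batch dimension; values are illustrative and the expected result is an assumption based on the description above):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch, batched_bool_indexing_write
>>> target = torch.zeros(2, 3)
>>> mask = torch.tensor([[True, False, True],
...                      [False, True, False]])
>>> to_write = RaggedBatch(torch.tensor([[1.0, 2.0], [3.0, 0.0]]),
...                        sample_sizes=torch.tensor([2, 1]))
>>> written = batched_bool_indexing_write(to_write, mask, target)
>>> # expected: written[0] == [1.0, 0.0, 2.0] and written[1] == [0.0, 3.0, 0.0]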

accvlab.batching_helpers.batched_index_mapping(source_data, source_indices, target_indices, target_data)[source]

This function expects input data to be on the GPU.

Map values between source_data and target_data using a mapping defined by pairs of indices for elements in source_data and target_data, and set the corresponding values in target_data.

For a sample i and a valid index pair j (valid means target_indices.sample_sizes[i] > j and source_indices.sample_sizes[i] > j), the operation can be expressed as (assuming non-uniform dimension dim == 1 for both source_data and target_data):

target_data[i, target_indices[i, j]] = source_data[i, source_indices[i, j]]

This function sets the values in target_data in a way which corresponds to the line above for all valid matches in all samples.

Warning

It is expected that for each sample, the number of valid indices for the source and target matches, i.e. target_indices.sample_sizes == source_indices.sample_sizes.

If this is not the case, the behavior is undefined.

Warning

This function assumes that for each sample, there are no duplicate indices in target_indices, i.e. there are no duplicates in the valid entries in: target_indices[i, 0:target_indices.sample_sizes[i]].

If this is not the case, the behavior is undefined.

There are no such restrictions on source_indices.

Parameters:
  • source_data (Union[Tensor, RaggedBatch]) –

    Input data.

    Shape in case of a tensor:

    (batch_size, num_entries_input, …)

    In case of a RaggedBatch, the following holds:

    • source_data.shape[0] == batch_size

    • source_data.shape[source_data.non_uniform_dim] == num_entries_input

    The number of dimensions needs to correspond to target_data, and the shape needs to be the same except in the non-uniform dimension (dim == 1 for tensors), as in this dimension, the matching is done using the index pairs.

  • source_indices (RaggedBatch) – Indices at which to get the input data. Shape: (batch_size, max_num_matches). Note that the batch size and sample sizes need to match target_indices, i.e. target_indices.sample_sizes == source_indices.sample_sizes and max_num_matches corresponds to the maximum matches among all samples.

  • target_indices (RaggedBatch) – Indices at which to fill the data. Shape: (batch_size, max_num_matches) Note that max_num_matches corresponds to the maximum matches among all samples.

  • target_data (Union[Tensor, RaggedBatch]) –

    Data to fill the values into.

    Shape in case of a tensor: (batch_size, num_entries_output, …)

    In case of a RaggedBatch, the following holds:
    • target_data.shape[0] == batch_size

    • target_data.shape[target_data.non_uniform_dim] == num_entries_output

Returns:

Union[Tensor, RaggedBatch] – As target_data, with the values from source_data inserted according to the pairs of corresponding indices in source_indices and target_indices.

Shape in case of a tensor:

(batch_size, num_entries_output, ...)

In case of a RaggedBatch, the following holds:

  • target_data_filled.shape[0] == batch_size

  • target_data_filled.shape[target_data_filled.non_uniform_dim] == num_entries_output

Example

  • ‘-’ indicates values in the input which are not used in the mapping

  • ‘*’ indicates filler values in source_indices and target_indices, which are ignored.

  • ‘..’ indicates data which remains unchanged, i.e. is the same as in the target_data parameter and the output.

Each depicted element in source_data and target_data may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that the data has more than 2 dimensions).

Illustration of the mapping operation
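
A minimal sketch of the mapping (values are illustrative; per the note above, the inputs are expected to live on the GPU, so a CUDA device is assumed):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch, batched_index_mapping
>>> dev = "cuda"
>>> source = torch.arange(8.0, device=dev).reshape(2, 4)        # (batch, num_entries_input)
>>> target = torch.zeros(2, 5, device=dev)                      # (batch, num_entries_output)
>>> src_idx = RaggedBatch(torch.tensor([[0, 2, 0], [1, 3, 2]], device=dev),
...                       sample_sizes=torch.tensor([2, 3], device=dev))
>>> tgt_idx = RaggedBatch(torch.tensor([[4, 1, 0], [0, 2, 4]], device=dev),
...                       sample_sizes=torch.tensor([2, 3], device=dev))
>>> filled = batched_index_mapping(source, src_idx, tgt_idx, target)
>>> # e.g. for sample 0: filled[0, 4] == source[0, 0] and filled[0, 1] == source[0, 2]
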
accvlab.batching_helpers.batched_indexing_access(input_data, input_indices, filler_value=0.0, dim_to_index_in=None)[source]

This function expects input data to be on the GPU.

Batched indexing access with non-uniform indices.

Note that for each sample, the number of resulting entries corresponds to the number of indices. This means that in general, the output size will be non-uniform. Therefore, a RaggedBatch is returned regardless of the input_data type.

Note

Note that whether input_data is a RaggedBatch or a torch.Tensor, the indexing of input_data is performed along dim_to_index_in, which is not necessarily the non-uniform dimension of input_data.

Warning

While the filler_value parameter can be used to set the value for filler values, the filler value may change when processing the resulting RaggedBatch further. Therefore, care needs to be taken when assuming a certain filler value.

Parameters:
  • input_data (Union[RaggedBatch, Tensor]) – Data to which the indexing is applied.

  • input_indices (RaggedBatch) – For each sample (element along the batch dimension), the indices of entries to obtain from the input. Shape: (*batch_shape, max_num_indices) Here, max_num_indices corresponds to the maximum number of indices over all samples.

  • filler_value (float, default: 0.0) – Filler values for the remaining elements in the output (corresponding to the fillers in input_indices). Default: 0.0

  • dim_to_index_in (Optional[int], default: None) – Dimension on which to apply the indexing. Cannot be a batch dimension of the input indices. If not set, it corresponds to input_indices.non_uniform_dim.

Returns:

RaggedBatch – Result containing the indexed entries from the input tensor. For a sample i and a valid index j < input_indices.sample_sizes[i], the following holds (assuming dim_to_index_in == 1): indexed_vals[i, j] == input_data[i, input_indices[i, j]]

The shape of the resulting data is:

  • indexed_vals.shape[0] == batch_size

  • indexed_vals.shape[dim_to_index_in] == max_num_indices

  • Remaining dimensions correspond to the input data

Example

In the illustration below:
  • Letters indicate data entries that are indexed in the input (and therefore appear in the output)

  • ‘-’ indicates entries where the actual values are not relevant (in the input).

  • ‘*’ indicates filler values in RaggedBatch instances.

Illustration of the batched indexing operation

Each depicted entry in the data may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that input_data has more than 2 dimensions).

Note that for input_indices, the entries are always scalar.

Also, we do not show the filler_value in the example. It is filled into the ‘*’-entries in the output.

In this case, the dim_to_index_in is 1.
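
A minimal sketch (values are illustrative; a CUDA device is assumed per the note above):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch, batched_indexing_access
>>> dev = "cuda"
>>> data = torch.arange(12.0, device=dev).reshape(4, 3)
>>> idx = RaggedBatch(torch.tensor([[2, 0, 0], [1, 0, 2], [0, 0, 0], [1, 2, 0]], device=dev),
...                   sample_sizes=torch.tensor([2, 3, 1, 2], device=dev))
>>> vals = batched_indexing_access(data, idx, filler_value=0.0)
>>> # e.g. vals[0, 0] == data[0, 2] and vals[0, 1] == data[0, 0]; vals.sample_sizes matches idx.sample_sizes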

accvlab.batching_helpers.batched_indexing_write(to_write, output_indices, to_write_into, dim_to_index_in=None)[source]

This function expects input data to be on the GPU.

Batched indexing write, i.e. writing data into the indexed location, with non-uniform indices.

Non-uniform indices means that for each sample, the indices, as well as the number of indices, vary.

Note

This function is similar to batched_inverse_indexing_access(), but instead of creating a constant tensor and filling the values in there, a to_write_into tensor is used, which may already contain values, and only the values corresponding to the indices are updated.

Note

Note that whether to_write and to_write_into are RaggedBatch or torch.Tensor instances, the indexing is performed along dim_to_index_in, which is not necessarily the non-uniform dimension of to_write or to_write_into.

Warning

This function assumes that for each sample, there are no duplicate indices in output_indices, i.e. there are no duplicates in the valid entries in: output_indices[i, 0:output_indices.sample_sizes[i]].

If this is not the case, the behavior is undefined.

Parameters:
  • to_write (Union[RaggedBatch, Tensor]) – Data which to write into the given indices.

  • output_indices (RaggedBatch) – For each sample (element along the batch dimension), the indices of entries to write to in the output. Shape: (batch_size, max_num_indices) Here, max_num_indices corresponds to the maximum number of indices over all samples.

  • to_write_into (Union[RaggedBatch, Tensor]) – Tensor or RaggedBatch to write into.

  • dim_to_index_in (Optional[int], default: None) – Dimension on which to apply the indexing. Optional, default is the non-uniform dimension of the output indices.

Returns:

Union[RaggedBatch, Tensor] – Resulting tensor or RaggedBatch instance. Corresponds to to_write_into, with the values from to_write inserted at the corresponding indices, and the original values from to_write_into everywhere else.

Example

In the illustration below:
  • Letters indicate data entries that are indexed in the input (and therefore appear in the output)

  • ‘-’ indicates entries where the actual values are not relevant (in the input).

  • ‘*’ indicates filler values in RaggedBatch instances.

  • ‘..’ indicates data which remains unchanged, i.e. is the same as in the to_write_into parameter and the output.

Illustration of the batched indexing write operation

Each depicted entry in the data may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that the data has more than 2 dimensions).

Note that for output_indices, the entries are always scalar.

In this case, the dim_to_index_in is 1.
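
A minimal sketch (values are illustrative; a CUDA device is assumed per the note above, and the expected result is derived from the description rather than taken from the library):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch, batched_indexing_write
>>> dev = "cuda"
>>> target = torch.zeros(2, 4, device=dev)
>>> idx = RaggedBatch(torch.tensor([[3, 0], [1, 0]], device=dev),
...                   sample_sizes=torch.tensor([2, 1], device=dev))
>>> vals = RaggedBatch(torch.tensor([[5.0, 6.0], [7.0, 0.0]], device=dev),
...                    sample_sizes=torch.tensor([2, 1], device=dev))
>>> out = batched_indexing_write(vals, idx, target)
>>> # expected: out[0] == [6.0, 0.0, 0.0, 5.0] and out[1] == [0.0, 7.0, 0.0, 0.0]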

accvlab.batching_helpers.batched_inverse_indexing_access(input_data, output_indices, output_num_targets, filler_value=0.0, dim_to_index_in=None)[source]

This function expects input data to be on the GPU.

Batched setting of values at given indices, with non-uniform indices.

Non-uniform indices means that for each sample, the indices, as well as the number of indices, vary.

Note

This function is similar to batched_indexing_write(), but instead of using a to_write_into tensor, a tensor with a uniform filler value is created first, and the values to set are written into that tensor.

Note

Note that whether input_data is a RaggedBatch instance or a tensor, the indexing is performed along dim_to_index_in, which is not necessarily the non-uniform dimension of input_data.

Warning

This function assumes that for each sample, there are no duplicate indices in output_indices, i.e. there are no duplicates in the valid entries in: output_indices[i, 0:output_indices.sample_sizes[i]].

If this is not the case, the behavior is undefined.

Parameters:
  • input_data (Union[RaggedBatch, Tensor]) – Data which to write into the given indices.

  • output_indices (RaggedBatch) – For each sample (element along the batch dimension), the indices of entries to write to in the output. Shape: (batch_size, max_num_indices) Here, max_num_indices corresponds to the maximum number of indices over all samples.

  • output_num_targets (int) – Size of the dimension corresponding to the indexed dimension in the output

  • filler_value (float, default: 0.0) – Filler values for the non-indexed elements in the output. Default: 0.0

  • dim_to_index_in (Optional[int], default: None) – Dimension on which to apply the indexing. Optional, default is the non-uniform dimension of the output indices.

Returns:

Tensor – Resulting tensor, containing the filled in values from the input, inserted at the corresponding indices, and the filler values everywhere else.

For each sample i and each valid index j < output_indices.sample_sizes[i], the following holds:

output[i, output_indices[i, j]] == input_data[i, j]

The shape of the resulting data is:

  • output.shape[0] == batch_size

  • output.shape[dim_to_index_in] == output_num_targets

  • Remaining dimensions correspond to the input data

Example

In the illustration below:
  • Letters indicate data entries that are indexed in the input (and therefore appear in the output)

  • ‘-’ indicates entries where the actual values are not relevant (in the input).

  • ‘*’ indicates filler values in RaggedBatch instances.

Illustration of the batched inverse indexing operation

Each depicted entry in the data may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that the data has more than 2 dimensions).

Note that for output_indices, the entries are always scalar.

In this case, the dim_to_index_in is 1.
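
A minimal sketch (values are illustrative; a CUDA device is assumed per the note above): a fresh output tensor with output_num_targets entries along the indexed dimension is created and filled.

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch, batched_inverse_indexing_access
>>> dev = "cuda"
>>> data = torch.tensor([[5.0, 6.0], [7.0, 0.0]], device=dev)
>>> idx = RaggedBatch(torch.tensor([[3, 0], [1, 0]], device=dev),
...                   sample_sizes=torch.tensor([2, 1], device=dev))
>>> out = batched_inverse_indexing_access(data, idx, output_num_targets=4, filler_value=0.0)
>>> # expected: out[0] == [6.0, 0.0, 0.0, 5.0] and out[1] == [0.0, 7.0, 0.0, 0.0]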

accvlab.batching_helpers.combine_data(data_list, other_with_same_sample_sizes=None, device=None, flatten_batch_dims=True)[source]

Combine data given as an (optionally nested) sequence of tensors to a single RaggedBatch

Nested sequences can be processed in two different ways:

  • If flatten_batch_dims is True, the sequence is flattened as if it were a single sequence. In this case, the result is a single RaggedBatch instance with a single batch dimension. The sequence of the samples in the batch is ordered as they appear in data_list when traversed in depth-first order. In this case, there are no requirements on the length of the sequences in data_list on any nesting level and the number of elements in the sequences can vary between individual elements and from one nesting level to the next. Also, the number of nesting levels can vary between individual elements.

  • If flatten_batch_dims is False, the sequence is treated as a nested sequence and the nesting levels are preserved as batch dimensions, i.e. each nesting level corresponds to one batch dimension. As the batching dimensions need to be of uniform size, the number of elements in all lists for a given nesting level needs to be identical.

    For example, the following needs to be fulfilled for the 2nd nesting level:

    len(data_list[0][0]) == len(data_list[1][0]) == ... == len(data_list[n-1][0]) ==
    len(data_list[0][1]) == len(data_list[1][1]) == ... == len(data_list[n-1][1]) ==
    ...
    len(data_list[0][m-1]) == len(data_list[1][m-1]) == ... == len(data_list[n-1][m-1])
    

    where n is the number of elements in the 1st nesting level and m is the number of elements in the 2nd nesting level.

The individual tensors contained in data_list need to match in size except for the dimension dim==0 (which will correspond to the non-uniform dimension in the resulting RaggedBatch instance).

Warning

If other_with_same_sample_sizes is provided, it is assumed that the batch shape and sample sizes are identical. If this is not the case, the behavior is undefined.

Example

In the example below, the combine_data() operation is applied to a sequence of 4 tensors, each corresponding to a single sample. As there is no nesting, a single batch dimension is created. Note that in the image below:

  • Letters indicate data entries that are valid (i.e. correspond to the actual data).

  • ‘*’ Indicates padded filler entries (i.e. invalid entries) in the data.

Illustration of the combine data operation

Each depicted element may represent a single value (corresponding to scalar data and 0 data dimensions), or itself represent a non-scalar entry (in case for one or more data dimensions).

Parameters:
  • data_list (Sequence[Union[Sequence, Tensor]]) – Sequence of tensors to combine.

  • other_with_same_sample_sizes (Optional[RaggedBatch], default: None) – Other RaggedBatch instance with the same batch size and sample sizes. This is optional and if provided, the mask and sample sizes tensors are shared between this RaggedBatch instance and the result, reducing the amount of needed memory.

  • device (Union[device, str, None], default: None) – Device on which to create the resulting RaggedBatch. If not provided, the device of the first element of data_list is used.

  • flatten_batch_dims (bool, default: True) – Whether to flatten the batch dimensions (see discussion above for details). Default is True.

Returns:

RaggedBatch – The combined data. Shape:

  • If flatten_batch_dims is True, the batch dimension is dim==0 and the non-uniform size dimension is dim==1.

  • If flatten_batch_dims is False, the batch dimensions correspond to the nesting levels of data_list. The non-uniform size dimension is dim==num_batch_dims (i.e. the dimension immediately following the batch dimensions).

  • The remaining dimensions are as in the input tensors.
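
A minimal sketch of combining per-sample tensors of different lengths (a flat list, so a single batch dimension; shapes are illustrative):

>>> import torch
>>> from accvlab.batching_helpers import combine_data
>>> samples = [torch.randn(2, 5), torch.randn(3, 5), torch.randn(1, 5), torch.randn(2, 5)]
>>> combined = combine_data(samples)
>>> combined.tensor.shape                           # padded to the largest sample along dim==1
torch.Size([4, 3, 5])
>>> combined.sample_sizes.tolist()
[2, 3, 1, 2]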

accvlab.batching_helpers.get_compact_from_named_tuple(mask, data)[source]

Get a compact version of all tensors (as RaggedBatch instances) in a named tuple as a new named tuple of the same type.

See get_compact_lists() for details on compactification and an illustration of example inputs & outputs. This function works in the same way, but using named tuples instead of plain sequences as inputs & outputs.

Parameters:
  • mask (Tensor) – Mask indicating which elements are valid (True) or not (False). Shape: (dim_size_0, dim_size_1)

  • data (NamedTuple) – Named tuple containing tensors to be compactified. For each tensor, the shape is (dim_size_0, dim_size_1, …). Note that different tensors in the tuple may have a different number of dimensions and a different size except for the first 2 dimensions, which have to correspond to the mask. The tuple may contain elements which are not tensors, and such elements will remain unchanged in the output.

Returns:

NamedTuple – Compact data, where elements containing tensors in the input are replaced by RaggedBatch instances, while elements of other types remain unchanged.

accvlab.batching_helpers.get_compact_lists(mask, data)[source]

For a list of data tensors and a mask indicating which entries are valid, get a compactified version of the data.

Compactification is performed along the second dimension (i.e. dim==1). Compactification in this context means that for this dimension:

  • The size of the dimension is reduced so that it exactly fits the maximum number of valid elements over all samples, i.e. the size of the dimension is max(sum(mask, dim==1)).

  • The data is converted to RaggedBatch instances for further processing (see documentation of RaggedBatch for details of the format)

Parameters:
  • mask (Tensor) – Mask indicating which elements are valid (True) and which are not (False). Shape: (dim_size_0, dim_size_1)

  • data (Sequence[Union[Tensor, Any]]) – Sequence (e.g. list) of tensors to be compactified. For each tensor, the shape is (dim_size_0, dim_size_1, …). Note that different tensors in the sequence may have a different number of dimensions and a different size except for the first 2 dimensions, which have to correspond to the mask. The sequence may contain elements which are not tensors, and such elements will remain unchanged in the output.

Returns:

List[Union[RaggedBatch, Any]] – Compact data, where tensor elements of the input are replaced by RaggedBatch instances, while other elements remain unchanged.

Example

In the illustration below:
  • Letters indicate data entries that are indexed in the input (and therefore appear in the output)

  • ‘-’ indicates entries where the actual values are not relevant (in the input).

  • ‘*’ indicates filler values in RaggedBatch instances.

The illustration shows a single input entry data[d], where isinstance(data[d], torch.Tensor) == True. Note that for non-tensor elements of data, the data is not changed.

Illustration of the compactification operation

Each depicted element in the data may represent a single value (in case of 2D tensors), or itself be a non-scalar entry (in case that the data has more than 2 dimensions).
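
A minimal sketch (shapes are illustrative; note that non-tensor elements pass through unchanged):

>>> import torch
>>> from accvlab.batching_helpers import get_compact_lists
>>> mask = torch.tensor([[True, False, True, False],
...                      [True, True, False, False]])            # (2, 4)
>>> features = torch.randn(2, 4, 8)                              # first two dims match the mask
>>> compact = get_compact_lists(mask, [features, "metadata"])
>>> compact[0].max_sample_size                                    # reduced to the max number of valid entries
2
>>> compact[1]
'metadata'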

accvlab.batching_helpers.get_indices_from_mask(mask)[source]

This function expects input data to be on the GPU.

Get the indices from a mask.

For each sample, the indices correspond to the elements in the mask that are True.

This functionality is useful, for example, when boolean indexing needs to be applied multiple times for different data with the same mask, as indexing using numerical indices (via batched_indexing_access()) is more efficient than boolean indexing (via batched_bool_indexing()). Note that for a single application of boolean indexing, the batched_bool_indexing() function is more efficient.

Note

Only 2D masks (batch_size, num_elements) are supported.

See also

batched_mask_from_indices(), batched_indexing_access(), batched_bool_indexing()

Parameters:

mask (Union[Tensor, RaggedBatch]) – The mask to get the indices from.

Returns:

RaggedBatch – The indices from the mask.

Example

In the illustration below, ‘*’ indicates invalid indices, i.e. padding to make the tensor uniform for samples where the number of indices is smaller than max_num_indices.

Illustration of the indices from mask operation
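
A minimal sketch (a 2D mask as required; a CUDA device is assumed per the note above):

>>> import torch
>>> from accvlab.batching_helpers import get_indices_from_mask
>>> mask = torch.tensor([[True, False, True],
...                      [False, True, False]], device="cuda")
>>> idx = get_indices_from_mask(mask)
>>> idx.sample_sizes.tolist()
[2, 1]
>>> # valid indices for sample 0 are [0, 2], for sample 1 they are [1]
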
accvlab.batching_helpers.get_mask_from_indices(mask_num_targets, indices)[source]

This function expects input data to be on the GPU.

Get a mask from indices, where the indices indicate which elements in the mask should be True.

The indices for each sample define the True values in the mask for that sample (i.e. the corresponding row in the mask).

For each sample i, the operation performed by this function is equivalent to:

mask[i, indices.tensor[i, :indices.sample_sizes[i]]] = True

Please also see the documentation of RaggedBatch for more details on the format of the indices, including the tensor and sample_sizes attributes.

Note

This function is not the inverse of get_indices_from_mask(), as the index order is not preserved when converting from indices to a mask.

Parameters:
  • mask_num_targets (int) – The number of targets in the mask, i.e. the mask.shape[1] to use

  • indices (RaggedBatch) – For each sample (element along the batch dimension), the indices of elements to set to True. Shape: (batch_size, max_num_indices)

Returns:

Resulting mask. Shape: (batch_size, num_targets_in_mask)

Example

In the illustration below, ‘*’ indicates invalid indices, i.e. padding to make the tensor uniform for samples where the number of indices is smaller than max_num_indices. Note that the index order does not matter for the resulting mask.

Illustration of the mask from indices operation
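
A minimal sketch (indices are illustrative; a CUDA device is assumed per the note above):

>>> import torch
>>> from accvlab.batching_helpers import RaggedBatch, get_mask_from_indices
>>> idx = RaggedBatch(torch.tensor([[0, 2, 0], [1, 0, 0]], device="cuda"),
...                   sample_sizes=torch.tensor([2, 1], device="cuda"))
>>> mask = get_mask_from_indices(mask_num_targets=4, indices=idx)
>>> mask.tolist()
[[True, False, True, False], [False, True, False, False]]
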
accvlab.batching_helpers.squeeze_except_batch_and_sample(data)[source]

Squeeze the data except the batch dimension and the non-uniform dimension representing the sample size.

For tensors, the batch dimension is always dim==0 and the non-uniform dimension is assumed to be dim==1. For ragged batches, the batch dimensions are the first data.num_batch_dims dimensions and the non-uniform dimension is the data.non_uniform_dim dimension.

This function is designed to preserve the batch and non-uniform dimensions, which have a special meaning, while allowing to squeeze away other dimensions.

Important

Note that as a result of the squeezing, the non-uniform dimension may change to a different dimension. This happens if there are any dimensions before the non-uniform dimension which are squeezed away. For example:

>>> example_batch.shape
torch.Size([4, 1, 1, 3, 4])
>>> example_batch.num_batch_dims
2
>>> example_batch.non_uniform_dim
3
>>> example_batch_squeezed = squeeze_except_batch_and_sample(example_batch)
>>> example_batch_squeezed.shape
torch.Size([4, 1, 3, 4])
>>> example_batch_squeezed.non_uniform_dim
2

Note that the non-uniform dimension is now dim==2 instead of dim==3. Also, as dim==1 is one of the batch dimensions, it is not squeezed away. The same would be true for any other batch dimension or the non-uniform dimension.

Parameters:

data (Union[Tensor, RaggedBatch]) – Data to be squeezed

Returns:

Union[Tensor, RaggedBatch] – Squeezed data

accvlab.batching_helpers.sum_over_targets(data)[source]

Sum over the non-uniform dimension, considering only the valid entries.

The dimension to sum over is data.non_uniform_dim.

Parameters:

data (RaggedBatch) – Data to sum

Returns:

Tensor – Tensor containing per-sample sums
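
A minimal sketch (batch as in the constructor sketch above; only valid entries contribute to each per-sample sum):

>>> from accvlab.batching_helpers import sum_over_targets
>>> sums = sum_over_targets(batch)                  # one sum per sample for scalar per-entry data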