Example

This example shows how to use the batching-helpers package to implement an object detection loss, including:

- Handling of per-sample (i.e. non-batched) input data
- Matching between predictions and ground truth (GT) objects as a prerequisite for the actual loss computation
- Loss computation of different types:
  - Based on direct object-to-object comparisons (in this example: classification and bounding box regression losses)
  - Computed for all predictions, but utilizing the matching results (in this example: existence loss)

The implementation of the loss computation is fully shown in the code snippets in this document. The complete implementation (including helpers providing example input data to actually run the code) can be found in the example folder of the batching-helpers package.

Important

You can run the example using the script packages/batching_helpers/example/example.py.

Overview

The loss computation implementation consists of three main steps:

1. Conversion of ground truth per-sample data into RaggedBatch instances
2. Matching predictions to ground truth objects
3. Loss computation

The following code snippet demonstrates this high-level approach. Step (1) is fully covered here, while steps (2) and (3) are detailed in subsequent sections.

packages/batching_helpers/example/example.py

# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch

# Import the batching-helpers package
import accvlab.batching_helpers as batching_helpers

# Import the matcher and loss computation modules (parts of the example implementation)
from matcher import Matcher
from loss_computation import LossComputation

# Import the example input data (helper for running the example)
import input_data


def loss_computation_main(rects_gt, classes_gt, rects_pred, classes_pred, pred_existence, weights_gt):

    # ===== Step 1: Conversion of the GT per-sample data to RaggedBatch instances =====

    # @NOTE
    # Typically, the ground truth (GT) is provided as a list containing per-sample GT data as individual
    # tensors. Here, this format is converted into RaggedBatch objects containing the whole batch.
    # Note that except for the first call, an `other_with_same_sample_sizes` parameter is present. This
    # is optional, but saves memory by re-using the `mask` and `sample_sizes` (see `RaggedBatch`
    # documentation) of the first created instance. This is possible as all the GT data refers to the same
    # objects, so that for a given sample, the number of objects is the same for the different types of GT
    # data.
    rects_gt_compact = batching_helpers.combine_data(rects_gt)
    classes_gt_compact = batching_helpers.combine_data(
        classes_gt, other_with_same_sample_sizes=rects_gt_compact
    )
    weights_gt_compact = batching_helpers.combine_data(
        weights_gt, other_with_same_sample_sizes=rects_gt_compact
    )

    # ===== Step 2: Matching of the predictions to the GT objects =====

    # @NOTE
    # Get the matches for the individual samples. `matched_gt_indices` and `matched_pred_indices` contain
    # indices for matches for the GT and predictions, respectively. As each sample contains a different number
    # of matches, `RaggedBatch` instances are used to store the indices for both the GT and the predictions.
    matcher = Matcher()
    matched_gt_indices, matched_pred_indices = matcher(
        rects_gt_compact, classes_gt_compact, rects_pred, classes_pred
    )

    # ===== Step 3: The actual loss computation =====

    # @NOTE
    # Compute the actual loss given GT and prediction data, as well as the matches established by the matcher.
    loss_comp = LossComputation()
    per_sample_loss = loss_comp(
        rects_gt_compact,
        classes_gt_compact,
        rects_pred,
        classes_pred,
        pred_existence,
        weights_gt_compact,
        matched_gt_indices,
        matched_pred_indices,
    )

    # @NOTE
    # The loss computation returns per-sample losses, and they can be used as such after the computation
    # (e.g. logged, weighted, etc.). Here, we just sum the per-sample losses to obtain the final loss.
    final_loss = torch.sum(per_sample_loss)
    return final_loss


if __name__ == "__main__":
    loss = loss_computation_main(
        input_data.rects_gt,
        input_data.classes_gt,
        input_data.rects_pred,
        input_data.classes_pred_onehot,
        input_data.pred_existence,
        input_data.weights_gt,
    )
    print(f"Loss: {loss}")
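
To make Step 1 more concrete, the following is a minimal plain-PyTorch sketch of what combining per-sample data into a padded batch conceptually involves: a padded data tensor, a validity mask, and the per-sample sizes. The helper `pad_samples` is hypothetical and only illustrates the idea; it is not the implementation of `combine_data()` or `RaggedBatch`.

```python
import torch


def pad_samples(samples):
    """Stack variable-length per-sample tensors into one zero-padded tensor (illustrative helper)."""
    sample_sizes = torch.tensor([s.shape[0] for s in samples], dtype=torch.int64)
    max_size = int(sample_sizes.max())
    trailing_shape = samples[0].shape[1:]
    padded = torch.zeros((len(samples), max_size, *trailing_shape), dtype=samples[0].dtype)
    for i, s in enumerate(samples):
        # Valid data comes first, filler values (zeros here) follow
        padded[i, : s.shape[0]] = s
    # mask[i, j] is True where padded[i, j] holds valid data
    mask = torch.arange(max_size).unsqueeze(0) < sample_sizes.unsqueeze(1)
    return padded, mask, sample_sizes


# Three samples with 3, 1 and 2 GT rectangles, respectively
rects_gt = [torch.rand(3, 4), torch.rand(1, 4), torch.rand(2, 4)]
padded, mask, sample_sizes = pad_samples(rects_gt)
```

Because all GT data types of a sample describe the same objects, the mask and the sample sizes computed once are valid for all of them, which is the motivation for re-using them as described in the comments above.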

Matcher

Efficient Implementation Approach

The matcher implementation is designed to be efficient on the GPU. The matching consists of two steps: the cost matrix computation and the Hungarian matching based on the costs. As the matching itself is performed on the CPU (and remains non-batched), performance gains are mainly achieved through the batched cost matrix computation.

The cost matrices are structured as follows: For each sample, the cost matrix denotes the cost of each possible match between a prediction and a GT object. For example, for a match of prediction i and a GT object j, the cost is cost_matrix[i, j]. This means that for each sample, the cost matrix is of size (num_predictions, num_gt_objects), and each element is computed from one prediction and one GT object.

Non-batched Approach

The following figure shows typical non-batched cost matrix computation:

In the illustration, the different colors represent the individual samples, and each sample corresponds to one computation iteration. Note that:

- The GT data is of variable size (and therefore stored as a list of per-sample tensors, not a single tensor)
- Due to the variable GT size, the sizes of the cost matrices are also variable in the dimension iterating over the GT objects (dim==1; horizontal axis in the figure)

Due to this variable size, batched implementation is challenging and in practice, the cost matrix computation is often implemented in a non-batched manner.
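
A minimal sketch of such a non-batched computation is shown below. It assumes a simple L1 distance between box coordinates as the cost (the hypothetical helper `l1_cost` stands in for the actual cost functions of the example) and uses random data.

```python
import torch


def l1_cost(preds, gts):
    # preds: (num_preds, 4), gts: (num_gt, 4) -> cost matrix of shape (num_preds, num_gt)
    return (preds.unsqueeze(1) - gts.unsqueeze(0)).abs().sum(dim=-1)


torch.manual_seed(0)
# Predictions have a fixed size: 3 samples with 5 predictions each
rects_pred = torch.rand(3, 5, 4)
# GT has a variable size per sample, so it is a list of tensors
rects_gt = [torch.rand(2, 4), torch.rand(4, 4), torch.rand(1, 4)]

# One iteration (and one set of small GPU operations) per sample
cost_matrices = [l1_cost(p, g) for p, g in zip(rects_pred, rects_gt)]
shapes = [tuple(c.shape) for c in cost_matrices]
```

Each resulting matrix has the shape `(num_predictions, num_gt_objects)` of its sample, so the matrices cannot be stacked into a single tensor without padding.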

Batched Approach

The following figure illustrates how the matching can be implemented in a batched manner using the batching-helpers package:

Gray elements represent filler values that allow uniform batch processing while preserving variable ground truth sizes. The RaggedBatch class as well as the available helper functions handle these values automatically. Please refer to the API documentation for details.

The key implementation principles to achieve batched processing are:

- Ground truth data for all samples is stored in a single RaggedBatch instance for batched processing with variable sizes (as shown in the figure)
- Cost matrices also use the RaggedBatch format, with the non-uniform dimension being the dimension iterating over the ground truth elements (as shown in the figure)
- Handling of the non-uniform size:
  - During the cost matrix computation, uniform sample sizes are assumed (i.e. no differentiation between data and filler values), enabling the use of standard PyTorch operations or existing custom implementations of batched cost functions
  - After the computation, the results are wrapped in a RaggedBatch instance, which enables easy handling of the filler values. Here, the sample sizes of the input GT data can be re-used, so that they do not need to be set up manually.

Note that this approach means that computations are also performed for the filler values, which leads to some overhead. However, this overhead is typically much smaller than the gains of the batched implementation, which reduces the CPU (Python) overhead and improves the GPU utilization for the individual operations.
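
The principle can be sketched in plain PyTorch as follows. This is an illustration under simplifying assumptions (an L1 box distance as the cost, zero padding via `pad_sequence`), not the package's implementation: the cost is computed for the padded batch in one operation, and the filler columns are dropped afterwards using the known sample sizes.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)
rects_pred = torch.rand(3, 5, 4)
rects_gt = [torch.rand(2, 4), torch.rand(4, 4), torch.rand(1, 4)]
sample_sizes = [g.shape[0] for g in rects_gt]

# Pad the GT to a uniform size; filler values (zeros) are placed after the valid data
gt_padded = pad_sequence(rects_gt, batch_first=True)  # shape: (3, 4, 4)

# One batched operation for all samples, fillers included:
# (3, 5, 1, 4) - (3, 1, 4, 4) -> (3, 5, 4, 4) -> sum over coordinates -> (3, 5, 4)
cost_padded = (rects_pred.unsqueeze(2) - gt_padded.unsqueeze(1)).abs().sum(dim=-1)

# The filler columns are located after the valid ones and can be dropped per sample
cost_valid = [cost_padded[i, :, :n] for i, n in enumerate(sample_sizes)]

# Reference: per-sample loop over the unpadded data
cost_loop = [
    (p.unsqueeze(1) - g.unsqueeze(0)).abs().sum(dim=-1) for p, g in zip(rects_pred, rects_gt)
]
matches_reference = all(torch.allclose(a, b) for a, b in zip(cost_valid, cost_loop))
```

The valid parts of the batched result agree with the per-sample computation; the additional work is limited to the filler columns.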

Implementation

The matcher implementation is shown in the following snippet, with the core functionality residing in the __call__() method. The matcher employs various cost functions. These cost functions do not explicitly handle non-uniform batches, instead assuming a fixed size for the individual samples. As discussed above, this means that existing batched implementations of such cost functions can be readily re-used.

The handling of non-uniform batches in the resulting cost matrices is achieved by wrapping the results as RaggedBatch instances, where the number of valid GT objects is known from the input GT data (see the comments in __call__() for implementation specifics).

Note: The core matching operation (scipy.optimize.linear_sum_assignment()) is performed on the CPU and remains non-batched. The batching-helpers package facilitates integration of non-batched operations through split() and combine_data() functions.
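
As a small illustration of the per-sample matching step, the following sketch applies `scipy.optimize.linear_sum_assignment()` to a toy cost matrix (the values are made up for the example). Rows correspond to predictions and columns to GT objects, so the first returned array contains prediction indices and the second one GT indices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix for one sample: 3 predictions (rows) x 2 GT objects (columns)
cost = np.array(
    [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.6],
    ]
)

# Row indices (predictions) and column indices (GT objects) of the optimal assignment
pred_idx, gt_idx = linear_sum_assignment(cost)
total_cost = cost[pred_idx, gt_idx].sum()

print(pred_idx, gt_idx)  # [0 1] [1 0]: prediction 0 <-> GT 1, prediction 1 <-> GT 0
```

Prediction 2 remains unmatched here, as there are more predictions than GT objects. The number of matches therefore depends on the sample, which is why the matched indices are stored as `RaggedBatch` instances.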

packages/batching_helpers/example/matcher.py

import torch
import accvlab.batching_helpers as batching_helpers
from scipy.optimize import linear_sum_assignment


class Matcher:

    def __call__(self, rects_gt, classes_gt, rects_pred, classes_pred):
        # @NOTE
        # Get the cost matrices denoting the cost for each GT to prediction combination. Note that as the
        # samples in the GT data are padded to uniform size (see documentation of `RaggedBatch.tensor`), the
        # same will be true for the matrices.
        batch_size = rects_gt.shape[0]
        iou_cost_matrices = self._iou_cost_func(rects_gt.tensor, rects_pred)
        class_cost_matrices = self._class_l1_cost_func_gt_labels(classes_gt.tensor, classes_pred)
        total_cost_matrices = iou_cost_matrices + class_cost_matrices

        # @NOTE
        # During cost matrix computation, we assume uniform sample sizes (and use filler values). However, the
        # valid cost matrices are non-uniform in size. Along `dim==2` (iterating over the GT objects), the
        # sample sizes correspond to the sample sizes of the GT inputs (there, along `dim==1`). Create a
        # RaggedBatch containing the matrices. Note that this will correctly handle the filler regions in the
        # matrices, as they exactly correspond to the format used in `RaggedBatch.tensor`. This is as follows:
        #   - In the input data to the matrix computations originally from `RaggedBatch` instances, the filler
        #     values are in the correct format (i.e. always after the valid data)
        #   - The matrix computations do not perform any permutations of the data, so that the filler values
        #     remain in the same locations (but along a different dimension)
        total_cost_matrices = classes_gt.create_with_sample_sizes_like_self(
            total_cost_matrices, non_uniform_dim=2
        )

        # @NOTE
        # The Hungarian matching is done on the CPU one sample at a time. Therefore, move the data to the CPU
        # and split RaggedBatch instances containing the cost matrices into individual samples. Note that
        # `split()` removes the filler value padding, so that the valid matrices with correct sample sizes are
        # obtained.
        device_cpu = torch.device("cpu")
        total_cost_matrices_cpu = total_cost_matrices.to_device(device_cpu)
        total_cost_matrices_list = total_cost_matrices_cpu.split()

        # @NOTE: Perform matching for each sample
        matched_gt_index_list = [None] * batch_size
        matched_pred_index_list = [None] * batch_size
        for i, cost_mat in enumerate(total_cost_matrices_list):
            m_pred, m_gt = linear_sum_assignment(cost_mat)
            matched_gt_index_list[i] = torch.tensor(m_gt, dtype=torch.int64, device=device_cpu)
            matched_pred_index_list[i] = torch.tensor(m_pred, dtype=torch.int64, device=device_cpu)

        # @NOTE
        # Combine resulting indices for the individual samples into RaggedBatch instances representing the
        # whole batch.
        matched_gt_indices = batching_helpers.combine_data(matched_gt_index_list)
        matched_pred_indices = batching_helpers.combine_data(
            matched_pred_index_list, other_with_same_sample_sizes=matched_gt_indices
        )
        # @NOTE: Move results to the GPU
        matched_gt_indices = matched_gt_indices.to_device(device=rects_gt.device)
        matched_pred_indices = matched_pred_indices.to_device(device=rects_gt.device)

        return matched_gt_indices, matched_pred_indices

    # Example batched cost function for the matcher. It is used in the example, but the implementation
    # of this function is not the focus of the example.
    @staticmethod
    def _iou_cost_func(rects_gt, rects_pred, eps=1e-6):

        # With broadcasting, using the `_ext` variants will lead to pair-wise results for all possible
        # combinations
        rects_gt_ext = rects_gt.unsqueeze(1)
        rects_pred_ext = rects_pred.unsqueeze(2)

        areas_gt = torch.prod(rects_gt_ext[..., 2:4] - rects_gt_ext[..., 0:2], axis=-1)
        areas_pred = torch.prod(rects_pred_ext[..., 2:4] - rects_pred_ext[..., 0:2], axis=-1)

        rects_gt_ul = rects_gt_ext[..., 0:2]
        rects_gt_lr = rects_gt_ext[..., 2:4]
        rects_pred_ul = rects_pred_ext[..., 0:2]
        rects_pred_lr = rects_pred_ext[..., 2:4]

        intersections_ul = torch.max(rects_gt_ul, rects_pred_ul)
        intersections_lr = torch.min(rects_gt_lr, rects_pred_lr)
        sizes_intersections = intersections_lr - intersections_ul
        sizes_intersections[sizes_intersections < 0.0] = 0.0
        areas_intersections = torch.prod(sizes_intersections, axis=-1)

        areas_union = areas_gt + areas_pred - areas_intersections
        areas_union[areas_union < eps] = eps

        res = 1.0 - areas_intersections / areas_union

        return res

    # Example batched cost function for the matcher. It is used in the example, but the implementation
    # of this function is not the focus of the example.
    @staticmethod
    def _class_l1_cost_func_gt_labels(classes_gt, classes_pred_one_hot):

        # Internal helper function
        def class_l1_cost_func_gt_one_hot(classes_gt_one_hot, classes_pred_one_hot):
            prod = torch.einsum('bik,bjk->bij', classes_pred_one_hot, classes_gt_one_hot)
            cost = 1.0 - prod
            return cost

        # Note: This part of the cost computation (the conversion of the GT labels to one-hot vectors)
        # is not performed in a batched manner. However, this is not the focus of the example and in an
        # actual application, it can be implemented differently (e.g. as a custom extension).
        num_classes = classes_pred_one_hot.shape[-1]
        batch_size = classes_gt.shape[0]
        res = [None] * batch_size
        for s, gt in enumerate(classes_gt):
            res_s = torch.zeros((gt.shape[0], num_classes), dtype=torch.float32, device=gt.device)
            for i, cls in enumerate(gt):
                res_s[i, cls] = 1.0
            res[s] = res_s
        classes_gt_one_hot = torch.stack(res, dim=0)
        # end of non_batched part
        cost = class_l1_cost_func_gt_one_hot(classes_gt_one_hot, classes_pred_one_hot)
        return cost

Loss Computation

Efficient Implementation Approach

Similar to the matcher, the efficiency is improved by enabling batching where it was previously challenging to achieve. For most loss types, the loss is computed by an element-wise (i.e. object for object) comparison between the predictions and the GT objects. Here, the corresponding (according to the matching) GT and prediction objects need to be extracted first, followed by the actual loss computation.

Note that the existence loss is computed differently, as it is not based on a direct object-to-object comparison. The existence loss is not discussed here, but it also benefits from a batched implementation in a similar way. It is part of the example implementation, so please refer to the code snippet further below for details.

Non-batched Approach

The loss computation consists of two steps. First, the corresponding objects for the predictions and the GT are extracted.

Ground truth object extraction at matched indices:

Note that here, both the GT objects and the indices are lists of tensors. Similar to the matcher, different samples are indicated by different colors, and are typically processed sequentially, one sample at a time.

Similarly, the predictions at the matched indices are extracted as follows:

This step is very similar to the GT object processing shown above. A notable difference is that the predictions are stored as a single tensor, as the predictions are outputs of the trained model and their number is typically fixed. However, as the number of matches varies between samples, the indices are stored as a list of tensors, preventing the use of a single tensor in the output.

Finally, the loss is computed by comparing the predictions and the GT objects.
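
A minimal sketch of this non-batched procedure is given below. The data and the L1 loss are made up for illustration; the point is that both the extraction and the loss are evaluated one sample at a time.

```python
import torch

# GT boxes: variable number of objects per sample -> list of tensors
gt_boxes = [torch.arange(8.0).reshape(2, 4), torch.arange(12.0).reshape(3, 4)]
# Predictions: fixed number (5) per sample -> single tensor
preds = torch.arange(2 * 5 * 4, dtype=torch.float32).reshape(2, 5, 4)

# Matched indices: variable number of matches per sample -> lists of tensors
matched_gt_idx = [torch.tensor([1, 0]), torch.tensor([2, 0, 1])]
matched_pred_idx = [torch.tensor([0, 3]), torch.tensor([1, 2, 4])]

# Extraction and loss computation are done sequentially for each sample
gt_matched = [g[i] for g, i in zip(gt_boxes, matched_gt_idx)]
pred_matched = [p[i] for p, i in zip(preds, matched_pred_idx)]
per_sample_loss = torch.stack(
    [(g - p).abs().sum() for g, p in zip(gt_matched, pred_matched)]
)
```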

Batched Approach

Similarly to the matcher, the loss computation is done in a batched manner by using the RaggedBatch format.

The extraction of the GT objects is done as follows:

Similarly, the predictions are extracted as follows:

Finally, the loss is computed by comparing the predictions and the GT objects.

Note that all operations are performed in a batched manner. For the indexing operation, the function batched_indexing_access() is used. Similar to the matcher, we also process filler values in the loss function(s), which leads to some overhead. However, this is typically far outweighed by the performance gains of the batched implementation.
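
The idea behind a batched indexing access can be sketched with `torch.gather`. This is an illustrative plain-PyTorch stand-in and not the implementation of `batched_indexing_access()`; the helper `gather_objects` and the use of index 0 as the filler index are assumptions made for this example.

```python
import torch


def gather_objects(data, indices):
    # data: (batch, num_objects, feat); indices: (batch, max_num_matches), padded with fillers
    idx = indices.unsqueeze(-1).expand(-1, -1, data.shape[-1])
    # out[b, i, f] = data[b, indices[b, i], f]
    return torch.gather(data, 1, idx)


preds = torch.arange(2 * 5 * 4, dtype=torch.float32).reshape(2, 5, 4)

# Sample 0 has 2 matches, sample 1 has 3; the last index of sample 0 is a filler
matched_pred_idx = torch.tensor([[0, 3, 0], [1, 2, 4]])
num_matches = torch.tensor([2, 3])

# Single batched operation; the filler position contains data that must be ignored later
pred_matched = gather_objects(preds, matched_pred_idx)  # shape: (2, 3, 4)
```

The valid entries of `pred_matched` are identical to the results of per-sample indexing. The filler entry of sample 0 contains arbitrary (here: duplicated) data, which is why the sample sizes need to be tracked alongside the tensor.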

The approach discussed here is the one used, for example, by the classification and bounding box regression losses in the example implementation. The existence loss follows a different approach, but the same principles apply.

Implementation

The loss function takes two key inputs:

- Ground truth objects and predictions (same as matcher)
- Matching results mapping predictions to corresponding ground truth objects

Loss functions operate on batched data assuming uniform sample sizes (similar to the cost functions employed by the matcher), allowing direct reuse of existing batched implementations. See the __call__() method comments for implementation details.
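
Because the loss functions also process filler entries, reductions over the objects have to ignore these entries. The following plain-PyTorch sketch illustrates why; `masked_sum` and `masked_mean` are hypothetical helpers written for this illustration and are not the package's `sum_over_targets()` and `average_over_targets()` functions.

```python
import torch


def masked_sum(values, sample_sizes):
    # mask[b, i] is True for the first sample_sizes[b] (i.e. valid) entries of sample b
    mask = torch.arange(values.shape[1]).unsqueeze(0) < sample_sizes.unsqueeze(1)
    # torch.where() (rather than multiplying by the mask) also discards NaN fillers
    return torch.where(mask, values, torch.zeros_like(values)).sum(dim=1)


def masked_mean(values, sample_sizes):
    # Divide by the number of valid entries, not by the padded size.
    # Assumption for this sketch: empty samples yield a mean of 0.
    return masked_sum(values, sample_sizes) / sample_sizes.clamp(min=1)


# Per-object losses; sample 0 has 2 valid entries followed by a NaN filler
per_object_loss = torch.tensor([[1.0, 2.0, float("nan")], [0.5, 0.5, 2.0]])
sample_sizes = torch.tensor([2, 3])

naive_sum = per_object_loss.sum(dim=1)  # NaN for sample 0
valid_sum = masked_sum(per_object_loss, sample_sizes)  # tensor([3., 3.])
valid_mean = masked_mean(per_object_loss, sample_sizes)  # tensor([1.5, 1.0])
```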

packages/batching_helpers/example/loss_computation.py

import torch
import accvlab.batching_helpers as batching_helpers


class LossComputation:

    def __call__(
        self,
        bboxes_gt,
        classes_gt,
        bboxes_pred,
        classes_pred,
        existence_pred,
        weights_gt,
        matches_gt,
        matches_pred,
    ):
        # @NOTE
        # Extract matched ground truth and prediction data using the indices from matching.
        # This creates element-wise correspondences between GT and prediction objects,
        # enabling direct comparison in subsequent loss computations.
        # See `batching_helpers.batched_indexing_access()` documentation for details.
        cls_gt_matched = batching_helpers.batched_indexing_access(classes_gt, matches_gt).to_dtype(
            torch.int64
        )
        cls_pred_matched = batching_helpers.batched_indexing_access(classes_pred, matches_pred)
        bbxs_gt_matched = batching_helpers.batched_indexing_access(bboxes_gt, matches_gt)
        bbxs_pred_matched = batching_helpers.batched_indexing_access(bboxes_pred, matches_pred)
        weights_matched = batching_helpers.batched_indexing_access(weights_gt, matches_gt)

        # @NOTE
        # Compute (per-object) losses. Note that this is a batched operation and furthermore, that the
        # loss functions themselves are not specifically implemented for non-uniform batches and do not
        # distinguish between actual objects and filler entries in the data. This means that
        # in a real use-case, already available batched loss functions can be readily re-used.
        #
        # Also, please note that the loss functions do not reduce over the individual objects/targets.
        # This enables us to wrap the per-object losses as `RaggedBatch` instances and use the
        # `RaggedBatch` and `batching-helpers` functionality to handle the non-uniform sample sizes (e.g.
        # when summing/averaging over the valid entries only).
        #
        # Note that other ways of handling the padded entries are also possible if the loss functions do
        # reduce over the objects. One possible way is to provide appropriate (0.0) weights for the padded
        # entries (however, be cautious of potential NaN values when using this approach).
        class_per_object_loss_data = self._per_object_class_l1_loss_labels_gt(
            cls_gt_matched.tensor, cls_pred_matched.tensor, weights_matched.tensor
        )
        bbox_per_object_loss_data = self._per_object_bbox_overlap_loss(
            bbxs_gt_matched.tensor, bbxs_pred_matched.tensor, weights_matched.tensor
        )

        # @NOTE
        # Wrap the per-object losses as `RaggedBatch` instances. Similarly to the cost matrices in the
        # matcher, this can be done as the filler elements in the loss tensors are located where the
        # `RaggedBatch` implementation expects them (as the filler locations in the loss computation inputs
        # were defined by the `RaggedBatch` instances containing the input data, and no permutations of
        # objects are performed in the loss computation).
        class_per_object_loss = cls_gt_matched.create_with_sample_sizes_like_self(
            class_per_object_loss_data, non_uniform_dim=1
        )
        bbox_per_object_loss = bbxs_gt_matched.create_with_sample_sizes_like_self(
            bbox_per_object_loss_data, non_uniform_dim=1
        )

        # @NOTE
        # Sum up loss for the individual objects. As the loss functions do not explicitly handle the padded
        # entries, the loss computation is also performed for those. This means that the filler entries may
        # contain non-zero values (including `NaN`). Therefore, the filler values would potentially influence
        # the sum if taken into consideration. This means we cannot use `torch.sum()` directly. Instead, we
        # use the `sum_over_targets()` function provided by the `batching-helpers` package.
        class_loss = batching_helpers.sum_over_targets(class_per_object_loss)
        bbox_loss = batching_helpers.sum_over_targets(bbox_per_object_loss)

        # @NOTE
        # Compute existence loss next. This loss is different from the other losses in that the computation is
        # done for all predictions, not only the matched ones.

        # @NOTE
        # First, create a mask which is `True` for existing (matched) targets and `False` for non-existent
        # ones. The mask is created from the indices of the matched predictions (also see the
        # `batching_helpers.get_mask_from_indices()` documentation).
        existence_mask = batching_helpers.get_mask_from_indices(existence_pred.shape[1], matches_pred)

        # @NOTE
        # Additionally, compute the overlap (`1.0 - bbox_per_object_loss`) and use it as a weight in the
        # existence loss (in combination with the weights from `weights_gt`) as follows:
        #   - Use the so computed weights directly for the matched objects
        #   - Compute average value and use it for the non-matched objects
        # In addition, apply a compensation factor between existing and non-existing objects to the
        # non-matched objects in order to account for the imbalance.
        #
        # To obtain the overall weights used for all predictions, the following steps are performed:
        # 1. Compute the overlap weights for the matched objects (from `bbox_per_object_loss`)
        # 2. Combine the overlap weights with `weights_matched` (which contains the values from `weights_gt`
        #    for the matched objects) to obtain `existence_weights_matched`.
        # 3. Map the resulting `existence_weights_matched` back to all predictions & also set the weights for
        #    non-existent (i.e. non-matched) predictions in the process. This is done as follows:
        #   a) First compute the per-sample mean values of `existence_weights_matched` (averaging over the
        #      existing objects) to obtain `weights_means`.
        #   b) Then, compute per-sample `imbalance_factors` compensating for the imbalance between existing
        #      and non-existing objects.
        #   c) Multiply the `weights_means` with `imbalance_factors` to obtain `weights_mean_adjusted`.
        #   d) Initialize `existence_weights_preds` (which contains the weights for all predictions and is of
        #      corresponding shape) with the values from `weights_mean_adjusted`. These initial values are the
        #      weights for the non-matched predictions.
        #   e) Write the values from `existence_weights_matched` into `existence_weights_preds` using
        #      `batching_helpers.batched_indexing_write()` for the matched predictions (i.e. use the weights
        #      in `existence_weights_matched` for those), while leaving the other values (i.e. non-matched)
        #      unchanged.
        #
        # The points above are implemented as follows:

        # @NOTE
        # 1. Compute the overlap weights for the matched bboxes (from `bbox_per_object_loss`).
        #
        # Note the use of the `apply()` convenience method to apply a function to the data tensor (i.e.
        # `tensor`) of the `RaggedBatch` instance. The line:
        #   >>> overlap_weights_matched = bbox_per_object_loss.apply(lambda tensor: 1.0 - tensor)
        # is equivalent to:
        #   >>> tensor = bbox_per_object_loss.tensor
        #   >>> tensor = 1.0 - tensor
        #   >>> overlap_weights_matched = bbox_per_object_loss.create_with_sample_sizes_like_self(tensor)
        # Note that the `apply()` method returns a new `RaggedBatch` instance. Also, the passed function
        # may accept more than one argument, in which case `sample_sizes` and `mask` are also passed to
        # the function (but should not be modified). Please refer to the documentation of
        # `RaggedBatch.apply()` for more details.
        overlap_weights_matched = bbox_per_object_loss.apply(lambda tensor: 1.0 - tensor)

        # @NOTE
        # 2. Combine the overlap weights with `weights_matched` (which contains the values from `weights_gt`
        #    for the matched objects).
        #
        # Note that here, data tensors of two `RaggedBatch` instances are processed in the lambda function.
        # As both `RaggedBatch` instances represent the same sample sizes and non-uniform dimension, it does
        # not matter which one calls the `apply()` method and for which one the data tensor is accessed as
        # `.tensor`.
        existence_weights_matched = weights_matched.apply(
            lambda tensor: tensor * overlap_weights_matched.tensor
        )

        # @NOTE
        # 3a). First compute the per-sample mean values of `existence_weights_matched` (averaging over the
        #      existing objects) to obtain `weights_means`.
        #
        # As the target dimension is padded, `torch.mean()` cannot be used for two reasons:
        #   - The reasons discussed above for the summation over objects (i.e. the number of actual objects
        #     does not necessarily correspond to the tensor size)
        #   - `torch.mean()` would divide the sum by a wrong number of elements for samples containing
        #     filler elements
        # Instead, we use the method provided by the `batching-helpers` package:
        weights_means = batching_helpers.average_over_targets(existence_weights_matched)

        # @NOTE
        # 3b). Then, compute per-sample `imbalance_factors` compensating for the imbalance between existing
        #      and non-existing objects.
        #
        # First, obtain the number of predictions.
        num_preds = bboxes_pred.shape[1]
        # @NOTE
        # Then, compute the imbalance correction factor as follows:
        #   - Divide by the number of non-existent targets
        #     (i.e. `num_preds - overlap_weights_matched.sample_sizes`)
        #   - Multiply by the number of existing targets (i.e. `overlap_weights_matched.sample_sizes`)
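        # Note that if the number of non-existent targets is zero, the division yields `inf` (or `NaN` if
        # there are no predictions at all). The `nan_to_num()` function maps these to finite values; in
        # that case, there are no non-matched predictions, so the factor does not affect the final weights.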
        imbalance_factors = torch.nan_to_num(
            overlap_weights_matched.sample_sizes / (num_preds - overlap_weights_matched.sample_sizes), 0.0
        )

        # @NOTE
        # 3c). Multiply the `weights_means` with `imbalance_factors` to obtain `weights_mean_adjusted`.
        weights_mean_adjusted = weights_means * imbalance_factors

        # @NOTE
        # 3d). Initialize `existence_weights_preds` (which contains the weights for all predictions and is of
        #      corresponding shape) with the values from `weights_mean_adjusted`. These initial values are the
        #      weights for the non-matched predictions.
        existence_weights_preds = weights_mean_adjusted.unsqueeze(-1).repeat(1, classes_pred.shape[1])

        # @NOTE
        # 3e). Write the values from `existence_weights_matched` into `existence_weights_preds` for the
        #      matched predictions (i.e. use the weights in `existence_weights_matched` for those), while
        #      leaving the other values unchanged.
        #
        # Note that the `batched_indexing_write()` function is equivalent to `__setitem__()` for the unbatched
        # (single-sample) case using the built-in tensor indexing operator.
        existence_weights_preds = batching_helpers.batched_indexing_write(
            existence_weights_matched, matches_pred, existence_weights_preds
        )

        # @NOTE
        # Compute existence loss (considering all predictions, not only the matched ones).
        # Note that the loss has uniform size, and therefore we can directly use `torch.sum()`
        # to sum over the objects.
        existence_per_object_loss = self._per_object_existence_loss(
            existence_pred, existence_mask, existence_weights_preds
        )
        existence_loss = torch.sum(existence_per_object_loss, 1)

        # @NOTE
        # Sum up all losses & return result.
        loss = class_loss + bbox_loss + existence_loss
        return loss

    # Example loss function for the loss computation. This is not the focus of the example.
    @staticmethod
    def _per_object_class_l1_loss_labels_gt(classes_gt, classes_pred, weights):

        def per_object_class_l1_loss_one_hot_gt(classes_gt, classes_pred, weights):

            diff = torch.abs(classes_gt - classes_pred)
            weighted_diff = weights.unsqueeze(-1) * diff

            # Compute the sum over the classes
            res = torch.sum(weighted_diff, dim=2)

            return res

        # Note: This part of the loss computation is not batched. However, we do not focus on loss
        # function implementation here and in a practical application, the loss can be implemented
        # differently (e.g. custom PyTorch extension).
        num_classes = classes_pred.shape[-1]
        batch_size = classes_gt.shape[0]
        res = [None] * batch_size
        for s, gt in enumerate(classes_gt):
            res_s = torch.zeros((gt.shape[0], num_classes), dtype=torch.float32, device=gt.device)
            for i, label in enumerate(gt):
                res_s[i, label] = 1.0
            res[s] = res_s
        classes_gt_one_hot = torch.stack(res, dim=0)
        # end of non_batched part
        res = per_object_class_l1_loss_one_hot_gt(classes_gt_one_hot, classes_pred, weights)
        return res

    # Example batched loss function. It is used in the example, but the implementation
    # of this function is not the focus of the example.
    @staticmethod
    def _per_object_bbox_overlap_loss(bboxes_gt, bboxes_pred, weights, eps=1e-6):
        areas_gt = torch.prod(bboxes_gt[..., 2:4] - bboxes_gt[..., 0:2], axis=-1)
        areas_pred = torch.prod(bboxes_pred[..., 2:4] - bboxes_pred[..., 0:2], axis=-1)

        rects_gt_ul = bboxes_gt[..., 0:2]
        rects_gt_lr = bboxes_gt[..., 2:4]
        rects_pred_ul = bboxes_pred[..., 0:2]
        rects_pred_lr = bboxes_pred[..., 2:4]

        intersections_ul = torch.max(rects_gt_ul, rects_pred_ul)
        intersections_lr = torch.min(rects_gt_lr, rects_pred_lr)
        sizes_intersections = intersections_lr - intersections_ul
        sizes_intersections[sizes_intersections < 0.0] = 0.0
        areas_intersections = torch.prod(sizes_intersections, axis=-1)

        areas_union = areas_gt + areas_pred - areas_intersections
        areas_union[areas_union < eps] = eps

        target_loss = 1.0 - areas_intersections / areas_union

        target_loss = target_loss * weights

        return target_loss

    # Example batched loss function. It is used in the example, but the implementation
    # of this function is not the focus of the example.
    @staticmethod
    def _per_object_existence_loss(existence_pred, existence_mask, weights):
        existence_gt = existence_mask.to(dtype=torch.float32)
        diff = torch.abs(existence_pred - existence_gt)
        loss = weights * diff
        return loss
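
The two inputs of the existence loss, i.e. the existence mask and the per-prediction weights, can be illustrated in plain PyTorch as follows. This sketch is not the implementation of `get_mask_from_indices()` or `batched_indexing_write()`; it assumes padded index tensors and redirects the filler indices to an extra dummy column, which is removed afterwards. All values are made up for the example.

```python
import torch

num_preds = 5
# Padded matched prediction indices; sample 0 has 2 matches, sample 1 has 3
matched_pred_idx = torch.tensor([[0, 3, 0], [1, 2, 4]])
num_matches = torch.tensor([2, 3])
# Weights for the matched predictions (the last entry of sample 0 is a filler)
weights_matched = torch.tensor([[0.9, 0.7, 0.0], [0.8, 0.6, 0.4]])

# Redirect filler indices to a dummy column `num_preds`, so that they cannot
# overwrite entries of real predictions
valid = torch.arange(matched_pred_idx.shape[1]).unsqueeze(0) < num_matches.unsqueeze(1)
idx = torch.where(valid, matched_pred_idx, torch.full_like(matched_pred_idx, num_preds))

# Existence mask: True for matched predictions, False otherwise
hits = torch.zeros((2, num_preds + 1))
hits.scatter_(1, idx, 1.0)
existence_mask = hits[:, :num_preds] > 0

# Weights for all predictions: default value (0.1) for non-matched predictions,
# matched weights written at the matched indices
weights_ext = torch.full((2, num_preds + 1), 0.1)
weights_ext.scatter_(1, idx, weights_matched)
weights_all = weights_ext[:, :num_preds]
```

In the example implementation above, the default value for the non-matched predictions is not a constant, but the per-sample adjusted mean of the matched weights.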