.. _autotuning-section-ref:

Autotuning
====================

One of the main features available in cuDecomp is the ability to perform runtime autotuning of the process grid dimensions used to partition the global domain and of the communication backends used for transpose and/or halo communication. This feature enables users to run the library with the best performing configuration for a given global domain size, number of tasks, and compute cluster topology. The autotuner aims to select decomposition and communication backend options that minimize transpose and halo communication time.

Autotuning process
------------------

The autotuning process in cuDecomp can be logically split into two categories:

1. process grid autotuning
2. communication backend autotuning

Process grid autotuning refers to selecting the :math:`P_{\text{rows}} \times P_{\text{cols}}` dimensions of the process grid, among the possible combinations with :math:`P_{\text{rows}} \times P_{\text{cols}} = N_{\text{GPU}}` (for example, with :math:`N_{\text{GPU}} = 4`, the candidates are :math:`1 \times 4`, :math:`2 \times 2`, and :math:`4 \times 1`). Communication backend autotuning refers to selecting the transpose and/or halo backends used for communication between processes.

With all autotuning options enabled (i.e. full autotuning), cuDecomp runs the autotuning process in two phases. During the first phase, cuDecomp runs all possible pairs of process grid dimensions and communication backends of a user-selected type (either transpose or halo communication), identifying and selecting the pair that achieves the lowest average runtime out of 5 measured trials. For transpose communication backends, the trial time is the time it takes to complete the full set of transposes (XToY, YToZ, ZToY, YToX). For halo communication backends, the trial time is the time it takes to complete a full set of halo updates (in all three grid directions), for a user-defined halo configuration. Once a process grid and communication backend pair is selected that minimizes communication time of the user-selected type, the autotuner runs a second phase to select a communication backend for the remaining communication type. In this phase, the process grid selected during the first phase is fixed, and the backend with the minimum average runtime over 5 measured trials is selected.

In lieu of autotuning all options, users can also fix the process grid or communication options to limit the autotuning (for example, fixing the process grid and autotuning only the transpose and halo communication backends), as shown in the sketch below.
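This limited setup can be expressed in a few lines, as in the following minimal sketch. The process grid values here are illustrative only, and :code:`config` and :code:`options` are the configuration and autotune options structs introduced in the next section:

.. code-block:: c++

   // Sketch: fix the process grid to 2 x 2 and autotune only the
   // communication backends (config and options are the structs
   // described in the following section; values are illustrative).
   config.pdims[0] = 2;                       // fixed P_rows
   config.pdims[1] = 2;                       // fixed P_cols
   options.autotune_transpose_backend = true; // autotune transpose backend
   options.autotune_halo_backend = true;      // autotune halo backend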
We will illustrate the autotuning process in the following sections.

Autotuning usage
----------------

In this section, we will use a modified version of the basic usage example (`C++ `_, `Fortran `_), explaining the changes required to enable the autotuning feature.

Creating a grid descriptor with autotuning enabled
__________________________________________________

Enabling autotuning requires additional steps prior to the grid descriptor creation. To enable process grid autotuning, instead of setting the :code:`pdims` entries in the configuration struct to fixed values, initialize them to zero. This indicates to cuDecomp that process grid autotuning is desired. Otherwise, if the :code:`pdims` entries are set to desired process grid dimensions, process grid autotuning is disabled and these dimensions remain fixed during the autotuning process.

.. tabs::

   .. code-tab:: c++

      config.pdims[0] = 0; // P_rows
      config.pdims[1] = 0; // P_cols

   .. code-tab:: fortran

      config%pdims = [0, 0] ! [P_rows, P_cols]

In addition to this modification of :code:`pdims`, an autotuning options structure, :ref:`cudecompGridDescAutotuneOptions_t-ref`, must be created, populated, and passed as an additional argument to :ref:`cudecompGridDescCreate-ref`. Create an uninitialized autotune options struct and initialize it to defaults using :ref:`cudecompGridDescAutotuneOptionsSetDefaults-ref`. Initializing this struct to default values is required to ensure no entries are left uninitialized.

.. tabs::

   .. code-tab:: c++

      cudecompGridDescAutotuneOptions_t options;
      CHECK_CUDECOMP_EXIT(cudecompGridDescAutotuneOptionsSetDefaults(&options));

   .. code-tab:: fortran

      type(cudecompGridDescAutotuneOptions) :: options
      ...
      istat = cudecompGridDescAutotuneOptionsSetDefaults(options)
      call CHECK_CUDECOMP_EXIT(istat)

First, let's go over the general autotuning options that affect both process grid and communication backend autotuning.

The :code:`n_warmup_trials` and :code:`n_trials` entries in the options struct control the number of warmup and timed trials, respectively, run for each tested configuration. Here we set them to their default values.

.. tabs::

   .. code-tab:: c++

      options.n_warmup_trials = 3;
      options.n_trials = 5;

   .. code-tab:: fortran

      options%n_warmup_trials = 3
      options%n_trials = 5

The :code:`dtype` entry in the options struct controls which data type cuDecomp will use for autotuning.

.. tabs::

   .. code-tab:: c++

      options.dtype = CUDECOMP_DOUBLE;

   .. code-tab:: fortran

      options%dtype = CUDECOMP_DOUBLE

The :code:`disable_nccl_backends` and :code:`disable_nvshmem_backends` entries are boolean flags controlling whether the autotuner will test transpose and halo communication backends using the NCCL or NVSHMEM libraries, respectively. By default, these flags are set to false and the NCCL and NVSHMEM backends are enabled.

.. tabs::

   .. code-tab:: c++

      options.disable_nccl_backends = false;
      options.disable_nvshmem_backends = false;

   .. code-tab:: fortran

      options%disable_nccl_backends = .false.
      options%disable_nvshmem_backends = .false.

The :code:`skip_threshold` entry allows the autotuner to rapidly skip slow-performing configurations. In particular, the autotuner will skip testing a configuration if :code:`skip_threshold * t > t_best`, where :code:`t` is the duration of the first timed trial for the configuration and :code:`t_best` is the average trial time of the current best configuration. By default, the threshold is set to zero, which disables any skipping. More aggressive skipping can be useful in cases where exhaustive testing of all possible configurations is too expensive.

.. tabs::

   .. code-tab:: c++

      options.skip_threshold = 0.0;

   .. code-tab:: fortran

      options%skip_threshold = 0.0
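To illustrate the skip condition (the value here is arbitrary and not used in this example), a threshold of :code:`0.5` causes the autotuner to skip any configuration whose first timed trial takes more than twice the current best average:

.. code-block:: c++

   // Illustrative value: with skip_threshold = 0.5, a configuration is
   // skipped if 0.5 * t > t_best, i.e. if its first timed trial t takes
   // more than twice the current best average t_best.
   options.skip_threshold = 0.5;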
Moving on, these are the options specific to process grid autotuning.

The :code:`grid_mode` entry controls which type of communication (transpose or halo) is used to autotune the process grid dimensions (see :ref:`cudecompAutotuneGridMode_t-ref`). By default, transpose communication is used.

.. tabs::

   .. code-tab:: c++

      options.grid_mode = CUDECOMP_AUTOTUNE_GRID_TRANSPOSE;

   .. code-tab:: fortran

      options%grid_mode = CUDECOMP_AUTOTUNE_GRID_TRANSPOSE

The :code:`allow_uneven_decompositions` entry is a boolean flag controlling whether the autotuner will test process grid dimensions that result in uneven distributions of data (i.e. grids where pencil shapes are not identical across ranks, such as a dimension of 10 elements split into chunks of 3, 3, 2, and 2 across four ranks). By default, this flag is set to :code:`true` and uneven distributions are allowed.

.. tabs::

   .. code-tab:: c++

      options.allow_uneven_decompositions = true;

   .. code-tab:: fortran

      options%allow_uneven_decompositions = .true.

Next, these are the options specific to transpose communication backend autotuning.

The :code:`autotune_transpose_backend` entry is a boolean flag controlling whether the autotuner will autotune the communication backend used for transposes. By default, this flag is :code:`false` and the transpose communication backend is fixed to the value set within the configuration struct during the autotuning process. In this example, we set it to true to enable transpose backend autotuning.

.. tabs::

   .. code-tab:: c++

      options.autotune_transpose_backend = true;

   .. code-tab:: fortran

      options%autotune_transpose_backend = .true.

The :code:`transpose_use_inplace_buffers` entry is an array of boolean flags that controls, on a per-operation basis, whether the transpose communication during autotuning is performed in-place or out-of-place. This choice can impact transpose performance because some optimized paths skip intermediate local operations, depending on the input/output buffer locations. For example, :code:`cudecompTransposeXToY` can be a no-op if:

1. the process grid yields a decomposition with :math:`XY`-slabs (i.e. distributed along the :math:`Z`-axis only)
2. the :math:`X`- and :math:`Y`-pencils are not in the permuted :code:`axis_contiguous` layout
3. the transposition is performed in-place

In this configuration, the :math:`X`- and :math:`Y`-pencil buffers are in identical layouts and contain the same data elements from the global grid. Since the transpose is in-place, the input buffer is already the output buffer, and no operation is performed. In contrast, an out-of-place transpose would require a copy of data between the input and output buffers. A sketch of the in-place case follows.
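Concretely, an in-place transpose passes the same buffer as both the input and output argument, as in this sketch modeled on the basic usage example (the :code:`data` and :code:`work` device buffers are assumed to be allocated as described there, and the call form follows that example; exact trailing arguments may vary across library versions):

.. code-block:: c++

   // Sketch: an in-place X-to-Y transpose passes the same buffer as both
   // input and output; under the XY-slab conditions listed above, this
   // call can become a no-op. (data and work are assumed to be device
   // buffers set up as in the basic usage example.)
   CHECK_CUDECOMP_EXIT(cudecompTransposeXToY(handle, grid_desc, data, data,
                                             work, CUDECOMP_DOUBLE,
                                             nullptr, nullptr, 0));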
In this example, we use in-place buffers for all transpose operations, so we can set all elements of :code:`transpose_use_inplace_buffers` to :code:`true`. By default, the entries are set to :code:`false` and out-of-place buffers are used during autotuning.

.. tabs::

   .. code-tab:: c++

      options.transpose_use_inplace_buffers[0] = true; // use in-place buffers for X-to-Y transpose
      options.transpose_use_inplace_buffers[1] = true; // use in-place buffers for Y-to-Z transpose
      options.transpose_use_inplace_buffers[2] = true; // use in-place buffers for Z-to-Y transpose
      options.transpose_use_inplace_buffers[3] = true; // use in-place buffers for Y-to-X transpose

   .. code-tab:: fortran

      options%transpose_use_inplace_buffers(1) = .true. ! use in-place buffers for X-to-Y transpose
      options%transpose_use_inplace_buffers(2) = .true. ! use in-place buffers for Y-to-Z transpose
      options%transpose_use_inplace_buffers(3) = .true. ! use in-place buffers for Z-to-Y transpose
      options%transpose_use_inplace_buffers(4) = .true. ! use in-place buffers for Y-to-X transpose

The :code:`transpose_op_weights` entry is an array of floating point weights that adjusts the contribution of the different transpose operations to the trial timings used by the autotuner. By default, the trial timings are an unweighted sum of the X-to-Y, Y-to-Z, Z-to-Y, and Y-to-X transpose timings. The entries in :code:`transpose_op_weights` are multiplicative weights applied to the contribution of each transpose operation to the total trial timing. This option is meant for programs that invoke the different transpose operations an unequal number of times and want the autotuner to emphasize the more frequently invoked operations when measuring the performance of a backend and process grid configuration. For example, setting the weight for one of the transpose operations to :code:`0.0` indicates to the autotuner that the timing of that operation should not contribute to the trial time sum; for efficiency, the autotuner also skips running any transpose operation with a weight of :code:`0.0` (see the sketch after the following code block).

In this example, we autotune using the full set of transpose operations, and therefore set all elements of :code:`transpose_op_weights` to :code:`1.0`. Note that this is the default behavior, so there is generally no need to set the elements to :code:`1.0` explicitly.

.. tabs::

   .. code-tab:: c++

      options.transpose_op_weights[0] = 1.0; // apply 1.0 multiplier to X-to-Y transpose timings
      options.transpose_op_weights[1] = 1.0; // apply 1.0 multiplier to Y-to-Z transpose timings
      options.transpose_op_weights[2] = 1.0; // apply 1.0 multiplier to Z-to-Y transpose timings
      options.transpose_op_weights[3] = 1.0; // apply 1.0 multiplier to Y-to-X transpose timings

   .. code-tab:: fortran

      options%transpose_op_weights(1) = 1.0 ! apply 1.0 multiplier to X-to-Y transpose timings
      options%transpose_op_weights(2) = 1.0 ! apply 1.0 multiplier to Y-to-Z transpose timings
      options%transpose_op_weights(3) = 1.0 ! apply 1.0 multiplier to Z-to-Y transpose timings
      options%transpose_op_weights(4) = 1.0 ! apply 1.0 multiplier to Y-to-X transpose timings
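As an illustration of a non-default weighting (not used in this example), a program that only performs X-to-Y and Y-to-X transposes could exclude the other pair from the trial timings:

.. code-block:: c++

   // Illustrative: exclude the Y-to-Z and Z-to-Y operations from the trial
   // timings; the autotuner also skips running zero-weight operations.
   options.transpose_op_weights[1] = 0.0; // ignore Y-to-Z transpose timings
   options.transpose_op_weights[2] = 0.0; // ignore Z-to-Y transpose timings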
There are additional entries to supply per-transpose-operation :code:`input_halo_extents`, :code:`output_halo_extents`, :code:`input_padding`, and :code:`output_padding` arguments for use during autotuning: :code:`transpose_input_halo_extents`, :code:`transpose_output_halo_extents`, :code:`transpose_input_padding`, and :code:`transpose_output_padding`, respectively. These entries default to supplying zero halo extent and padding arguments to the transpose operations. In this example, we want the autotuner to supply non-zero :code:`input_halo_extents` and :code:`output_halo_extents` arguments to the transpose operations involving the :math:`X`-pencil.

.. tabs::

   .. code-tab:: c++

      options.transpose_input_halo_extents[0][0] = 1; // set input_halo_extents to [1, 1, 1] for X-to-Y transpose
      options.transpose_input_halo_extents[0][1] = 1;
      options.transpose_input_halo_extents[0][2] = 1;
      options.transpose_output_halo_extents[3][0] = 1; // set output_halo_extents to [1, 1, 1] for Y-to-X transpose
      options.transpose_output_halo_extents[3][1] = 1;
      options.transpose_output_halo_extents[3][2] = 1;

   .. code-tab:: fortran

      options%transpose_input_halo_extents(:, 1) = [1, 1, 1] ! set input_halo_extents to [1, 1, 1] for X-to-Y transpose
      options%transpose_output_halo_extents(:, 4) = [1, 1, 1] ! set output_halo_extents to [1, 1, 1] for Y-to-X transpose

All other transpose operations have zero halo extents and padding arguments, so we leave those entries at their default values.

Lastly, these are the options specific to halo communication backend autotuning.

The :code:`autotune_halo_backend` entry is a boolean flag controlling whether the autotuner will autotune the communication backend used for halo exchanges. By default, this flag is :code:`false` and the halo communication backend is fixed to the value set within the configuration struct during the autotuning process. In this example, we set it to true to enable halo backend autotuning.

.. tabs::

   .. code-tab:: c++

      options.autotune_halo_backend = true;

   .. code-tab:: fortran

      options%autotune_halo_backend = .true.

The :code:`halo_extents`, :code:`halo_periods`, and :code:`halo_axis` entries define the halo configuration to use during halo autotuning. See the documentation on the halo communication routines, like :ref:`cudecompUpdateHalosX-ref`, for descriptions of these fields. In this example, we autotune for :math:`X`-pencil halo exchanges with one halo element in each direction and periodic boundaries.

.. tabs::

   .. code-tab:: c++

      options.halo_axis = 0;
      options.halo_extents[0] = 1;
      options.halo_extents[1] = 1;
      options.halo_extents[2] = 1;
      options.halo_periods[0] = true;
      options.halo_periods[1] = true;
      options.halo_periods[2] = true;

   .. code-tab:: fortran

      options%halo_axis = 1
      options%halo_extents = [1, 1, 1]
      options%halo_periods = [.true., .true., .true.]

There is an additional entry, :code:`halo_padding`, to supply the padding argument to use during halo autotuning. This entry defaults to supplying zero padding. In this example, we do not supply non-zero padding for the halo operations, so we leave this entry at its default value.
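For reference only (this example keeps the default, and a three-element per-dimension layout matching :code:`halo_extents` is assumed here), supplying non-zero padding would look like:

.. code-block:: c++

   // Illustrative only; not used in this example. Assumes halo_padding
   // uses a three-element per-dimension layout like halo_extents.
   options.halo_padding[0] = 2;
   options.halo_padding[1] = 0;
   options.halo_padding[2] = 0;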
With the grid descriptor configuration and autotuning options structures created and populated, we can now create the grid descriptor with autotuning.

.. tabs::

   .. code-tab:: c++

      cudecompGridDesc_t grid_desc;
      CHECK_CUDECOMP_EXIT(cudecompGridDescCreate(handle, &grid_desc, &config, &options));

   .. code-tab:: fortran

      istat = cudecompGridDescCreate(handle, grid_desc, config, options)
      call CHECK_CUDECOMP_EXIT(istat)

Autotuner output and querying results
_____________________________________

When autotuning is enabled, cuDecomp will produce additional output to stdout to report on the autotuning process, providing trial timings and similar information on the tested configurations. For example, running this example on a 4 GPU system will produce output as follows.

First, the autotuner runs the first phase, which in this case performs process grid autotuning and transpose backend autotuning (since we set :code:`grid_mode = CUDECOMP_AUTOTUNE_GRID_TRANSPOSE`). The output generated from this phase will look like the following:

.. code-block:: none
   :emphasize-lines: 2-7, 52

   CUDECOMP: Running transpose autotuning...
   CUDECOMP: grid: 1 x 4, backend: MPI_P2P
   CUDECOMP: Total time min/max/avg/std [ms]: 0.266102/0.276084/0.270158/0.003797
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.018432/0.026624/0.020941/0.002208
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.101376/0.110592/0.104806/0.002341
   CUDECOMP: TransposeZY time min/max/avg/std [ms]: 0.095232/0.101376/0.097229/0.001602
   CUDECOMP: TransposeYX time min/max/avg/std [ms]: 0.015360/0.020480/0.017459/0.001354
   CUDECOMP: grid: 1 x 4, backend: MPI_P2P (pipelined)
   CUDECOMP: Total time min/max/avg/std [ms]: 0.456339/0.483480/0.467483/0.011253
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.018432/0.024576/0.020531/0.001354
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.188416/0.196608/0.191488/0.002243
   CUDECOMP: TransposeZY time min/max/avg/std [ms]: 0.194560/0.229376/0.207411/0.011130
   CUDECOMP: TransposeYX time min/max/avg/std [ms]: 0.016384/0.022528/0.019046/0.001498
   CUDECOMP: grid: 1 x 4, backend: MPI_A2A
   CUDECOMP: Total time min/max/avg/std [ms]: 0.253752/0.275133/0.262857/0.006987
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.017408/0.021504/0.019661/0.001054
   ...
   CUDECOMP: grid: 2 x 2, backend: MPI_P2P
   CUDECOMP: Total time min/max/avg/std [ms]: 0.302244/0.306223/0.303693/0.001211
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.067584/0.078848/0.072704/0.003123
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.068608/0.081920/0.073165/0.003569
   CUDECOMP: TransposeZY time min/max/avg/std [ms]: 0.060416/0.067584/0.063949/0.002278
   CUDECOMP: TransposeYX time min/max/avg/std [ms]: 0.059392/0.077824/0.069530/0.005864
   CUDECOMP: grid: 2 x 2, backend: MPI_P2P (pipelined)
   CUDECOMP: Total time min/max/avg/std [ms]: 0.346133/0.354265/0.350742/0.002535
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.073728/0.087040/0.080538/0.005184
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.072704/0.093184/0.082586/0.006470
   CUDECOMP: TransposeZY time min/max/avg/std [ms]: 0.072704/0.080896/0.076851/0.002941
   CUDECOMP: TransposeYX time min/max/avg/std [ms]: 0.070656/0.098304/0.083558/0.008859
   CUDECOMP: grid: 2 x 2, backend: MPI_A2A
   CUDECOMP: Total time min/max/avg/std [ms]: 0.289410/0.320966/0.298509/0.011557
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.064512/0.074752/0.070093/0.003197
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.065536/0.084992/0.073011/0.005610
   ...
   CUDECOMP: grid: 4 x 1, backend: MPI_P2P
   CUDECOMP: Total time min/max/avg/std [ms]: 0.227092/0.233280/0.229325/0.002050
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.092160/0.099328/0.095181/0.001956
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.011264/0.016384/0.013005/0.001556
   CUDECOMP: TransposeZY time min/max/avg/std [ms]: 0.009216/0.012288/0.010240/0.000971
   CUDECOMP: TransposeYX time min/max/avg/std [ms]: 0.083968/0.095232/0.087910/0.003042
   CUDECOMP: grid: 4 x 1, backend: MPI_P2P (pipelined)
   CUDECOMP: Total time min/max/avg/std [ms]: 0.355253/0.363846/0.358656/0.003062
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.147456/0.155648/0.150938/0.002736
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.011264/0.015360/0.013363/0.001354
   CUDECOMP: TransposeZY time min/max/avg/std [ms]: 0.010240/0.014336/0.011878/0.001566
   CUDECOMP: TransposeYX time min/max/avg/std [ms]: 0.152576/0.172032/0.158720/0.005909
   CUDECOMP: grid: 4 x 1, backend: MPI_A2A
   CUDECOMP: Total time min/max/avg/std [ms]: 0.220565/0.226522/0.224512/0.002049
   CUDECOMP: TransposeXY time min/max/avg/std [ms]: 0.075776/0.095232/0.086118/0.008030
   CUDECOMP: TransposeYZ time min/max/avg/std [ms]: 0.010240/0.015360/0.012493/0.001539
   ...
   CUDECOMP: SELECTED: grid: 4 x 1, backend: NCCL, Avg. time 0.138808
   CUDECOMP: transpose autotuning time [s]: 1.589209

The first highlighted block of output shows the autotuning trial results for one tested configuration, in this case a :math:`1 \times 4` process grid paired with the :code:`MPI_P2P` (i.e. :code:`CUDECOMP_TRANSPOSE_COMM_MPI_P2P`) transpose communication backend. The total time to complete all transposes is listed first, with the minimum, maximum, average, and standard deviation over the trials printed. Following this, a further breakdown of the transpose timings by operation is listed to provide additional insight into the performance. The autotuner then proceeds to try the other possible process grid and transpose communication backend pairs, in this case continuing on to test the :math:`2 \times 2` and :math:`4 \times 1` process grid options.
After all the configurations are tested, the autotuner selects the process grid and transpose communication backend pair that achieves the lowest average trial time and reports the selection, shown by the highlighted line at the end of the block. In this case, it selected a :math:`4 \times 1` process grid using the NCCL (i.e. :code:`CUDECOMP_TRANSPOSE_COMM_NCCL`) backend.

If autotuning of the other type of communication is requested, the autotuning procedure moves on to the second phase, selecting the best communication backend for that communication type using the process grid selected in the first phase. In this example, the second phase of autotuning selects a halo communication backend to use on the selected :math:`4 \times 1` process grid.

.. code-block:: none
   :emphasize-lines: 3-4, 13

   CUDECOMP: Running halo autotuning...
   CUDECOMP: Autotune halo axis: x
   CUDECOMP: grid: 4 x 1, halo backend: MPI
   CUDECOMP: Total time min/max/avg/std [ms]: 0.068239/0.074815/0.070960/0.002477
   CUDECOMP: grid: 4 x 1, halo backend: MPI (blocking)
   CUDECOMP: Total time min/max/avg/std [ms]: 0.073353/0.085638/0.077625/0.003406
   CUDECOMP: grid: 4 x 1, halo backend: NCCL
   CUDECOMP: Total time min/max/avg/std [ms]: 0.053063/0.063682/0.057200/0.003232
   CUDECOMP: grid: 4 x 1, halo backend: NVSHMEM
   CUDECOMP: Total time min/max/avg/std [ms]: 0.050031/0.052668/0.051291/0.000742
   CUDECOMP: grid: 4 x 1, halo backend: NVSHMEM (blocking)
   CUDECOMP: Total time min/max/avg/std [ms]: 0.063190/0.067428/0.065849/0.001215
   CUDECOMP: SELECTED: grid: 4 x 1, halo backend: NVSHMEM, Avg. time [s] 0.051291
   CUDECOMP: halo autotuning time [s]: 0.227950

The first highlighted block shows the results for one tested configuration, in this case the :code:`MPI` (i.e. :code:`CUDECOMP_HALO_COMM_MPI`) halo communication backend operating on a :math:`4 \times 1` process grid. The total time to complete the full set of halo exchanges for the user-selected pencil axis is reported, similar to the transpose trials. The autotuner proceeds to test the other halo communication backend options, selects the one achieving the lowest average trial time, and reports the selection in the final highlighted line. In this case, it selected the NVSHMEM (i.e. :code:`CUDECOMP_HALO_COMM_NVSHMEM`) halo communication backend.

After the autotuning process is complete, the grid descriptor is created and ready to use. Entries in the configuration struct provided to :code:`cudecompGridDescCreate` corresponding to autotuned fields (:code:`pdims`, :code:`transpose_comm_backend`, and :code:`halo_comm_backend`) are updated to reflect the autotuning selections. Thus, one can run code like the following to inspect and report the final configuration used:

.. tabs::

   .. code-tab:: c++

      if (rank == 0) {
        printf("running on %d x %d process grid...\n", config.pdims[0], config.pdims[1]);
        printf("running using %s transpose backend...\n",
               cudecompTransposeCommBackendToString(config.transpose_comm_backend));
        printf("running using %s halo backend...\n",
               cudecompHaloCommBackendToString(config.halo_comm_backend));
      }
   .. code-tab:: fortran

      if (rank == 0) then
         write(*,"('running on ', i0, ' x ', i0, ' process grid ...')") config%pdims(1), config%pdims(2)
         write(*,"('running using ', a, ' transpose backend ...')") &
               cudecompTransposeCommBackendToString(config%transpose_comm_backend)
         write(*,"('running using ', a, ' halo backend ...')") &
               cudecompHaloCommBackendToString(config%halo_comm_backend)
      endif

As autotuning only impacts grid descriptor creation, the rest of the library usage is unchanged from that illustrated in the basic usage section.