Autotuning

One of the main features of cuDecomp is the ability to perform runtime autotuning of the process grid dimensions used to partition the global domain and of the communication backends used for transpose and/or halo communication. This feature enables users to run the library with the best-performing configuration for a given global domain size, number of tasks, and compute cluster topology. The autotuner aims to select the decomposition and communication backend options that minimize transpose and halo communication time.

Autotuning process

The autotuning process in cuDecomp can be logically split into two categories:

  1. process grid autotuning

  2. communication backend autotuning

Process grid autotuning refers to selecting the \(P_{\text{rows}} \times P_{\text{cols}}\) dimensions of the process grid, among the possible combinations satisfying \(P_{\text{rows}} \times P_{\text{cols}} = N_{\text{GPU}}\). For example, with \(N_{\text{GPU}} = 8\), the candidate grids are \(1 \times 8\), \(2 \times 4\), \(4 \times 2\), and \(8 \times 1\).

Communication backend autotuning refers to selecting the transpose and/or halo backends used for communication between processes.

With all autotuning options enabled (i.e. full autotuning), cuDecomp will run the autotuning process in two phases.

During the first phase, cuDecomp will test all possible pairs of process grid dimensions and communication backends of a user-selected type (either transpose or halo communication), identifying and selecting the pair that achieves the lowest average runtime over the measured trials (five by default). For transpose communication backends, the trial time is the time it takes to complete the full set of transposes (XToY, YToZ, ZToY, YToX). For halo communication backends, the trial time is the time it takes to complete a full set of halo updates (in all three grid directions) for a user-defined halo configuration.

Once a process grid and communication backend pair is selected that minimizes communication time of the user-selected type, the autotuner runs a second phase to select a communication backend for the other communication type. In this phase, the process grid selected during the first phase is fixed, and the backend with the minimum average runtime over the measured trials is selected.

In lieu of autotuning all options, users can also fix the process grid or communication options to limit the autotuning (for example, fixing the process grid and autotuning the transpose and halo communication backends only).
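
For example, a minimal sketch of this partial autotuning (using the configuration and options structs introduced in the next section), fixing the process grid while still autotuning both communication backends:

config.pdims[0] = 2; // fix P_rows; non-zero pdims disable process grid autotuning
config.pdims[1] = 4; // fix P_cols (a 2 x 4 grid on 8 GPUs)

options.autotune_transpose_backend = true; // still autotune the transpose backend
options.autotune_halo_backend = true;      // still autotune the halo backend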

We will illustrate this process in the following sections.

Autotuning usage

In this section, we will use a modified version of the basic usage example, explaining the changes required to enable the autotuning feature.

Creating a grid descriptor with autotuning enabled

Enabling autotuning requires additional steps prior to the grid descriptor creation.

To enable process grid autotuning, instead of setting pdims entries in the configuration struct to fixed values, initialize them to zero. This indicates to cuDecomp that process grid autotuning is desired. Otherwise, if the pdims entries are set to desired process grid dimensions, process grid autotuning is disabled and these will remain fixed during the autotuning process.

config.pdims[0] = 0; // P_rows
config.pdims[1] = 0; // P_cols

In addition to this modification of pdims, an autotuning options structure, cudecompGridDescAutotuneOptions_t, must be created, populated, and passed as an additional argument to cudecompGridDescCreate.

Create an uninitialized autotune option struct and initialize it to defaults using cudecompGridDescAutotuneOptionsSetDefaults. Initializing this struct to default values is required to ensure no entries are left uninitialized.

cudecompGridDescAutotuneOptions_t options;
CHECK_CUDECOMP_EXIT(cudecompGridDescAutotuneOptionsSetDefaults(&options));

First, let’s go over the general autotuning options that affect both process grid and communication backend autotuning.

The n_warmup_trials and n_trials entries in the options struct control the number of warmup and timed trials run for each tested configuration, respectively. Here we set them to their default values.

options.n_warmup_trials = 3;
options.n_trials = 5;

The dtype entry in the options struct controls which data type cuDecomp will use for autotuning.

options.dtype = CUDECOMP_DOUBLE;

The disable_nccl_backends and disable_nvshmem_backends entries are boolean flags controlling whether the autotuner will test transpose and halo communication backends using the NCCL or NVSHMEM libraries respectively. By default, these flags are set to false and NCCL and NVSHMEM backends are enabled.

options.disable_nccl_backends = false;
options.disable_nvshmem_backends = false;

The skip_threshold entry allows the autotuner to rapidly skip slow-performing configurations. In particular, the autotuner will skip testing a configuration if skip_threshold * t > t_best, where t is the duration of the first timed trial for the configuration and t_best is the average trial time of the current best configuration. By default, the threshold is set to zero, which disables any skipping. More aggressive skipping can be useful in cases where exhaustive testing of all possible configurations is too expensive.

options.skip_threshold = 0.0;
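
As an illustration (not used in this example), a threshold of 0.5 would skip any configuration whose first timed trial takes more than twice the current best average, since 0.5 * t > t_best is equivalent to t > 2 * t_best:

options.skip_threshold = 0.5; // illustrative: skip if first trial exceeds 2x the current best average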

Moving on, these are the options specific to process grid autotuning.

The grid_mode entry controls which type of communication (transpose or halo) to use to autotune the process grid dimensions (see cudecompAutotuneGridMode_t). By default, transpose communication is used.

options.grid_mode = CUDECOMP_AUTOTUNE_GRID_TRANSPOSE;

The allow_uneven_decompositions entry is a boolean flag controlling whether the autotuner will test process grid dimensions that result in uneven distributions of data (i.e. grids where pencil shapes are not identical across ranks). By default, this flag is set to true and uneven distributions are allowed.

options.allow_uneven_decompositions = true;

Next, these are the options specific to transpose communication backend autotuning.

The autotune_transpose_backend entry is a boolean flag controlling whether the autotuner will autotune the communication backend used for transposes. By default, this flag is false and the transpose communication backend is fixed to the value set within the configuration struct during the autotuning process. In this example, we set it to true to enable transpose backend autotuning.

options.autotune_transpose_backend = true;

The transpose_use_inplace_buffers entry is an array of boolean flags that controls whether the transpose communication during autotuning is performed in-place or out-of-place, on a per-operation basis. This choice can impact transpose performance, as optimized paths that skip intermediate local operations are available in some situations, depending on the input/output buffer locations.

For example, cudecompTransposeXToY can be a no-op if:

  1. the process grid yields a decomposition with \(XY\)-slabs (i.e. distributed along the \(Z\)-axis only)

  2. the \(X\)- and \(Y\)-pencils are not in the permuted axis_contiguous layout

  3. the transposition is performed in-place

In this configuration, the \(X\)- and \(Y\)-pencil buffers are in identical layouts and contain the same data elements from the global grid. Since the transpose is in-place, the input is already the output buffer, and no operation is performed. In contrast, an out-of-place transpose would require a copy of data between the input and output buffers.

In this example, we use in-place buffers for all transpose operations so we can set all elements of transpose_use_inplace_buffers to true. By default, the entries are set to false and out-of-place buffers are used during autotuning.

options.transpose_use_inplace_buffers[0] = true; // use in-place buffers for X-to-Y transpose
options.transpose_use_inplace_buffers[1] = true; // use in-place buffers for Y-to-Z transpose
options.transpose_use_inplace_buffers[2] = true; // use in-place buffers for Z-to-Y transpose
options.transpose_use_inplace_buffers[3] = true; // use in-place buffers for Y-to-X transpose

The transpose_op_weights entry is an array of floating-point weights that adjust the contribution of the different transpose operations to the trial timings used by the autotuner. By default, the trial timing is an unweighted sum of the X-to-Y, Y-to-Z, Z-to-Y, and Y-to-X transpose timings; the entries in transpose_op_weights are multiplicative weights applied to each operation's contribution to this sum. This option is meant for programs that invoke the different transpose operations an unequal number of times and want the autotuner to emphasize the more frequently invoked operations when measuring the performance of a backend and process grid configuration. For example, setting the weight of a transpose operation to 0.0 indicates to the autotuner that its timing should not contribute to the trial time sum; for efficiency, the autotuner will also skip running any transpose operation with a weight of 0.0.

In this example, we autotune using the full set of transpose operations and therefore set all elements of transpose_op_weights to 1.0. Note that this is the default behavior, so there is generally no need to set the elements to 1.0 explicitly.

options.transpose_op_weights[0] = 1.0; // apply 1.0 multiplier to X-to-Y transpose timings
options.transpose_op_weights[1] = 1.0; // apply 1.0 multiplier to Y-to-Z transpose timings
options.transpose_op_weights[2] = 1.0; // apply 1.0 multiplier to Z-to-Y transpose timings
options.transpose_op_weights[3] = 1.0; // apply 1.0 multiplier to Y-to-X transpose timings
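
As an illustrative alternative (not used in this example), a program that only performs X-to-Y and Y-to-X transposes could zero out the weights of the other two operations, so that their timings are excluded from the trial sum and the autotuner skips running them:

options.transpose_op_weights[1] = 0.0; // exclude (and skip) Y-to-Z transposes
options.transpose_op_weights[2] = 0.0; // exclude (and skip) Z-to-Y transposes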

Lastly, these are the options specific to halo communication backend autotuning.

The autotune_halo_backend entry is a boolean flag controlling whether the autotuner will autotune the communication backend used for halo exchanges. By default, this flag is false and the halo communication backend is fixed to the value set within the configuration struct during the autotuning process. In this example, we set it to true to enable halo backend autotuning.

options.autotune_halo_backend = true;

The halo_extents, halo_periods, and halo_axis entries define the halo configuration to use during halo autotuning. See the documentation on the halo communication routines, like cudecompUpdateHalosX, for descriptions of these fields. In this example, we autotune for \(X\)-pencil halo exchanges with one halo element in each direction and periodic boundaries.

options.halo_axis = 0;

options.halo_extents[0] = 1;
options.halo_extents[1] = 1;
options.halo_extents[2] = 1;

options.halo_periods[0] = true;
options.halo_periods[1] = true;
options.halo_periods[2] = true;

With the grid descriptor configuration and autotuning options structures created and populated, we can now create the grid descriptor with autotuning.

cudecompGridDesc_t grid_desc;
CHECK_CUDECOMP_EXIT(cudecompGridDescCreate(handle, &grid_desc, &config, &options));
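
For reference, here is a condensed sketch collecting the steps above into a single sequence. It assumes handle was initialized with cudecompInit as in the basic usage example, and the \(64^3\) global grid dimensions are illustrative:

cudecompGridDescConfig_t config;
CHECK_CUDECOMP_EXIT(cudecompGridDescConfigSetDefaults(&config));
config.gdims[0] = 64; // illustrative global grid dimensions
config.gdims[1] = 64;
config.gdims[2] = 64;
config.pdims[0] = 0;  // zeroed pdims enable process grid autotuning
config.pdims[1] = 0;

cudecompGridDescAutotuneOptions_t options;
CHECK_CUDECOMP_EXIT(cudecompGridDescAutotuneOptionsSetDefaults(&options));
options.dtype = CUDECOMP_DOUBLE;
options.autotune_transpose_backend = true;
options.autotune_halo_backend = true;

cudecompGridDesc_t grid_desc;
CHECK_CUDECOMP_EXIT(cudecompGridDescCreate(handle, &grid_desc, &config, &options));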

Autotuner output and querying results

When autotuning is enabled, cuDecomp will produce additional output to stdout to report on the autotuning process, providing trial timings and similar information on the tested configurations.

For example, running this example on a 4-GPU system will produce output as follows. First, the autotuner runs the first phase, which in this case performs process grid autotuning and transpose backend autotuning (since we set grid_mode = CUDECOMP_AUTOTUNE_GRID_TRANSPOSE). The output generated during this phase will look like the following:

CUDECOMP: Running transpose autotuning...
CUDECOMP:       grid: 1 x 4, backend: MPI_P2P
CUDECOMP:       Total time min/max/avg/std [ms]: 0.266102/0.276084/0.270158/0.003797
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.018432/0.026624/0.020941/0.002208
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.101376/0.110592/0.104806/0.002341
CUDECOMP:       TransposeZY time min/max/avg/std [ms]: 0.095232/0.101376/0.097229/0.001602
CUDECOMP:       TransposeYX time min/max/avg/std [ms]: 0.015360/0.020480/0.017459/0.001354
CUDECOMP:       grid: 1 x 4, backend: MPI_P2P (pipelined)
CUDECOMP:       Total time min/max/avg/std [ms]: 0.456339/0.483480/0.467483/0.011253
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.018432/0.024576/0.020531/0.001354
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.188416/0.196608/0.191488/0.002243
CUDECOMP:       TransposeZY time min/max/avg/std [ms]: 0.194560/0.229376/0.207411/0.011130
CUDECOMP:       TransposeYX time min/max/avg/std [ms]: 0.016384/0.022528/0.019046/0.001498
CUDECOMP:       grid: 1 x 4, backend: MPI_A2A
CUDECOMP:       Total time min/max/avg/std [ms]: 0.253752/0.275133/0.262857/0.006987
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.017408/0.021504/0.019661/0.001054

...

CUDECOMP:       grid: 2 x 2, backend: MPI_P2P
CUDECOMP:       Total time min/max/avg/std [ms]: 0.302244/0.306223/0.303693/0.001211
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.067584/0.078848/0.072704/0.003123
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.068608/0.081920/0.073165/0.003569
CUDECOMP:       TransposeZY time min/max/avg/std [ms]: 0.060416/0.067584/0.063949/0.002278
CUDECOMP:       TransposeYX time min/max/avg/std [ms]: 0.059392/0.077824/0.069530/0.005864
CUDECOMP:       grid: 2 x 2, backend: MPI_P2P (pipelined)
CUDECOMP:       Total time min/max/avg/std [ms]: 0.346133/0.354265/0.350742/0.002535
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.073728/0.087040/0.080538/0.005184
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.072704/0.093184/0.082586/0.006470
CUDECOMP:       TransposeZY time min/max/avg/std [ms]: 0.072704/0.080896/0.076851/0.002941
CUDECOMP:       TransposeYX time min/max/avg/std [ms]: 0.070656/0.098304/0.083558/0.008859
CUDECOMP:       grid: 2 x 2, backend: MPI_A2A
CUDECOMP:       Total time min/max/avg/std [ms]: 0.289410/0.320966/0.298509/0.011557
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.064512/0.074752/0.070093/0.003197
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.065536/0.084992/0.073011/0.005610

...

CUDECOMP:       grid: 4 x 1, backend: MPI_P2P
CUDECOMP:       Total time min/max/avg/std [ms]: 0.227092/0.233280/0.229325/0.002050
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.092160/0.099328/0.095181/0.001956
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.011264/0.016384/0.013005/0.001556
CUDECOMP:       TransposeZY time min/max/avg/std [ms]: 0.009216/0.012288/0.010240/0.000971
CUDECOMP:       TransposeYX time min/max/avg/std [ms]: 0.083968/0.095232/0.087910/0.003042
CUDECOMP:       grid: 4 x 1, backend: MPI_P2P (pipelined)
CUDECOMP:       Total time min/max/avg/std [ms]: 0.355253/0.363846/0.358656/0.003062
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.147456/0.155648/0.150938/0.002736
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.011264/0.015360/0.013363/0.001354
CUDECOMP:       TransposeZY time min/max/avg/std [ms]: 0.010240/0.014336/0.011878/0.001566
CUDECOMP:       TransposeYX time min/max/avg/std [ms]: 0.152576/0.172032/0.158720/0.005909
CUDECOMP:       grid: 4 x 1, backend: MPI_A2A
CUDECOMP:       Total time min/max/avg/std [ms]: 0.220565/0.226522/0.224512/0.002049
CUDECOMP:       TransposeXY time min/max/avg/std [ms]: 0.075776/0.095232/0.086118/0.008030
CUDECOMP:       TransposeYZ time min/max/avg/std [ms]: 0.010240/0.015360/0.012493/0.001539

...

CUDECOMP: SELECTED: grid: 4 x 1, backend: NCCL, Avg. time 0.138808
CUDECOMP: transpose autotuning time [s]: 1.589209

The first block of output shows the autotuning trial results for one tested configuration, in this case a \(1 \times 4\) process grid paired with the MPI_P2P (i.e. CUDECOMP_TRANSPOSE_COMM_MPI_P2P) transpose communication backend. The total time to complete all transposes is listed first, with the minimum, maximum, average, and standard deviation over the trials printed. Following this, a further breakdown of the transpose timings by operation is listed to provide additional insight into the performance.

The autotuner then proceeds to try the other possible process grid and transpose communication backend pairs, in this case continuing on to test the \(2 \times 2\) and \(4 \times 1\) process grid options. After all the configurations are tested, the autotuner selects the process grid and transpose communication backend pair that achieves the lowest average trial time and reports the selection, shown in the SELECTED line at the end of the output. In this case, it selected a \(4 \times 1\) process grid using the NCCL (i.e. CUDECOMP_TRANSPOSE_COMM_NCCL) backend.

If autotuning of the other type of communication is requested, the autotuning procedure moves on to the second phase, selecting the best communication backend for that communication type using the process grid selected in the first phase. In this example, the second phase selects a halo communication backend to use on the selected \(4 \times 1\) process grid.

CUDECOMP: Running halo autotuning...
CUDECOMP: Autotune halo axis: x
CUDECOMP:       grid: 4 x 1, halo backend: MPI
CUDECOMP:       Total time min/max/avg/std [ms]: 0.068239/0.074815/0.070960/0.002477
CUDECOMP:       grid: 4 x 1, halo backend: MPI (blocking)
CUDECOMP:       Total time min/max/avg/std [ms]: 0.073353/0.085638/0.077625/0.003406
CUDECOMP:       grid: 4 x 1, halo backend: NCCL
CUDECOMP:       Total time min/max/avg/std [ms]: 0.053063/0.063682/0.057200/0.003232
CUDECOMP:       grid: 4 x 1, halo backend: NVSHMEM
CUDECOMP:       Total time min/max/avg/std [ms]: 0.050031/0.052668/0.051291/0.000742
CUDECOMP:       grid: 4 x 1, halo backend: NVSHMEM (blocking)
CUDECOMP:       Total time min/max/avg/std [ms]: 0.063190/0.067428/0.065849/0.001215
CUDECOMP: SELECTED: grid: 4 x 1, halo backend: NVSHMEM, Avg. time [s] 0.051291
CUDECOMP: halo autotuning time [s]: 0.227950

The first block shows the results for one tested configuration, in this case the MPI (i.e. CUDECOMP_HALO_COMM_MPI) halo communication backend operating on a \(4 \times 1\) process grid. The total time to complete the full set of halo exchanges for the user-selected pencil axis is reported, similar to the transpose trials. The autotuner proceeds to test the other halo communication backend options, selects the one achieving the lowest average trial time, and reports the selection in the final SELECTED line. In this case, it selected the NVSHMEM halo communication backend (i.e. CUDECOMP_HALO_COMM_NVSHMEM).

After the autotuning process is complete, the grid descriptor is created and ready to use. Entries in the configuration struct provided to cudecompGridDescCreate corresponding to autotuned fields (pdims, transpose_comm_backend, and halo_comm_backend) are updated to reflect the autotuning selections. Thus, one can run code like the following to inspect and report the final configuration used:

if (rank == 0) {
  printf("running on %d x %d process grid...\n", config.pdims[0], config.pdims[1]);
  printf("running using %s transpose backend...\n",
         cudecompTransposeCommBackendToString(config.transpose_comm_backend));
  printf("running using %s halo backend...\n",
         cudecompHaloCommBackendToString(config.halo_comm_backend));
}
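
For the autotuning run shown above, this would print output like the following (the exact backend names are produced by the string conversion functions):

running on 4 x 1 process grid...
running using NCCL transpose backend...
running using NVSHMEM halo backend...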

As autotuning only impacts grid descriptor creation, the rest of the usage of the library is unchanged from that illustrated in the basic usage section.