cuDecomp C API

These are all the types and functions available in the cuDecomp C API.

Types

Internal types

cudecompHandle_t

typedef struct cudecompHandle *cudecompHandle_t: A pointer to a cuDecomp internal handle structure.

cudecompGridDesc_t

typedef struct cudecompGridDesc *cudecompGridDesc_t: A pointer to a cuDecomp internal grid descriptor structure.

Grid Descriptor Configuration

cudecompGridDescConfig_t

struct cudecompGridDescConfig_t

A data structure defining configuration options for grid descriptor creation.

Public Members

int32_t gdims[3]: dimensions of global data grid

int32_t gdims_dist[3]: dimensions of global data grid to use for distribution

int32_t pdims[2]: dimensions of process grid

cudecompTransposeCommBackend_t transpose_comm_backend: communication backend to use for transpose communication (default: CUDECOMP_TRANSPOSE_COMM_MPI_P2P)

bool transpose_axis_contiguous[3]: flag (by axis) indicating if memory should be contiguous along pencil axis (default: [false, false, false])

int32_t transpose_mem_order[3][3]: user-specified memory ordering by axis, overrides transpose_axis_contiguous setting; first index specifies axis, second index specifies memory order setting (default: unset)

cudecompHaloCommBackend_t halo_comm_backend: communication backend to use for halo communication (default: CUDECOMP_HALO_COMM_MPI)

cudecompGridDescAutotuneOptions_t

struct cudecompGridDescAutotuneOptions_t

A data structure defining autotuning options for grid descriptor creation.

Public Members

int32_t n_warmup_trials: number of warmup trials to run for each tested configuration during autotuning (default: 3)

int32_t n_trials: number of timed trials to run for each tested configuration during autotuning (default: 5)

cudecompAutotuneGridMode_t grid_mode: which communication (transpose/halo) to use to autotune process grid (default: CUDECOMP_AUTOTUNE_GRID_TRANSPOSE)

cudecompDataType_t dtype: datatype to use during autotuning (default: CUDECOMP_DOUBLE)

bool allow_uneven_decompositions: flag to control whether autotuning allows process grids that result in uneven distributions of elements across processes (default: true)

bool disable_nccl_backends: flag to disable NCCL backend options during autotuning (default: false)

bool disable_nvshmem_backends: flag to disable NVSHMEM backend options during autotuning (default: false)

double skip_threshold: threshold used to skip testing slow configurations; skip configuration if skip_threshold * t > t_best, where t is the duration of the first timed trial for the configuration and t_best is the average trial time of the current best configuration (default: 0.0)

bool autotune_transpose_backend: flag to enable transpose backend autotuning (default: false)

bool transpose_use_inplace_buffers[4]: flag to control whether transpose autotuning uses in-place or out-of-place buffers during autotuning by transpose operation, considering the following order: X-to-Y, Y-to-Z, Z-to-Y, Y-to-X (default: [false, false, false, false])

double transpose_op_weights[4]: multiplicative weight to apply to trial time contribution by transpose operation in the following order: X-to-Y, Y-to-Z, Z-to-Y, Y-to-X (default: [1.0, 1.0, 1.0, 1.0])

int32_t transpose_input_halo_extents[4][3]: input_halo_extents argument to use during autotuning by transpose operation; first index specifies operation in the following order: X-to-Y, Y-to-Z, Z-to-Y, Y-to-X, second index specifies halo_extent argument (default: all zeros, no halos)

int32_t transpose_output_halo_extents[4][3]: output_halo_extents argument to use during autotuning by transpose operation; first index specifies operation in the following order: X-to-Y, Y-to-Z, Z-to-Y, Y-to-X, second index specifies halo_extent argument (default: all zeros, no halos)

int32_t transpose_input_padding[4][3]: input_padding argument to use during autotuning by transpose operation; first index specifies operation in the following order: X-to-Y, Y-to-Z, Z-to-Y, Y-to-X, second index specifies input_padding argument (default: all zeros, no padding)

int32_t transpose_output_padding[4][3]: output_padding argument to use during autotuning by transpose operation; first index specifies operation in the following order: X-to-Y, Y-to-Z, Z-to-Y, Y-to-X, second index specifies output_padding argument (default: all zeros, no padding)

bool autotune_halo_backend: flag to enable halo backend autotuning (default: false)

int32_t halo_extents[3]: extents for halo autotuning (default: [0, 0, 0])

bool halo_periods[3]: periodicity for halo autotuning (default: [false, false, false])

int32_t halo_axis: which axis pencils to use for halo autotuning (default: 0, X-pencils)

int32_t halo_padding[3]: padding argument for halo autotuning (default: [0, 0, 0])
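The skip_threshold rule and the transpose_op_weights weighting can be sketched in plain C. This is an illustrative reading of the documented behavior, not cuDecomp's internal code. The function names should_skip_config and weighted_trial_time are hypothetical, and summing the weighted per-operation times is an assumption about how the weights are combined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Documented skip rule: skip a configuration if skip_threshold * t > t_best,
 * where t is the first timed trial of the configuration and t_best is the
 * average trial time of the current best configuration. */
bool should_skip_config(double skip_threshold, double t, double t_best) {
  return skip_threshold * t > t_best;
}

/* Assumed reading of transpose_op_weights: the trial time of each operation
 * (X-to-Y, Y-to-Z, Z-to-Y, Y-to-X) is scaled by its weight and summed. */
double weighted_trial_time(const double op_times[4], const double op_weights[4]) {
  assert(op_times != NULL && op_weights != NULL);
  double total = 0.0;
  for (int i = 0; i < 4; ++i) {
    total += op_weights[i] * op_times[i];
  }
  return total;
}
```

With the default skip_threshold of 0.0 the product is zero, so no configuration is skipped as long as t_best is positive. A weight of 0.0 removes an operation from the comparison entirely.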

Pencil Information

cudecompPencilInfo_t

struct cudecompPencilInfo_t

A data structure containing geometry information about a pencil data buffer.

Public Members

int32_t shape[3]: pencil shape (in local order, including halo and padding elements)

int32_t lo[3]: lower bound coordinates (in local order, excluding halo and padding elements)

int32_t hi[3]: upper bound coordinates (in local order, excluding halo and padding elements)

int32_t order[3]: data layout order (e.g. 2,1,0 means memory is ordered Z,Y,X)

int32_t halo_extents[3]: halo extents by dimension (in global order)

int32_t padding[3]: padding by dimension (in global order)

int64_t size: number of elements in pencil (including halo and padding elements)

Communication Backends

cudecompTransposeCommBackend_t

enum cudecompTransposeCommBackend_t

This enum lists the different available transpose backend options.

Values:

enumerator CUDECOMP_TRANSPOSE_COMM_MPI_P2P: MPI backend using peer-to-peer algorithm (i.e.,MPI_Isend/MPI_Irecv)

enumerator CUDECOMP_TRANSPOSE_COMM_MPI_P2P_PL: MPI backend using peer-to-peer algorithm with pipelining.

enumerator CUDECOMP_TRANSPOSE_COMM_MPI_A2A: MPI backend using MPI_Alltoallv.

enumerator CUDECOMP_TRANSPOSE_COMM_NCCL: NCCL backend.

enumerator CUDECOMP_TRANSPOSE_COMM_NCCL_PL: NCCL backend with pipelining.

enumerator CUDECOMP_TRANSPOSE_COMM_NVSHMEM: NVSHMEM backend.

enumerator CUDECOMP_TRANSPOSE_COMM_NVSHMEM_PL: NVSHMEM backend with pipelining.

cudecompHaloCommBackend_t

enum cudecompHaloCommBackend_t

This enum lists the different available halo backend options.

Values:

enumerator CUDECOMP_HALO_COMM_MPI: MPI backend.

enumerator CUDECOMP_HALO_COMM_MPI_BLOCKING: MPI backend with blocking between each peer transfer.

enumerator CUDECOMP_HALO_COMM_NCCL: NCCL backend.

enumerator CUDECOMP_HALO_COMM_NVSHMEM: NVSHMEM backend.

enumerator CUDECOMP_HALO_COMM_NVSHMEM_BLOCKING: NVSHMEM backend with blocking between each peer transfer.

Additional Enumerators

cudecompDataType_t

enum cudecompDataType_t

This enum defines the data types supported.

Values:

enumerator CUDECOMP_FLOAT: Single-precision real.

enumerator CUDECOMP_DOUBLE: Double-precision real.

enumerator CUDECOMP_FLOAT_COMPLEX: Single-precision complex (interleaved).

enumerator CUDECOMP_DOUBLE_COMPLEX: Double-precision complex (interleaved).

cudecompAutotuneGridMode_t

enum cudecompAutotuneGridMode_t

This enum defines the modes available for process grid autotuning.

Values:

enumerator CUDECOMP_AUTOTUNE_GRID_TRANSPOSE: Use transpose communication to autotune process grid dimensions.

enumerator CUDECOMP_AUTOTUNE_GRID_HALO: Use halo communication to autotune process grid dimensions.

cudecompResult_t

enum cudecompResult_t

This enum defines the possible return values from cuDecomp. Most functions in the cuDecomp library return one of these values to indicate whether an operation completed successfully or an error occurred.

Values:

enumerator CUDECOMP_RESULT_SUCCESS: The operation completed successfully.

enumerator CUDECOMP_RESULT_INVALID_USAGE: A user error, typically an invalid argument.

enumerator CUDECOMP_RESULT_NOT_SUPPORTED: A user error, requesting an invalid or unsupported operation configuration.

enumerator CUDECOMP_RESULT_INTERNAL_ERROR: An internal library error, should be reported.

enumerator CUDECOMP_RESULT_CUDA_ERROR: An error occurred in the CUDA Runtime.

enumerator CUDECOMP_RESULT_CUTENSOR_ERROR: An error occurred in the cuTENSOR library.

enumerator CUDECOMP_RESULT_MPI_ERROR: An error occurred in the MPI library.

enumerator CUDECOMP_RESULT_NCCL_ERROR: An error occurred in the NCCL library.

enumerator CUDECOMP_RESULT_NVSHMEM_ERROR: An error occurred in the NVSHMEM library.

enumerator CUDECOMP_RESULT_NVML_ERROR: An error occurred in the NVML library.
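The descriptions above group the result codes into three kinds: user errors, internal library errors, and errors from dependent libraries. A caller might triage them as in the following sketch. The enum sketch_result_t is a stand-in for cudecompResult_t so that the snippet compiles without cudecomp.h; its enumerators follow the order of the list above, and its numeric values are an assumption. The names result_hint_t and classify_result are hypothetical.

```c
#include <assert.h>

/* Stand-in for cudecompResult_t. Enumerator order follows the documented
 * list; the numeric values are assumed, not taken from cudecomp.h. */
typedef enum {
  SKETCH_RESULT_SUCCESS,
  SKETCH_RESULT_INVALID_USAGE,
  SKETCH_RESULT_NOT_SUPPORTED,
  SKETCH_RESULT_INTERNAL_ERROR,
  SKETCH_RESULT_CUDA_ERROR,
  SKETCH_RESULT_CUTENSOR_ERROR,
  SKETCH_RESULT_MPI_ERROR,
  SKETCH_RESULT_NCCL_ERROR,
  SKETCH_RESULT_NVSHMEM_ERROR,
  SKETCH_RESULT_NVML_ERROR
} sketch_result_t;

typedef enum {
  HINT_OK,               /* nothing to do */
  HINT_USER_ERROR,       /* check arguments and configuration */
  HINT_REPORT_BUG,       /* internal library error, should be reported */
  HINT_DEPENDENCY_ERROR  /* CUDA, cuTENSOR, MPI, NCCL, NVSHMEM, or NVML */
} result_hint_t;

/* Classify a result code according to the documented descriptions. */
result_hint_t classify_result(sketch_result_t r) {
  assert(r >= SKETCH_RESULT_SUCCESS && r <= SKETCH_RESULT_NVML_ERROR);
  switch (r) {
  case SKETCH_RESULT_SUCCESS:
    return HINT_OK;
  case SKETCH_RESULT_INVALID_USAGE:
  case SKETCH_RESULT_NOT_SUPPORTED:
    return HINT_USER_ERROR;
  case SKETCH_RESULT_INTERNAL_ERROR:
    return HINT_REPORT_BUG;
  default:
    return HINT_DEPENDENCY_ERROR;
  }
}
```

In real code the same switch would be written against the CUDECOMP_RESULT_* enumerators, typically inside a small checking macro wrapped around every cuDecomp call.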

Functions

Library Initialization/Finalization

cudecompInit

cudecompResult_t cudecompInit(cudecompHandle_t *handle, MPI_Comm mpi_comm)

Initializes the cuDecomp library from an existing MPI communicator.

Parameters:

handle – [out] A pointer to an uninitialized cudecompHandle_t
mpi_comm – [in] MPI communicator containing ranks to use with cuDecomp

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompFinalize

cudecompResult_t cudecompFinalize(cudecompHandle_t handle)

Finalizes the cuDecomp library and frees associated resources.

Parameters:

handle – [in] The initialized cuDecomp library handle

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

Grid Descriptor Management

cudecompGridDescCreate

cudecompResult_t cudecompGridDescCreate(cudecompHandle_t handle, cudecompGridDesc_t *grid_desc, cudecompGridDescConfig_t *config, const cudecompGridDescAutotuneOptions_t *options)

Creates a cuDecomp grid descriptor for use with cuDecomp functions.

This function creates a grid descriptor that cuDecomp requires for most library operations that perform communication or query decomposition information. This grid descriptor contains information about how the global data grid is distributed and other internal resources to facilitate communication.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [out] A pointer to an uninitialized cudecompGridDesc_t
config – [inout] A pointer to a populated cudecompGridDescConfig_t structure. This config structure defines the required attributes of the decomposition. On successful exit, fields in this structure may be updated to reflect autotuning results.
options – [in] A pointer to cudecompGridDescAutotuneOptions_t structure. This options structure is used to control the behavior of the process grid and communication backend autotuning. If autotuning is not desired, a NULL pointer can be passed in for this argument.

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompGridDescDestroy

cudecompResult_t cudecompGridDescDestroy(cudecompHandle_t handle, cudecompGridDesc_t grid_desc)

Destroys a cuDecomp grid descriptor and frees associated resources.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompGridDescConfigSetDefaults

cudecompResult_t cudecompGridDescConfigSetDefaults(cudecompGridDescConfig_t *config)

Initializes a cudecompGridDescConfig_t structure with default values.

This function initializes entries in a cuDecomp grid descriptor configuration structure to default values.

Parameters:

config – [inout] A pointer to cudecompGridDescConfig_t structure

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompGridDescAutotuneOptionsSetDefaults

cudecompResult_t cudecompGridDescAutotuneOptionsSetDefaults(cudecompGridDescAutotuneOptions_t *options)

Initializes a cudecompGridDescAutotuneOptions_t structure with default values.

This function initializes entries in a cuDecomp grid descriptor autotune options structure to default values.

Parameters:

options – [inout] A pointer to cudecompGridDescAutotuneOptions_t structure

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

Workspace Management

cudecompGetTransposeWorkspaceSize

cudecompResult_t cudecompGetTransposeWorkspaceSize(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, int64_t *workspace_size)

Queries the required transpose workspace size, in elements, for a provided grid descriptor.

This function queries the required workspace size, in elements, for transposition communication using a provided grid descriptor. This workspace is required to facilitate local transposition/packing/unpacking operations, or for use as a staging buffer.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
workspace_size – [out] A pointer to a 64-bit integer to write the workspace size

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompGetHaloWorkspaceSize

cudecompResult_t cudecompGetHaloWorkspaceSize(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, int32_t axis, const int32_t halo_extents[], int64_t *workspace_size)

Queries the required halo workspace size, in elements, for a provided grid descriptor.

This function queries the required workspace size, in elements, for halo communication using a provided grid descriptor. This workspace is required to facilitate local packing operations for halo regions that are not contiguous in memory, or for use as a staging buffer.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
axis – [in] The domain axis the desired pencil is aligned with
halo_extents – [in] An array of three integers to define halo region extents of the pencil, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side).
workspace_size – [out] A pointer to a 64-bit integer to write the workspace size

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompGetDataTypeSize

cudecompResult_t cudecompGetDataTypeSize(cudecompDataType_t dtype, int64_t *dtype_size)

Function to get size (in bytes) of a cuDecomp data type.

Parameters:

dtype – [in] A cudecompDataType_t value
dtype_size – [out] A pointer to a 64-bit integer to write the data type size

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.
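The workspace size queries report sizes in elements, while cudecompMalloc takes a size in bytes, so the two have to be combined with the element size of the data type. The sketch below shows that arithmetic in plain C. The enum sketch_dtype_t and the functions sketch_dtype_size and workspace_bytes are hypothetical stand-ins; the listed sizes are what one would expect from cudecompGetDataTypeSize on common platforms (interleaved complex values occupy two reals), and real code should call the library function rather than hard-code them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for cudecompDataType_t so the sketch compiles without cudecomp.h. */
typedef enum {
  SKETCH_FLOAT,
  SKETCH_DOUBLE,
  SKETCH_FLOAT_COMPLEX,
  SKETCH_DOUBLE_COMPLEX
} sketch_dtype_t;

/* Expected element size in bytes for each data type. */
int64_t sketch_dtype_size(sketch_dtype_t dtype) {
  switch (dtype) {
  case SKETCH_FLOAT: return (int64_t)sizeof(float);
  case SKETCH_DOUBLE: return (int64_t)sizeof(double);
  case SKETCH_FLOAT_COMPLEX: return 2 * (int64_t)sizeof(float);
  case SKETCH_DOUBLE_COMPLEX: return 2 * (int64_t)sizeof(double);
  }
  return 0; /* unknown data type */
}

/* Convert a workspace size in elements into a byte count for allocation. */
size_t workspace_bytes(int64_t workspace_elems, int64_t dtype_size) {
  assert(workspace_elems >= 0 && dtype_size > 0);
  return (size_t)workspace_elems * (size_t)dtype_size;
}
```

For example, a transpose workspace of 1000 elements used with double-precision complex data needs 16000 bytes.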

cudecompMalloc

cudecompResult_t cudecompMalloc(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void **buffer, size_t buffer_size_bytes)

Allocation function for cuDecomp workspaces.

This function should be used to allocate cuDecomp workspaces. It will select an appropriate allocator based on the communication backend information found in the provided grid descriptor. At the current time, only NVSHMEM-enabled backends require a special allocation (using nvshmem_malloc). This function is collective and should be called on all workers to avoid deadlocks. Additionally, any memory allocated using this function is invalidated if the provided grid descriptor is destroyed, so care should be taken to free memory allocated using this function before the provided grid descriptor is destroyed.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
buffer – [out] A pointer to the allocated memory
buffer_size_bytes – [in] The size of requested allocation, in bytes

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompFree

cudecompResult_t cudecompFree(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *buffer)

Deallocation function for cuDecomp workspaces.

This function should be used to deallocate memory allocated with cudecompMalloc. It will select an appropriate deallocation function based on the communication backend information found in the provided grid descriptor. At the current time, only NVSHMEM-enabled backends require a special deallocation (using nvshmem_free). This function is collective and should be called on all workers to avoid deadlocks.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
buffer – [in] A pointer to the memory to be deallocated

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

Helper Functions

cudecompGetPencilInfo

cudecompResult_t cudecompGetPencilInfo(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, cudecompPencilInfo_t *pencil_info, int32_t axis, const int32_t halo_extents[], const int32_t padding[])

Collects geometry information about assigned pencils, by domain axis.

This function queries information about the pencil assigned to the calling worker for the given axis. This information is collected in a cudecompPencilInfo_t structure, which can be used to access and manipulate data within the user-allocated memory buffer.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A created cuDecomp grid descriptor
pencil_info – [out] A pointer to a cudecompPencilInfo_t structure
axis – [in] The domain axis the desired pencil is aligned with
halo_extents – [in] An array of three integers to define halo region extents of the pencil, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side). If no halo regions are necessary, a NULL pointer can be provided in place of this array.
padding – [in] An array of three integers to define padding of the pencil, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If no padding is necessary, a NULL pointer can be provided in place of this array.

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.
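Because halo_extents and padding are given in global order while the pencil shape is reported in local order, the order array is needed to connect the two. The following sketch shows how a local shape could be derived from interior extents. It assumes that a halo extent h adds 2 * h elements along its axis and that padding is added once; the function name shape_with_halos is hypothetical, and the real values should always be taken from cudecompGetPencilInfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the local pencil shape (in local/memory order) from the interior
 * extents, given halo_extents and padding in global order.
 * order[i] is the global axis stored in local dimension i.
 * Assumes halos add 2 * halo_extents elements and padding adds once.
 * Either array may be NULL, meaning all zeros. Returns the element count. */
int64_t shape_with_halos(const int32_t interior[3], const int32_t order[3],
                         const int32_t halo_extents[], const int32_t padding[],
                         int32_t shape[3]) {
  assert(interior != NULL && order != NULL && shape != NULL);
  int64_t size = 1;
  for (int i = 0; i < 3; ++i) {
    int32_t axis = order[i]; /* global axis of local dimension i */
    int32_t halo = halo_extents ? halo_extents[axis] : 0;
    int32_t pad = padding ? padding[axis] : 0;
    shape[i] = interior[i] + 2 * halo + pad;
    size *= shape[i];
  }
  return size;
}
```

With interior extents {8, 4, 2}, order {1, 0, 2}, and global halo extents {1, 0, 2}, the local shape becomes {8, 6, 6}: local dimension 1 stores global axis 0 and gains two elements, and local dimension 2 stores global axis 2 and gains four.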

cudecompTransposeCommBackendToString

const char *cudecompTransposeCommBackendToString(cudecompTransposeCommBackend_t comm_backend)

Function to get string name of transpose communication backend.

Parameters:

comm_backend – [in] A cudecompTransposeCommBackend_t value

Returns:

A string representation of the transpose communication backend. Will return string “ERROR” if invalid backend value is provided.

cudecompHaloCommBackendToString

const char *cudecompHaloCommBackendToString(cudecompHaloCommBackend_t comm_backend)

Function to get string name of halo communication backend.

Parameters:

comm_backend – [in] A cudecompHaloCommBackend_t value

Returns:

A string representation of the halo communication backend. Will return string “ERROR” if invalid backend value is provided.

cudecompGetGridDescConfig

cudecompResult_t cudecompGetGridDescConfig(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, cudecompGridDescConfig_t *config)

Queries the configuration used to create a grid descriptor.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] cuDecomp grid descriptor
config – [out] A pointer to a cudecompGridDescConfig_t structure.

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompGetShiftedRank

cudecompResult_t cudecompGetShiftedRank(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, int32_t axis, int32_t dim, int32_t displacement, bool periodic, int32_t *shifted_rank)

Function to retrieve the global rank of neighboring processes.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
axis – [in] The domain axis the pencil is aligned with
dim – [in] Which pencil dimension (global indexed) to retrieve neighboring rank
displacement – [in] Displacement of neighboring rank to retrieve. For example, 1 will retrieve the +1-th neighbor rank along dim, while -1 will retrieve the -1-th neighbor rank.
periodic – [in] A boolean flag to indicate whether dim should be treated periodically
shifted_rank – [out] A pointer to an integer to write the global rank of the requested neighbor. For non-periodic cases, a value of -1 will be written if the displacement results in a position outside the global domain.

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.
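The effect of displacement and periodic can be pictured with a one-dimensional coordinate shift. The sketch below models only the coordinate arithmetic along a single process-grid dimension of size n; how coordinates map to global ranks is internal to cuDecomp and should be obtained through cudecompGetShiftedRank. The function name shifted_coord is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shift a process coordinate along one process-grid dimension of size n.
 * Returns the neighbor coordinate, or -1 when the shift leaves a
 * non-periodic domain (mirroring the documented -1 result). */
int32_t shifted_coord(int32_t coord, int32_t n, int32_t displacement, bool periodic) {
  assert(n > 0 && coord >= 0 && coord < n);
  int32_t target = coord + displacement;
  if (periodic) {
    /* Double modulo keeps the result non-negative for negative shifts. */
    return ((target % n) + n) % n;
  }
  return (target < 0 || target >= n) ? -1 : target;
}
```

On a dimension with four processes, a displacement of -1 from coordinate 0 wraps to coordinate 3 in the periodic case and yields -1 otherwise.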

Transposition Functions

cudecompTransposeXToY

cudecompResult_t cudecompTransposeXToY(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *output, void *work, cudecompDataType_t dtype, const int32_t input_halo_extents[], const int32_t output_halo_extents[], const int32_t input_padding[], const int32_t output_padding[], cudaStream_t stream)

Function to transpose data from X-axis aligned pencils to Y-axis aligned pencils.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [in] A pointer to the memory buffer to read input X-axis aligned pencil data
output – [out] A pointer to the memory buffer to write output Y-axis aligned pencil data. If input and output are the same, operation is performed in-place
work – [in] A pointer to the transpose workspace memory
dtype – [in] The cuDecomp datatype to use for the transpose operation
input_halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side). If the input has no halo regions, a NULL pointer can be provided.
output_halo_extents – [in] Similar to input_halo_extents, but for the output data. If the output has no halo regions, a NULL pointer can be provided.
input_padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
output_padding – [in] Similar to input_padding, but for the output data. If the output has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompTransposeYToZ

cudecompResult_t cudecompTransposeYToZ(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *output, void *work, cudecompDataType_t dtype, const int32_t input_halo_extents[], const int32_t output_halo_extents[], const int32_t input_padding[], const int32_t output_padding[], cudaStream_t stream)

Function to transpose data from Y-axis aligned pencils to Z-axis aligned pencils.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [in] A pointer to the memory buffer to read input Y-axis aligned pencil data
output – [out] A pointer to the memory buffer to write output Z-axis aligned pencil data. If input and output are the same, operation is performed in-place
work – [in] A pointer to the transpose workspace memory
dtype – [in] The cuDecomp datatype to use for the transpose operation
input_halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side). If the input has no halo regions, a NULL pointer can be provided.
output_halo_extents – [in] Similar to input_halo_extents, but for the output data. If the output has no halo regions, a NULL pointer can be provided.
input_padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
output_padding – [in] Similar to input_padding, but for the output data. If the output has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompTransposeZToY

cudecompResult_t cudecompTransposeZToY(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *output, void *work, cudecompDataType_t dtype, const int32_t input_halo_extents[], const int32_t output_halo_extents[], const int32_t input_padding[], const int32_t output_padding[], cudaStream_t stream)

Function to transpose data from Z-axis aligned pencils to Y-axis aligned pencils.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [in] A pointer to the memory buffer to read input Z-axis aligned pencil data
output – [out] A pointer to the memory buffer to write output Y-axis aligned pencil data. If input and output are the same, operation is performed in-place
work – [in] A pointer to the transpose workspace memory
dtype – [in] The cuDecomp datatype to use for the transpose operation
input_halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side). If the input has no halo regions, a NULL pointer can be provided.
output_halo_extents – [in] Similar to input_halo_extents, but for the output data. If the output has no halo regions, a NULL pointer can be provided.
input_padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
output_padding – [in] Similar to input_padding, but for the output data. If the output has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompTransposeYToX

cudecompResult_t cudecompTransposeYToX(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *output, void *work, cudecompDataType_t dtype, const int32_t input_halo_extents[], const int32_t output_halo_extents[], const int32_t input_padding[], const int32_t output_padding[], cudaStream_t stream)

Function to transpose data from Y-axis aligned pencils to X-axis aligned pencils.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [in] A pointer to the memory buffer to read input Y-axis aligned pencil data
output – [out] A pointer to the memory buffer to write output X-axis aligned pencil data. If input and output are the same, operation is performed in-place
work – [in] A pointer to the transpose workspace memory
dtype – [in] The cuDecomp datatype to use for the transpose operation
input_halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side). If the input has no halo regions, a NULL pointer can be provided.
output_halo_extents – [in] Similar to input_halo_extents, but for the output data. If the output has no halo regions, a NULL pointer can be provided.
input_padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
output_padding – [in] Similar to input_padding, but for the output data. If the output has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.
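When input and output point to the same buffer, that single buffer has to be large enough for both the input pencil and the output pencil of every transpose it takes part in. A minimal sketch of the sizing, assuming the three arguments are the size fields of cudecompPencilInfo_t queried for the X, Y, and Z axes (the function name inplace_buffer_elems is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Element count for one buffer reused across X-, Y-, and Z-pencil layouts:
 * it must hold the largest of the three pencils. */
int64_t inplace_buffer_elems(int64_t size_x, int64_t size_y, int64_t size_z) {
  assert(size_x >= 0 && size_y >= 0 && size_z >= 0);
  int64_t m = size_x > size_y ? size_x : size_y;
  return m > size_z ? m : size_z;
}
```

The result is in elements; multiplying by the data type size gives the byte count to allocate. The transpose workspace passed as work is sized separately through cudecompGetTransposeWorkspaceSize.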

Halo Exchange Functions

cudecompUpdateHalosX

cudecompResult_t cudecompUpdateHalosX(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *work, cudecompDataType_t dtype, const int32_t halo_extents[], const bool halo_periods[], int32_t dim, const int32_t padding[], cudaStream_t stream)

Function to perform halo communication of X-axis aligned pencil data.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [inout] A pointer to the memory buffer to read input X-axis aligned pencil data. On successful completion, this buffer will contain the input X-axis aligned pencil data with the specified halo regions updated.
work – [in] A pointer to the halo workspace memory
dtype – [in] The cuDecomp datatype to use for the halo operation
halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side).
halo_periods – [in] An array of three booleans to define halo periodicity of the input data, in global order. If the i-th entry in this array is true, the domain is treated periodically along the i-th global domain axis. A NULL pointer can be provided if none of the domain axes are periodic.
dim – [in] Which pencil dimension (global indexed) to perform the halo update
padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompUpdateHalosY

cudecompResult_t cudecompUpdateHalosY(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *work, cudecompDataType_t dtype, const int32_t halo_extents[], const bool halo_periods[], int32_t dim, const int32_t padding[], cudaStream_t stream)

Function to perform halo communication of Y-axis aligned pencil data.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [inout] A pointer to the memory buffer to read input Y-axis aligned pencil data. On successful completion, this buffer will contain the input Y-axis aligned pencil data with the specified halo regions updated.
work – [in] A pointer to the halo workspace memory
dtype – [in] The cuDecomp datatype to use for the halo operation
halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side).
halo_periods – [in] An array of three booleans to define halo periodicity of the input data, in global order. If the i-th entry in this array is true, the domain is treated periodically along the i-th global domain axis.
dim – [in] Which pencil dimension (global indexed) to perform the halo update
padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.

cudecompUpdateHalosZ

cudecompResult_t cudecompUpdateHalosZ(cudecompHandle_t handle, cudecompGridDesc_t grid_desc, void *input, void *work, cudecompDataType_t dtype, const int32_t halo_extents[], const bool halo_periods[], int32_t dim, const int32_t padding[], cudaStream_t stream)

Function to perform halo communication of Z-axis aligned pencil data.

Parameters:

handle – [in] The initialized cuDecomp library handle
grid_desc – [in] A cuDecomp grid descriptor
input – [inout] A pointer to the memory buffer to read input Z-axis aligned pencil data. On successful completion, this buffer will contain the input Z-axis aligned pencil data with the specified halo regions updated.
work – [in] A pointer to the halo workspace memory
dtype – [in] The cuDecomp datatype to use for the halo operation
halo_extents – [in] An array of three integers to define halo region extents of the input data, in global order. The i-th entry in this array should contain the number of halo elements (per direction) expected along the i-th global domain axis. Symmetric halos are assumed (e.g. a value of one in halo_extents means there are 2 halo elements, one element on each side).
halo_periods – [in] An array of three booleans to define halo periodicity of the input data, in global order. If the i-th entry in this array is true, the domain is treated periodically along the i-th global domain axis.
dim – [in] Which pencil dimension (global indexed) to perform the halo update
padding – [in] An array of three integers to define padding of the input data, in global order. The i-th entry in this array should contain the number of elements to treat as padding in the i-th global domain axis. If the input has no padding, a NULL pointer can be provided.
stream – [in] CUDA stream to enqueue GPU operations into

Returns:

CUDECOMP_RESULT_SUCCESS on success or error code on failure.