transpose.h

Functions handling transposes.

Functions

void nvte_cast_transpose(const NVTETensor input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)

Cast and transpose the input.

This function casts the input and produces 2 results:

cast_output is the result of the cast
transposed_output is the transposed result of the cast.

Parameters

input – [in] Input tensor of shape [N, H].
cast_output – [inout] Result of the cast. Shape: [N, H].
transposed_output – [inout] Result of the cast and transpose. Shape: [H, N].
stream – [in] CUDA stream used for the operation.
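
The shape relationship between the two outputs can be illustrated with a small CPU reference. This is a minimal sketch, not the library's implementation: `reference_cast_transpose` is a hypothetical helper, tensors are assumed to be row-major host buffers, and the cast is modeled with a plain `static_cast` (any scaling or FP8 handling the real cast may apply is not modeled).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not part of the library.
// input is row-major [N, H]; cast_output is [N, H]; transposed_output is [H, N].
template <typename Out, typename In>
void reference_cast_transpose(const std::vector<In>& input, std::size_t N, std::size_t H,
                              std::vector<Out>& cast_output,
                              std::vector<Out>& transposed_output) {
  assert(input.size() == N * H);
  cast_output.resize(N * H);
  transposed_output.resize(H * N);
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      // Assumption: the cast is a plain type conversion.
      const Out v = static_cast<Out>(input[n * H + h]);
      cast_output[n * H + h] = v;        // same position as the input
      transposed_output[h * N + n] = v;  // element (n, h) lands at (h, n)
    }
  }
}
```

Both outputs are written from the same converted value, so the input is read only once per element.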

void nvte_transpose(const NVTETensor input, NVTETensor transposed_output, cudaStream_t stream)

Transpose the input.

Parameters

input – [in] Input tensor of shape [N, H].
transposed_output – [out] Result of the transpose. Shape: [H, N].
stream – [in] CUDA stream used for the operation.

void nvte_cast_transpose_dbias(const NVTETensor input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Cast and transpose the input. Additionally, reduce the input along the first dimension.

This function casts the input and produces 3 results:

cast_output is the result of the cast
transposed_output is the transposed result of the cast.
dbias is the result of the reduction of the input along the first dimension.

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters

input – [in] Input tensor of shape [N, H].
cast_output – [inout] Result of the cast. Shape: [N, H].
transposed_output – [inout] Result of the cast and transpose. Shape: [H, N].
dbias – [out] Result of the reduction of the input along the first dimension. Shape: [H].
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
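
The three results can be sketched on the CPU as follows. This is an illustrative reference under stated assumptions, not the library's kernel: `reference_cast_transpose_dbias` is a hypothetical name, buffers are row-major host vectors, the cast is a plain `static_cast`, and the reduction is assumed to accumulate the uncast input in the input type. The workspace is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not part of the library.
// input: [N, H]; cast_output: [N, H]; transposed_output: [H, N]; dbias: [H].
template <typename Out, typename In>
void reference_cast_transpose_dbias(const std::vector<In>& input, std::size_t N, std::size_t H,
                                    std::vector<Out>& cast_output,
                                    std::vector<Out>& transposed_output,
                                    std::vector<In>& dbias) {
  assert(input.size() == N * H);
  cast_output.resize(N * H);
  transposed_output.resize(H * N);
  dbias.assign(H, In(0));
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      const In x = input[n * H + h];
      dbias[h] += x;  // reduce the input (before the cast) along the first dimension
      const Out v = static_cast<Out>(x);  // assumption: plain type conversion
      cast_output[n * H + h] = v;
      transposed_output[h * N + n] = v;
    }
  }
}
```

Note that in this sketch `dbias` sums the original input values, so it can differ from the sum of the cast values when the cast loses precision.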

void nvte_fp8_transpose_dbias(const NVTETensor input, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Transpose the FP8 input. Additionally, reduce the input along the first dimension.

This function takes FP8 input and produces 2 results:

transposed_output is the transposed result of the input.
dbias is the result of the reduction of the input along the first dimension.

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters

input – [in] Input tensor of shape [N, H].
transposed_output – [inout] Result of the transpose. Shape: [H, N].
dbias – [out] Result of the reduction of the input along the first dimension. Shape: [H].
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
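
The empty-workspace convention implies a two-call pattern: a first call that only reports the workspace requirements, then a second call that does the work. The sketch below imitates that control flow with a mock and does not call the library: `MockWorkspace` and `mock_dbias_with_workspace` are hypothetical, and the workspace size chosen by the mock (one float per column) is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a workspace tensor: empty until sized by a query call.
struct MockWorkspace {
  std::vector<float> data;
  bool empty() const { return data.empty(); }
};

// Hypothetical operation that follows the documented convention.
// With an empty workspace it only records the required size and returns false.
// With a sized workspace it reduces input [N, H] along the first dimension.
bool mock_dbias_with_workspace(const std::vector<float>& input, std::size_t N, std::size_t H,
                               std::vector<float>& dbias, MockWorkspace& workspace) {
  assert(input.size() == N * H);
  if (workspace.empty()) {
    workspace.data.resize(H);  // mock-only requirement; assumes H > 0
    return false;              // query call: no result is produced
  }
  assert(workspace.data.size() >= H);
  for (std::size_t h = 0; h < H; ++h) workspace.data[h] = 0.0f;
  for (std::size_t n = 0; n < N; ++n)
    for (std::size_t h = 0; h < H; ++h) workspace.data[h] += input[n * H + h];
  dbias.resize(H);
  for (std::size_t h = 0; h < H; ++h) dbias[h] = workspace.data[h];
  return true;
}
```

A caller would invoke the function once with a default-constructed workspace, allocate whatever the query reported, and then invoke it again with identical arguments.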

void nvte_cast_transpose_dbias_dgelu(const NVTETensor input, const NVTETensor gelu_input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Compute backward of GELU operation on the input, then cast and transpose. Additionally, reduce the result of the GELU backward along the first dimension.

This function produces 3 results:

cast_output is equal to cast(dGELU(input))
transposed_output is equal to transpose(cast(dGELU(input)))
dbias is equal to reduce(dGELU(input), axis=0)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters

input – [in] Input tensor of shape [N, H].
gelu_input – [in] Tensor used as input to the forward of GELU operation. Shape: [N, H].
cast_output – [inout] Result of the cast. Shape: [N, H].
transposed_output – [inout] Result of the cast and transpose. Shape: [H, N].
dbias – [out] Result of the reduction of dGELU(input) along the first dimension. Shape: [H].
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
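
The three formulas above can be written out as a CPU reference. This is a sketch under assumptions, not the library's kernel: it treats `input` as the incoming gradient and computes dGELU as `input * GELU'(gelu_input)`, it uses the derivative of the exact erf-based GELU (the library may use a different variant, such as the tanh approximation), and the cast is a plain `static_cast`. The names `dgelu_exact_grad` and `reference_cast_transpose_dbias_dgelu` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Derivative of the exact (erf-based) GELU: Phi(x) + x * phi(x).
// Assumption: the exact form is used here purely for illustration.
inline double dgelu_exact_grad(double x) {
  const double pi = 3.14159265358979323846;
  const double cdf = 0.5 * (1.0 + std::erf(x / std::sqrt(2.0)));
  const double pdf = std::exp(-0.5 * x * x) / std::sqrt(2.0 * pi);
  return cdf + x * pdf;
}

// Hypothetical CPU reference, not part of the library.
// input, gelu_input: [N, H]; cast_output: [N, H]; transposed_output: [H, N]; dbias: [H].
template <typename Out>
void reference_cast_transpose_dbias_dgelu(const std::vector<double>& input,
                                          const std::vector<double>& gelu_input,
                                          std::size_t N, std::size_t H,
                                          std::vector<Out>& cast_output,
                                          std::vector<Out>& transposed_output,
                                          std::vector<double>& dbias) {
  assert(input.size() == N * H && gelu_input.size() == N * H);
  cast_output.resize(N * H);
  transposed_output.resize(H * N);
  dbias.assign(H, 0.0);
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      const std::size_t i = n * H + h;
      const double d = input[i] * dgelu_exact_grad(gelu_input[i]);  // dGELU(input)
      dbias[h] += d;                                                 // reduce(dGELU(input), axis=0)
      const Out v = static_cast<Out>(d);                             // cast(dGELU(input))
      cast_output[i] = v;
      transposed_output[h * N + n] = v;                              // transpose(cast(dGELU(input)))
    }
  }
}
```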

void nvte_multi_cast_transpose(size_t num_tensors, const NVTETensor *input_list, NVTETensor *cast_output_list, NVTETensor *transposed_output_list, cudaStream_t stream)

Cast and transpose multiple tensors.

This function casts each input tensor and produces 2 results:

cast_output is the result of the cast
transposed_output is the transposed result of the cast.

Parameters

num_tensors – [in] Number of tensors.
input_list – [in] List of 2D input tensors.
cast_output_list – [inout] List of cast tensors. Dimensions match those of the tensors in input_list.
transposed_output_list – [inout] List of cast and transposed tensors. Dimensions are the transpose of those of the tensors in input_list.
stream – [in] CUDA stream used for the operation.
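
The list form applies the same cast and transpose to each tensor independently, and each tensor may have its own 2D shape. A minimal CPU sketch of that behavior follows; `HostTensor2D` and `reference_multi_cast_transpose` are hypothetical host-side stand-ins (not `NVTETensor` or a library call), and the cast is again modeled as a plain `static_cast`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side 2D tensor, row-major. Not the library's NVTETensor.
template <typename T>
struct HostTensor2D {
  std::size_t rows;
  std::size_t cols;
  std::vector<T> data;
};

// Hypothetical CPU reference: process the three parallel lists element by element.
template <typename Out, typename In>
void reference_multi_cast_transpose(std::size_t num_tensors,
                                    const HostTensor2D<In>* input_list,
                                    HostTensor2D<Out>* cast_output_list,
                                    HostTensor2D<Out>* transposed_output_list) {
  for (std::size_t i = 0; i < num_tensors; ++i) {
    const HostTensor2D<In>& in = input_list[i];
    assert(in.data.size() == in.rows * in.cols);
    HostTensor2D<Out>& c = cast_output_list[i];
    HostTensor2D<Out>& t = transposed_output_list[i];
    c.rows = in.rows; c.cols = in.cols;  // same dimensions as the input
    t.rows = in.cols; t.cols = in.rows;  // transposed dimensions
    c.data.resize(in.rows * in.cols);
    t.data.resize(in.rows * in.cols);
    for (std::size_t r = 0; r < in.rows; ++r) {
      for (std::size_t col = 0; col < in.cols; ++col) {
        const Out v = static_cast<Out>(in.data[r * in.cols + col]);
        c.data[r * in.cols + col] = v;
        t.data[col * in.rows + r] = v;
      }
    }
  }
}
```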

void nvte_dgeglu_cast_transpose(const NVTETensor input, const NVTETensor geglu_input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)

Compute the backward of the GeGLU operation (dGeGLU) on the input, then cast and transpose the dGeGLU output.

This function produces 2 results:

cast_output is the result of casting the dGeGLU output.
transposed_output is the transposed result of the cast.

Parameters

input – [in] Input tensor of shape [N, H].
geglu_input – [in] Tensor used as input to the forward of GeGLU operation. Shape: [N, H * 2].
cast_output – [inout] Result of the cast. Shape: [N, H * 2].
transposed_output – [inout] Result of the cast and transpose. Shape: [H * 2, N].
stream – [in] CUDA stream used for the operation.
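
The shapes above ([N, H] gradient in, [N, H * 2] out) follow from GeGLU having two input halves per output element. The sketch below shows one way the backward could look on the CPU. It rests on several assumptions that the text does not confirm: the forward is taken to be `GELU(a) * b` with `a` the first H columns of `geglu_input` and `b` the last H columns, GELU is the exact erf-based form, the gradient for `a` is stored in the first H output columns, and the cast is a plain `static_cast`. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact (erf-based) GELU and its derivative; assumed variant, for illustration only.
inline double geglu_gelu(double x) {
  return 0.5 * x * (1.0 + std::erf(x / std::sqrt(2.0)));
}
inline double geglu_gelu_grad(double x) {
  const double pi = 3.14159265358979323846;
  const double cdf = 0.5 * (1.0 + std::erf(x / std::sqrt(2.0)));
  const double pdf = std::exp(-0.5 * x * x) / std::sqrt(2.0 * pi);
  return cdf + x * pdf;
}

// Hypothetical CPU reference, not part of the library.
// input (gradient): [N, H]; geglu_input: [N, 2H];
// cast_output: [N, 2H]; transposed_output: [2H, N].
template <typename Out>
void reference_dgeglu_cast_transpose(const std::vector<double>& input,
                                     const std::vector<double>& geglu_input,
                                     std::size_t N, std::size_t H,
                                     std::vector<Out>& cast_output,
                                     std::vector<Out>& transposed_output) {
  const std::size_t W = 2 * H;  // width of geglu_input and of the outputs
  assert(input.size() == N * H && geglu_input.size() == N * W);
  cast_output.resize(N * W);
  transposed_output.resize(W * N);
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      const double g = input[n * H + h];
      const double a = geglu_input[n * W + h];      // assumed activation half
      const double b = geglu_input[n * W + H + h];  // assumed gate half
      const Out da = static_cast<Out>(g * b * geglu_gelu_grad(a));
      const Out db = static_cast<Out>(g * geglu_gelu(a));
      cast_output[n * W + h] = da;
      cast_output[n * W + H + h] = db;
      transposed_output[h * N + n] = da;
      transposed_output[(H + h) * N + n] = db;
    }
  }
}
```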