layer_norm.h

LayerNorm functions.

Functions

void nvte_layernorm_fwd(const NVTETensor x, const NVTETensor gamma, const NVTETensor beta, const float epsilon, NVTETensor z, NVTETensor mu, NVTETensor rsigma, cudaStream_t stream, const int multiprocessorCount, NVTETensor workspace, NVTETensor barrier)

Compute LayerNorm on the input.

The formula used:

\[ y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}\gamma + \beta \]

Calling this function with workspace and barrier set to empty tensors does not perform the operation; instead, it sets the shape and type of the workspace and barrier tensors to the required values.

Parameters
  • x[in] Input tensor of shape [N, H].

  • gamma[in] Gamma tensor of shape [H].

  • beta[in] Beta tensor of shape [H].

  • epsilon[in] Value added to denominator for numerical stability.

  • z[inout] Output tensor of shape [N, H]; receives the value \(y\) from the formula above.

  • mu[out] Mean of the input calculated over the last dimension. Shape: [N].

  • rsigma[out] Inverse of the standard deviation of the input, \(1/\sqrt{Var[x] + \varepsilon}\), calculated over the last dimension. Shape: [N].

  • stream[in] CUDA stream used for the operation.

  • multiprocessorCount[in] Number of SMs in the device.

  • workspace[out] Workspace tensor.

  • barrier[out] Barrier tensor.
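
To make the formula and the three outputs concrete, here is a minimal CPU reference sketch. It is not the library's kernel: the name layernorm_fwd_ref and the use of std::vector are hypothetical, and it assumes row-major [N, H] data, a biased variance (divide by H), and that rsigma holds \(1/\sqrt{Var[x] + \varepsilon}\).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for the documented formula (hypothetical helper, not the CUDA kernel).
// x is row-major [N, H]; gamma and beta are [H].
// Outputs: z [N, H], mu [N], rsigma [N].
void layernorm_fwd_ref(const std::vector<float>& x,
                       const std::vector<float>& gamma,
                       const std::vector<float>& beta,
                       float epsilon, std::size_t N, std::size_t H,
                       std::vector<float>& z,
                       std::vector<float>& mu,
                       std::vector<float>& rsigma) {
  z.assign(N * H, 0.0f);
  mu.assign(N, 0.0f);
  rsigma.assign(N, 0.0f);
  for (std::size_t n = 0; n < N; ++n) {
    const float* row = x.data() + n * H;

    // Mean over the last dimension.
    float mean = 0.0f;
    for (std::size_t h = 0; h < H; ++h) mean += row[h];
    mean /= static_cast<float>(H);

    // Biased variance over the last dimension (assumption: divide by H).
    float var = 0.0f;
    for (std::size_t h = 0; h < H; ++h) {
      const float d = row[h] - mean;
      var += d * d;
    }
    var /= static_cast<float>(H);

    const float rs = 1.0f / std::sqrt(var + epsilon);
    mu[n] = mean;
    rsigma[n] = rs;

    // Normalize, then apply the per-feature scale and shift.
    for (std::size_t h = 0; h < H; ++h) {
      z[n * H + h] = (row[h] - mean) * rs * gamma[h] + beta[h];
    }
  }
}
```

A reference like this is mainly useful for checking the GPU results (z, mu, rsigma) on small inputs within a floating-point tolerance.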

void nvte_layernorm1p_fwd(const NVTETensor x, const NVTETensor gamma, const NVTETensor beta, const float epsilon, NVTETensor z, NVTETensor mu, NVTETensor rsigma, cudaStream_t stream, const int multiprocessorCount, NVTETensor workspace, NVTETensor barrier)

Compute LayerNorm with zero-centered gamma on the input.

The formula used:

\[ y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}(1 + \gamma) + \beta \]

Calling this function with workspace and barrier set to empty tensors does not perform the operation; instead, it sets the shape and type of the workspace and barrier tensors to the required values.

Parameters
  • x[in] Input tensor of shape [N, H].

  • gamma[in] Gamma tensor of shape [H].

  • beta[in] Beta tensor of shape [H].

  • epsilon[in] Value added to denominator for numerical stability.

  • z[inout] Output tensor of shape [N, H]; receives the value \(y\) from the formula above.

  • mu[out] Mean of the input calculated over the last dimension. Shape: [N].

  • rsigma[out] Inverse of the standard deviation of the input, \(1/\sqrt{Var[x] + \varepsilon}\), calculated over the last dimension. Shape: [N].

  • stream[in] CUDA stream used for the operation.

  • multiprocessorCount[in] Number of SMs in the device.

  • workspace[out] Workspace tensor.

  • barrier[out] Barrier tensor.
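
The only difference from the plain variant is the scale factor \(1 + \gamma\), so a gamma tensor of zeros leaves the normalized input unscaled. The sketch below illustrates this on the CPU. The name layernorm1p_fwd_ref is hypothetical, only z is produced, and the same assumptions apply as for any simple reference: row-major [N, H] data and a biased variance.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for LayerNorm with zero-centered gamma (hypothetical helper).
// x is row-major [N, H]; gamma and beta are [H]; returns z of shape [N, H].
std::vector<float> layernorm1p_fwd_ref(const std::vector<float>& x,
                                       const std::vector<float>& gamma,
                                       const std::vector<float>& beta,
                                       float epsilon,
                                       std::size_t N, std::size_t H) {
  std::vector<float> z(N * H, 0.0f);
  for (std::size_t n = 0; n < N; ++n) {
    const float* row = x.data() + n * H;

    float mean = 0.0f;
    for (std::size_t h = 0; h < H; ++h) mean += row[h];
    mean /= static_cast<float>(H);

    // Biased variance (assumption: divide by H).
    float var = 0.0f;
    for (std::size_t h = 0; h < H; ++h) {
      const float d = row[h] - mean;
      var += d * d;
    }
    var /= static_cast<float>(H);

    const float rs = 1.0f / std::sqrt(var + epsilon);
    for (std::size_t h = 0; h < H; ++h) {
      // The stored gamma is an offset from 1: the effective scale is (1 + gamma).
      z[n * H + h] = (row[h] - mean) * rs * (1.0f + gamma[h]) + beta[h];
    }
  }
  return z;
}
```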

void nvte_layernorm_bwd(const NVTETensor dz, const NVTETensor x, const NVTETensor mu, const NVTETensor rsigma, const NVTETensor gamma, NVTETensor dx, NVTETensor dgamma, NVTETensor dbeta, NVTETensor dgamma_part, NVTETensor dbeta_part, cudaStream_t stream, const int multiprocessorCount, NVTETensor workspace, NVTETensor barrier)

Compute backward of LayerNorm.

This function computes the gradient of function:

\[ y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}\gamma + \beta \]
with respect to \(x\), \(\gamma\) and \(\beta\).

Calling this function with workspace, barrier, dgamma_part and dbeta_part set to empty tensors does not perform the operation; instead, it sets the shape and type of these tensors to the required values.

Parameters
  • dz[in] Incoming gradient tensor of shape [N, H].

  • x[in] Forward input tensor of shape [N, H].

  • mu[in] Mean of the input calculated over the last dimension. Shape: [N].

  • rsigma[in] Inverse of the standard deviation of the input, \(1/\sqrt{Var[x] + \varepsilon}\), calculated over the last dimension. Shape: [N].

  • gamma[in] Gamma tensor of shape [H].

  • dx[out] Output gradient of shape [N, H].

  • dgamma[out] Gradient for gamma tensor of shape [H].

  • dbeta[out] Gradient for beta tensor of shape [H].

  • dgamma_part[out] Storage for partial gamma gradient.

  • dbeta_part[out] Storage for partial beta gradient.

  • stream[in] CUDA stream used for the operation.

  • multiprocessorCount[in] Number of SMs in the device.

  • workspace[out] Workspace tensor.

  • barrier[out] Barrier tensor.
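
The gradients this function returns can be sketched with a minimal CPU reference. This is the standard LayerNorm backward written out directly, not the library's kernel: the name layernorm_bwd_ref is hypothetical, the partial-gradient, workspace and barrier tensors are not modeled, and it assumes row-major [N, H] data, a biased variance, and that rsigma holds \(1/\sqrt{Var[x] + \varepsilon}\).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for the LayerNorm backward pass (hypothetical helper).
// dz, x are row-major [N, H]; mu, rsigma are [N]; gamma is [H].
// Outputs: dx [N, H], dgamma [H], dbeta [H].
void layernorm_bwd_ref(const std::vector<float>& dz,
                       const std::vector<float>& x,
                       const std::vector<float>& mu,
                       const std::vector<float>& rsigma,
                       const std::vector<float>& gamma,
                       std::size_t N, std::size_t H,
                       std::vector<float>& dx,
                       std::vector<float>& dgamma,
                       std::vector<float>& dbeta) {
  dx.assign(N * H, 0.0f);
  dgamma.assign(H, 0.0f);
  dbeta.assign(H, 0.0f);
  for (std::size_t n = 0; n < N; ++n) {
    const float rs = rsigma[n];
    float mean_g = 0.0f;   // row mean of dz * gamma
    float mean_gx = 0.0f;  // row mean of dz * gamma * x_hat

    for (std::size_t h = 0; h < H; ++h) {
      const std::size_t i = n * H + h;
      const float x_hat = (x[i] - mu[n]) * rs;
      const float g = dz[i] * gamma[h];
      mean_g += g;
      mean_gx += g * x_hat;
      // Parameter gradients are reduced over the N rows.
      dgamma[h] += dz[i] * x_hat;
      dbeta[h] += dz[i];
    }
    mean_g /= static_cast<float>(H);
    mean_gx /= static_cast<float>(H);

    for (std::size_t h = 0; h < H; ++h) {
      const std::size_t i = n * H + h;
      const float x_hat = (x[i] - mu[n]) * rs;
      const float g = dz[i] * gamma[h];
      dx[i] = rs * (g - mean_g - x_hat * mean_gx);
    }
  }
}
```

Each row of dx sums to zero, because the normalized output does not change when a constant is added to every element of a row; this makes a convenient sanity check.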

void nvte_layernorm1p_bwd(const NVTETensor dz, const NVTETensor x, const NVTETensor mu, const NVTETensor rsigma, const NVTETensor gamma, NVTETensor dx, NVTETensor dgamma, NVTETensor dbeta, NVTETensor dgamma_part, NVTETensor dbeta_part, cudaStream_t stream, const int multiprocessorCount, NVTETensor workspace, NVTETensor barrier)

Compute backward of LayerNorm with zero-centered gamma.

This function computes the gradient of function:

\[ y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}(1 + \gamma) + \beta \]
with respect to \(x\), \(\gamma\) and \(\beta\).

Calling this function with workspace, barrier, dgamma_part and dbeta_part set to empty tensors does not perform the operation; instead, it sets the shape and type of these tensors to the required values.

Parameters
  • dz[in] Incoming gradient tensor of shape [N, H].

  • x[in] Forward input tensor of shape [N, H].

  • mu[in] Mean of the input calculated over the last dimension. Shape: [N].

  • rsigma[in] Inverse of the standard deviation of the input, \(1/\sqrt{Var[x] + \varepsilon}\), calculated over the last dimension. Shape: [N].

  • gamma[in] Gamma tensor of shape [H].

  • dx[out] Output gradient of shape [N, H].

  • dgamma[out] Gradient for gamma tensor of shape [H].

  • dbeta[out] Gradient for beta tensor of shape [H].

  • dgamma_part[out] Storage for partial gamma gradient.

  • dbeta_part[out] Storage for partial beta gradient.

  • stream[in] CUDA stream used for the operation.

  • multiprocessorCount[in] Number of SMs in the device.

  • workspace[out] Workspace tensor.

  • barrier[out] Barrier tensor.
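
As a sketch of what the zero-centered variant changes in the backward pass, the gradients can be written out explicitly. The index notation (\(n\) over rows, \(h\) and \(k\) over the last dimension) and the symbols \(\hat{x}\), \(r\) and \(g\) are introduced here for illustration; the derivation assumes a biased variance over the last dimension and that rsigma stores \(r_n\).

\[ \hat{x}_{n,h} = (x_{n,h} - \mu_n)\, r_n, \qquad r_n = \frac{1}{\sqrt{Var[x_n] + \varepsilon}}, \qquad g_{n,h} = dz_{n,h}\,(1 + \gamma_h) \]

\[ d\beta_h = \sum_{n} dz_{n,h}, \qquad d\gamma_h = \sum_{n} dz_{n,h}\,\hat{x}_{n,h} \]

\[ dx_{n,h} = r_n \left( g_{n,h} - \frac{1}{H}\sum_{k} g_{n,k} - \hat{x}_{n,h}\,\frac{1}{H}\sum_{k} g_{n,k}\,\hat{x}_{n,k} \right) \]

Because \(\partial(1 + \gamma)/\partial\gamma = 1\), the expressions for \(d\gamma\) and \(d\beta\) are the same as for plain LayerNorm; only \(g\), and therefore \(dx\), picks up the factor \(1 + \gamma_h\) in place of \(\gamma_h\).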