Kernels#
-
namespace trt_edgellm
Enums
Functions
- inline bool moeW4a4DecodeIsValidThreadBlockSize(
- int const thread_block_size
Supported
thread_block_sizevalues for W4A4 decode GEMV kernels (multiple of warp size).
- inline int nemotronMoeW4A16DecodeThreadBlockSizeForInterDim(
- int const inter_dim
Picks W4A16 decode block size from
inter_dim:largest of 256, 128, 96, 64 that dividesinter_dim.- Returns:
0 if
inter_dimis not divisible by 64.
- inline int nemotronMoeW4A4DecodeThreadBlockSizeForDims(
- int const hidden_dim,
- int const inter_dim
Picks a
thread_block_sizevalid for moeW4a4DecodeIsValidThreadBlockSize that divides bothhidden_dimandinter_dim. W4A4 decode uses one block size for up (strips alonghidden_dim) and down (strips alonginter_dim).- Returns:
0 if no candidate divides both dimensions.
- inline int moeDecodeGemvTopkGridDim(
- int const batch_size,
- int const top_k,
- int const inter_dim,
- int const thread_block_size
Grid x-dimension for MoE decode GEMV with explicit top-k routing: one block per (token row, top-k slot,
thread_block_size-sizedstrip ofinter_dim).inter_dimmust be a positive multiple ofthread_block_size. Token rows arebatch* seq_len (flattened row-major[batch, seq_len, …]).
- inline int moeDecodeGemvTopkGridDimBatchSeq(
- int const batch,
- int const seq_len,
- int const top_k,
- int const strip_dim,
- int const thread_block_size
Same as moeDecodeGemvTopkGridDim with
num_tokens=batch*seq_len. Passstrip_dim=hidden_dimfor W4A16 up (strips along hidden) orinter_dimfor W4A16 down (strips along intermediate).
- inline int moeW4a4DecodeUpGridDim(
- int const batch,
- int const seq_len,
- int const top_k,
- int const hidden_dim,
- int const thread_block_size
Grid x-dimension for W4A4 decode up kernel: strips along
hidden_dim(same as launchNemotronMoeW4A4DecodeUpGemvCuda).num_tokens=batch*seq_len(flattened row-major tokens).
- inline int moeDecodeGemvTopkThreads(
- int const batch_size,
- int const top_k,
- int const inter_dim
Per-intermediate-index MACs for top-k MoE decode GEMV:
batch_size* top_k * inter_dim.
- inline int64_t nemotronMoeW4A16InterBufferNumElems(
- int batch_size,
- int top_k,
- int inter_dim
Element count for W4A16/W4A4 split intermediate tensor
[batch * seq_len, top_k, inter_dim] (row-major); stored as FP16 between up and down.
- inline int64_t nemotronMoeW4A16UpFp16ScratchBytes(
- int batch_size,
- int top_k,
- int inter_dim
Device scratch bytes: FP16 row-major
[batch * seq_len, top_k, inter_dim] (up-proj dotz), passed to down-proj as__half*.
Variables
-
int kDefaultMlpW4a4DecodeThreadBlockSize = 128#
Default CUDA block size for W4A4 decode: one block size must divide both
hidden_dim(up strips) andinter_dim(down strips); see nemotronMoeW4A4DecodeThreadBlockSizeForDims.hidden_dimmust be divisible by 64.
-
int kMaxDecodingKernelWarpCount = 16#
Upper bound on warps per block for MoE decode GEMV shared scratch (
accumulateNvfp4GemvTileWarpReduceinmarlin_template.cuh). Must be at least256/ 32 for current largestthread_block_size.