Moe Align Sum Kernels

void trt_edgellm::kernel::launchCountSlotsPerExpertKernel(
int32_t const *topkIndices,
int32_t *slotsPerExpertWorkspace,
int32_t numTokens,
int32_t topK,
int32_t numExperts,
cudaStream_t stream
)

Host launcher that counts the slots assigned to each expert using a shared-memory reduction. The padded expert count and the experts-per-warp value are derived internally.
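The counting step can be illustrated with a small CPU reference. This is a minimal sketch and not the library's implementation: `countSlotsPerExpertRef` is a hypothetical helper, and it assumes that `topkIndices` is a row-major `[numTokens, topK]` array of expert ids and that ids outside `[0, numExperts)` are skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference; not part of trt_edgellm.
// Assumption: topkIndices is a row-major [numTokens, topK] array of expert ids,
// so every element is one (token, k) slot.
std::vector<int32_t> countSlotsPerExpertRef(
    std::vector<int32_t> const& topkIndices, int32_t numTokens, int32_t topK, int32_t numExperts)
{
    assert(topkIndices.size() == static_cast<std::size_t>(numTokens) * static_cast<std::size_t>(topK));

    std::vector<int32_t> counts(static_cast<std::size_t>(numExperts), 0);
    for (int32_t const expert : topkIndices)
    {
        // Assumption: ids outside [0, numExperts) are ignored.
        if (expert >= 0 && expert < numExperts)
        {
            ++counts[static_cast<std::size_t>(expert)];
        }
    }
    return counts;
}
```

A reference like this may be useful for checking the contents of `slotsPerExpertWorkspace` after copying it back to the host, provided the layout assumption above holds.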

void trt_edgellm::kernel::launchComputePaddedOffsetsKernel(
int32_t const *counts,
int32_t *paddedCounts,
int32_t *paddedOffsets,
int32_t *numTokensPostPadded,
int32_t numExperts,
int32_t moeBlockSize,
cudaStream_t stream
)

Host launcher that computes the padded offsets with a CUB BlockScan prefix sum.
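The padding and prefix-sum arithmetic can be sketched on the CPU as follows. The exact semantics are assumptions made for illustration: each count is rounded up to a multiple of `moeBlockSize`, `paddedOffsets` is an exclusive prefix sum with one entry per expert, and `numTokensPostPadded` is the padded total. `PaddedLayoutRef` and `computePaddedOffsetsRef` are hypothetical names, and a plain sequential loop stands in for the CUB BlockScan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the three outputs; not part of trt_edgellm.
struct PaddedLayoutRef
{
    std::vector<int32_t> paddedCounts;
    std::vector<int32_t> paddedOffsets;
    int32_t numTokensPostPadded{0};
};

// Hypothetical CPU reference. Assumptions: counts are rounded up to a multiple
// of moeBlockSize, and offsets are an exclusive prefix sum of the padded counts.
PaddedLayoutRef computePaddedOffsetsRef(std::vector<int32_t> const& counts, int32_t moeBlockSize)
{
    assert(moeBlockSize > 0);

    PaddedLayoutRef out;
    out.paddedCounts.reserve(counts.size());
    out.paddedOffsets.reserve(counts.size());

    int32_t running = 0;
    for (int32_t const count : counts)
    {
        // Round up to the next multiple of moeBlockSize (0 stays 0).
        int32_t const padded = (count + moeBlockSize - 1) / moeBlockSize * moeBlockSize;
        out.paddedCounts.push_back(padded);
        out.paddedOffsets.push_back(running); // exclusive: sum of earlier experts only
        running += padded;
    }
    out.numTokensPostPadded = running;
    return out;
}
```

Under these assumptions, counts of `{0, 5, 3}` with a block size of 4 give padded counts `{0, 8, 4}`, offsets `{0, 0, 8}`, and a padded total of 12.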

void trt_edgellm::kernel::launchBuildSlotListsKernel(
int32_t const *topkIndices,
int32_t *slotsByExpertWorkspace,
int32_t *slotsPerExpertWorkspace,
int32_t numTokens,
int32_t topK,
int32_t numExperts,
cudaStream_t stream
)

Host launcher that builds the per-expert slot lists using atomic offsets. The number of SMs is queried from the device internally.
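The idea of scattering slots through per-expert write offsets can be sketched on the CPU. This is an illustration under stated assumptions, not the kernel's actual layout: `buildSlotListsRef` is a hypothetical helper, a slot is taken to be the flattened index `token * topK + k`, the output is packed contiguously by expert, and a sequential per-expert cursor stands in for the atomic increment a GPU kernel would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference; not part of trt_edgellm.
// Assumptions: a slot is the flattened index token * topK + k, and the output
// packs each expert's slots contiguously, in expert order.
std::vector<int32_t> buildSlotListsRef(
    std::vector<int32_t> const& topkIndices, int32_t numTokens, int32_t topK, int32_t numExperts)
{
    assert(topkIndices.size() == static_cast<std::size_t>(numTokens) * static_cast<std::size_t>(topK));

    // Pass 1: count slots per expert.
    std::vector<int32_t> cursor(static_cast<std::size_t>(numExperts), 0);
    for (int32_t const expert : topkIndices)
    {
        if (expert >= 0 && expert < numExperts)
        {
            ++cursor[static_cast<std::size_t>(expert)];
        }
    }

    // Exclusive scan: turn each count into the expert's first write position.
    int32_t running = 0;
    for (int32_t& c : cursor)
    {
        int32_t const n = c;
        c = running;
        running += n;
    }

    // Pass 2: scatter. The post-increment plays the role of an atomic add.
    std::vector<int32_t> slotsByExpert(static_cast<std::size_t>(running), 0);
    int32_t const numSlots = numTokens * topK;
    for (int32_t slot = 0; slot < numSlots; ++slot)
    {
        int32_t const expert = topkIndices[static_cast<std::size_t>(slot)];
        if (expert >= 0 && expert < numExperts)
        {
            int32_t const pos = cursor[static_cast<std::size_t>(expert)]++;
            slotsByExpert[static_cast<std::size_t>(pos)] = slot;
        }
    }
    return slotsByExpert;
}
```

On the CPU the slots within each expert come out in ascending order. With atomic increments on a GPU, the order within an expert generally depends on thread scheduling, so a comparison against a reference like this one would likely need to sort each expert's range first.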