Moe Align Sum Kernels
- void trt_edgellm::kernel::launchCountSlotsPerExpertKernel(
- int32_t const *topkIndices,
- int32_t *slotsPerExpertWorkspace,
- int32_t numTokens,
- int32_t topK,
- int32_t numExperts,
- cudaStream_t stream)
Host launcher: counts the slots assigned to each expert using a shared-memory reduction. The padded number of experts and the experts-per-warp value are derived internally.
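As a rough illustration of what the counting step produces, here is a CPU reference sketch in plain C++. It is not the library's kernel. The helper name `countSlotsPerExpertRef` is hypothetical, and the sketch assumes that `topkIndices` is a flat row-major `[numTokens, topK]` array of expert ids, that a "slot" is one entry of that array, and that out-of-range ids are skipped.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference sketch (hypothetical helper, not part of trt_edgellm).
// Assumption: topkIndices is a flat row-major [numTokens, topK] array of expert ids.
std::vector<int32_t> countSlotsPerExpertRef(std::vector<int32_t> const& topkIndices,
    int32_t numTokens, int32_t topK, int32_t numExperts)
{
    assert(static_cast<int64_t>(topkIndices.size()) == static_cast<int64_t>(numTokens) * topK);

    std::vector<int32_t> counts(numExperts, 0);
    for (int32_t slot = 0; slot < numTokens * topK; ++slot)
    {
        int32_t const expert = topkIndices[slot];
        // Assumption: ids outside [0, numExperts) are skipped rather than counted.
        if (expert >= 0 && expert < numExperts)
        {
            ++counts[expert];
        }
    }
    return counts;
}
```

A reference like this can be handy for checking the GPU result on small inputs, since the per-expert totals do not depend on the order in which slots are visited.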
- void trt_edgellm::kernel::launchComputePaddedOffsetsKernel(
- int32_t const *counts,
- int32_t *paddedCounts,
- int32_t *paddedOffsets,
- int32_t *numTokensPostPadded,
- int32_t numExperts,
- int32_t moeBlockSize,
- cudaStream_t stream)
Host launcher: computes the padded offsets using a CUB BlockScan prefix sum.
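The padding arithmetic can be sketched on the CPU as follows. This is a guess at the semantics based on the parameter names, not a description of the actual kernel: it assumes each expert's count is rounded up to a multiple of `moeBlockSize`, that `paddedOffsets` is the exclusive prefix sum of the padded counts, and that `numTokensPostPadded` is their total. The struct `PaddedLayoutRef` and the function `computePaddedOffsetsRef` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the three outputs of the sketch.
struct PaddedLayoutRef
{
    std::vector<int32_t> paddedCounts;
    std::vector<int32_t> paddedOffsets;
    int32_t numTokensPostPadded;
};

// CPU reference sketch (not part of trt_edgellm).
// Assumption: counts are rounded up to a multiple of moeBlockSize and
// offsets are an exclusive prefix sum over the padded counts.
PaddedLayoutRef computePaddedOffsetsRef(std::vector<int32_t> const& counts, int32_t moeBlockSize)
{
    assert(moeBlockSize > 0);

    PaddedLayoutRef out;
    out.paddedCounts.resize(counts.size());
    out.paddedOffsets.resize(counts.size());

    int32_t running = 0;
    for (std::size_t e = 0; e < counts.size(); ++e)
    {
        // Round up to the next multiple of moeBlockSize (0 stays 0).
        int32_t const padded = (counts[e] + moeBlockSize - 1) / moeBlockSize * moeBlockSize;
        out.paddedCounts[e] = padded;
        out.paddedOffsets[e] = running; // exclusive scan: sum of all earlier padded counts
        running += padded;
    }
    out.numTokensPostPadded = running;
    return out;
}
```

Under these assumptions, counts of `{2, 1, 3, 0}` with a block size of 4 would give padded counts `{4, 4, 4, 0}`, offsets `{0, 4, 8, 12}`, and a padded total of 12.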
- void trt_edgellm::kernel::launchBuildSlotListsKernel(
- int32_t const *topkIndices,
- int32_t *slotsByExpertWorkspace,
- int32_t *slotsPerExpertWorkspace,
- int32_t numTokens,
- int32_t topK,
- int32_t numExperts,
- cudaStream_t stream)
Host launcher: builds the per-expert slot lists using atomic offsets. The number of SMs is queried from the device internally.
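Conceptually, building slot lists groups slot indices by the expert they route to. The CPU sketch below shows that grouping only. The memory layout of `slotsByExpertWorkspace` is not documented here, so the sketch returns one list per expert instead of guessing it. The name `buildSlotListsRef` is hypothetical, and the sketch again assumes a flat row-major `[numTokens, topK]` index array with out-of-range ids skipped.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference sketch (hypothetical helper, not part of trt_edgellm).
// Returns, for each expert, the flat slot indices that selected it.
std::vector<std::vector<int32_t>> buildSlotListsRef(std::vector<int32_t> const& topkIndices,
    int32_t numTokens, int32_t topK, int32_t numExperts)
{
    assert(static_cast<int64_t>(topkIndices.size()) == static_cast<int64_t>(numTokens) * topK);

    std::vector<std::vector<int32_t>> slotsByExpert(numExperts);
    for (int32_t slot = 0; slot < numTokens * topK; ++slot)
    {
        int32_t const expert = topkIndices[slot];
        // Assumption: ids outside [0, numExperts) are skipped.
        if (expert >= 0 && expert < numExperts)
        {
            // Sequential append; a parallel kernel would presumably reserve
            // the write position with an atomic increment instead.
            slotsByExpert[expert].push_back(slot);
        }
    }
    return slotsByExpert;
}
```

The sequential loop yields each list in ascending slot order. A kernel that reserves positions atomically would typically not guarantee that order, so a comparison against GPU output should sort each list first.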