Build Layout

void trt_edgellm::kernel::buildLayoutGpu(
MoELayoutBuffers &buf,
int32_t const *tokenSelectedExperts,
int32_t numTokens,
int32_t topK,
int32_t localNumExperts,
int32_t tileSize,
cudaStream_t stream
)

GPU-side layout builder implemented as a single-CTA kernel (~3-5 us). All device pointers in buf must be pre-allocated by the caller. tokenSelectedExperts must contain LOCAL expert indices in [0, localNumExperts).
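
The index-range precondition above can be checked on the host before launching, for example in a debug build. The following is a minimal sketch, not part of trt_edgellm: the helper name firstInvalidLocalExpert is hypothetical, and it assumes that the upper bound of the range is localNumExperts, that tokenSelectedExperts holds numTokens * topK entries, and that a host-readable copy of the array is available.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helper (not part of trt_edgellm).
// Returns the flat position of the first entry outside [0, localNumExperts),
// or -1 if every entry is a valid local expert index.
// Assumption: the array holds numTokens * topK entries in host-readable memory.
inline std::int64_t firstInvalidLocalExpert(std::int32_t const* tokenSelectedExperts,
                                            std::int32_t numTokens,
                                            std::int32_t topK,
                                            std::int32_t localNumExperts)
{
    // Widen before multiplying so large token counts cannot overflow int32_t.
    std::int64_t const total = static_cast<std::int64_t>(numTokens) * topK;
    for (std::int64_t i = 0; i < total; ++i)
    {
        std::int32_t const expert = tokenSelectedExperts[i];
        if (expert < 0 || expert >= localNumExperts)
        {
            return i; // first offending entry
        }
    }
    return -1; // all entries are in range
}
```

A caller could run this on a host copy of the routing output and skip the buildLayoutGpu call (or report an error) when the result is not -1. Global expert indices would need to be remapped to local ones before they pass this check.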