CuTe DSL GEMM Runner#

class CuteDslGemmRunner#

Runner for CuTe DSL compiled GEMM kernels; replaces cuBLAS for the Talker MLP.

Provides FP16 GEMM with FP32 accumulation: C = A @ B^T, where A is [M, K], B is [N, K], and C is [M, N], all row-major.

Multiple architecture variants are compiled AOT and selected at runtime based on GPU SM version:

  • Ampere (SM 80/86/87): cp.async + MmaF16BF16Op

  • Blackwell datacenter (SM 100/101/103/110): tcgen05 + TMA

  • Blackwell GeForce (SM 120/121): WGMMA + TMA
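The runtime selection above amounts to a switch on the SM version. The sketch below is illustrative only; `selectVariant` and the returned names are hypothetical and not part of the class:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of the SM-version -> variant mapping described above.
// The real selection logic lives inside CuteDslGemmRunner.
std::string selectVariant(int32_t smVersion) {
    switch (smVersion) {
        case 80: case 86: case 87:
            return "ampere";            // cp.async + MmaF16BF16Op
        case 100: case 101: case 103: case 110:
            return "blackwell_dc";      // tcgen05 + TMA
        case 120: case 121:
            return "blackwell_geforce"; // WGMMA + TMA
        default:
            return "";                  // no compiled variant: canImplement() is false
    }
}
```

An SM with no compiled variant (e.g. SM 90) maps to the empty string, which corresponds to canImplement() returning false.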

Public Static Functions

static bool canImplement(int32_t smVersion)#

Check if CuTe DSL GEMM can run on this GPU.

Parameters:

smVersion – GPU SM version (e.g. 87, 100, 121)

Returns:

true if a compiled GEMM variant exists for this SM

static bool loadKernelModule()#

Load the kernel module (thread-safe, idempotent).

static void unloadKernelModule()#

Unload the kernel module.

static bool run(
void const *aPtr,
void const *bPtr,
void *cPtr,
int32_t M,
int32_t N,
int32_t K,
cudaStream_t stream
)#

Execute GEMM: C[M,N] = A[M,K] @ B[N,K]^T.

All tensors are FP16, row-major. Accumulation is FP32.

Parameters:
  • aPtr – Input A [M, K] (FP16, K contiguous)

  • bPtr – Weight B [N, K] (FP16, K contiguous)

  • cPtr – Output C [M, N] (FP16, N contiguous)

  • M – Number of rows in A / C

  • N – Number of rows in B / columns in C

  • K – Inner dimension

  • stream – CUDA stream

Returns:

true on success; false if the kernel module is not loaded or no variant is available for this SM.
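The semantics of run() can be pinned down with a CPU reference. This is a sketch only: plain float stands in for FP16 storage, and `referenceGemm` is not part of the API.

```cpp
#include <cstdint>
#include <vector>

// CPU reference of the kernel's semantics: C[M,N] = A[M,K] @ B[N,K]^T,
// all row-major, accumulating in FP32. Note both A and B are indexed with
// K contiguous, so the B^T is implicit in the indexing.
void referenceGemm(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, int32_t M, int32_t N, int32_t K) {
    for (int32_t m = 0; m < M; ++m) {
        for (int32_t n = 0; n < N; ++n) {
            float acc = 0.0f;                       // FP32 accumulator
            for (int32_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[n * K + k]; // row m of A dot row n of B
            }
            C[m * N + n] = acc;                     // FP16 downcast in the real kernel
        }
    }
}
```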

static bool runBiasSiLU(
void const *aPtr,
void const *bPtr,
void *cPtr,
void const *biasPtr,
int32_t M,
int32_t N,
int32_t K,
cudaStream_t stream
)#

Execute fused GEMM + bias + SiLU: C = SiLU(A @ B^T + bias).

Uses the AOT-compiled fused-epilogue kernel on all architectures (Ampere, Blackwell datacenter, Blackwell GeForce). Falls back to a plain GEMM followed by a separate CUDA kernel if the fused variant is not compiled for the current architecture.

Parameters:

biasPtr – Bias vector [N] (FP16)

Returns:

true on success
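Per element, the fused epilogue applies SiLU(x) = x * sigmoid(x) to the bias-added accumulator. A scalar reference (illustrative; `biasSiLU` is not part of the API):

```cpp
#include <cmath>

// Reference for the fused epilogue applied to each FP32 accumulator value:
// out = SiLU(acc + bias), with SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
float biasSiLU(float acc, float bias) {
    float x = acc + bias;
    return x / (1.0f + std::exp(-x));
}
```

SiLU is zero at zero and approaches the identity for large positive inputs, which the fused kernel computes before the FP16 downcast of C.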

static bool runBias(
void const *aPtr,
void const *bPtr,
void *cPtr,
void const *biasPtr,
int32_t M,
int32_t N,
int32_t K,
cudaStream_t stream
)#

Execute fused GEMM + bias: C = A @ B^T + bias.

Uses the AOT-compiled fused-epilogue kernel on all architectures. Falls back to a plain GEMM followed by a separate CUDA kernel if the fused variant is not compiled.

Parameters:

biasPtr – Bias vector [N] (FP16)

Returns:

true on success