Matmul (Blackwell)

This tutorial shows how to implement a high-performance matrix multiplication kernel (C = A x B^T) targeting NVIDIA Blackwell GPUs using Tilus.
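For orientation, the operation being computed can be written as a tiny pure-Python reference. This is only a sketch of the math, not the Tilus kernel: the function name `matmul_abt` is hypothetical, and it assumes A has shape (M, K) and B has shape (N, K), which is what C = A x B^T implies.

```python
def matmul_abt(a, b):
    """Compute C = A x B^T for row-major nested lists.

    Assumed shapes: a is (M, K), b is (N, K); the result is (M, N).
    """
    m, k = len(a), len(a[0])
    n = len(b)
    # Both operands must share the reduction dimension K.
    assert all(len(row) == k for row in a)
    assert all(len(row) == k for row in b)
    # C[i][j] is the dot product of row i of A with row j of B.
    return [
        [sum(a[i][p] * b[j][p] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # M=3, K=2
b = [[1.0, 1.0], [0.0, 2.0]]              # N=2, K=2
c = matmul_abt(a, b)
print(c)  # [[3.0, 4.0], [7.0, 8.0], [11.0, 12.0]]
```

Because B is indexed by row on both sides of the dot product, both operands are read along their contiguous K dimension.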

Starting from a minimal working kernel, each version introduces one new Blackwell feature or optimization technique. By the final version, the kernel reaches vendor-library-level performance. The figure below shows the progression: V0 starts at ~491 TFLOPS, and each subsequent version narrows the gap to cuBLAS until V6 matches it at ~1610 TFLOPS. All kernels and the benchmark script for reproducing these results are in examples/blackwell_matmul/.

Figure (plot_all.svg): Blackwell matmul performance on B200 (M=N=K=8192, fp16). TFLOPS are derived from NCU profiling. Peak TFLOPS is estimated from cuBLAS tensor core utilization (96.6%).
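To relate the TFLOPS figures to kernel runtimes, the arithmetic can be sketched as follows. This assumes the usual convention of 2*M*N*K floating-point operations per matmul (one multiply and one add per inner-product term); the exact derivation used with the NCU measurements is not stated here, and the helper names are hypothetical.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m, k) x (k, n) matmul, 2*m*n*k convention."""
    return 2 * m * n * k


def matmul_tflops(m, n, k, seconds):
    """Achieved throughput in TFLOPS for a kernel that ran in `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12


def runtime_ms(m, n, k, tflops):
    """Kernel runtime in milliseconds implied by a given TFLOPS figure."""
    return matmul_flops(m, n, k) / (tflops * 1e12) * 1e3


M = N = K = 8192
print(matmul_flops(M, N, K))                 # 1099511627776 (2**40)
print(round(runtime_ms(M, N, K, 491), 3))    # 2.239 ms at ~491 TFLOPS (V0)
print(round(runtime_ms(M, N, K, 1610), 3))   # 0.683 ms at ~1610 TFLOPS (V6)
```

Under this convention, going from V0 to V6 corresponds to roughly a 3.3x reduction in runtime for this problem size.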

Versions

0. A Minimal Blackwell Matmul
1. TMA Loads and TMA Epilogue
2. Multi-Stage Software Pipelining
3. Warp Specialization
4. Tile Rasterization and Pipeline Abstraction
5. CLC Persistent Kernel and Pipelined Epilogue
6. 2-CTA Cluster