p_qdq

Softmax-P quant-dequant helpers for the unified flash attention kernel.

These @triton.jit helpers fake-quantize the softmax probabilities P before the P @ V matmul (BMM2); they are the in-kernel counterpart of the p_bmm_quantizer config. The baseline flash-attention kernel in common/attention/triton_fa.py calls them conditionally under the P_QDQ constexpr guard, following the same composition pattern as the sparsity helpers in sparsity/attention/skip_softmax_helpers.py.

Only NVFP4 needs a P-specific helper (for its tiling policy and block amaxes); the per-tensor FP8 mode uses quantization/common/fp8_quant.fp8_scalar_qdq directly. Three properties make the NVFP4 path P-specific. First, the kernel’s online-softmax p is unnormalized and bounded: 0 <= p <= 1, since the max-subtraction caps every entry at exp2(0) = 1, so 1 is the theoretical upper bound of its amax. Second, because p is non-negative, block amaxes need no abs. Third, the NVFP4 scale blocks of 16 run along the key dimension, which is the contraction axis of P @ V. The caller (attention()) converts the amax to the global_scale below.
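
The properties above can be illustrated with a small host-side sketch. It is plain Python, not the Triton helper itself, and the names unnormalized_p, round_to_e2m1, and fake_quant_row are hypothetical, not part of this module. It assumes the standard NVFP4 value grid (E2M1) and simplifies two things: the per-block scale is kept in full precision instead of being encoded in FP8 against a global_scale, and rounding ties go to the smaller grid value rather than following hardware rounding.

```python
# Non-negative FP4 (E2M1) values; p >= 0, so the sign bit is never needed.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0
BLOCK = 16  # NVFP4 scale-block size, taken along the key dimension


def unnormalized_p(scores):
    """exp2(s - rowmax): every entry is in [0, 1]; the row max maps to exactly 1."""
    m = max(scores)
    return [2.0 ** (s - m) for s in scores]


def round_to_e2m1(x):
    """Nearest grid value; ties go to the smaller value (a simplification)."""
    return min(E2M1_GRID, key=lambda g: abs(g - x))


def fake_quant_row(p_row, block=BLOCK):
    """Quantize-dequantize one row of P in blocks along the key dimension."""
    out = []
    for start in range(0, len(p_row), block):
        chunk = p_row[start:start + block]
        amax = max(chunk)  # no abs: entries are non-negative
        if amax == 0.0:
            out.extend([0.0] * len(chunk))
            continue
        # Assumption: full-precision block scale (real NVFP4 stores it in FP8).
        scale = amax / E2M1_MAX
        out.extend(round_to_e2m1(v / scale) * scale for v in chunk)
    return out
```

With this scaling, each block's largest entry lands on the top grid value and is reproduced (up to floating-point rounding), while the other entries in the block move by at most amax / 6, since the widest gap in the grid is between 4 and 6.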