p_qdq
Softmax-P quant-dequant helpers for the unified flash attention kernel.
These @triton.jit helpers fake-quantize the softmax probabilities P
before the P @ V matmul (BMM2) — the in-kernel counterpart of the
p_bmm_quantizer config. They are called conditionally from the baseline
flash-attention kernel in common/attention/triton_fa.py under the
P_QDQ constexpr guard, following the same composition pattern as
the sparsity helpers in sparsity/attention/skip_softmax_helpers.py.
Only NVFP4 needs a P-specific helper (tiling policy and block amaxes); the
per-tensor FP8 mode uses quantization/common/fp8_quant.fp8_scalar_qdq
directly. Three things are P-specific here. First, the kernel’s
online-softmax p is unnormalized and bounded: 0 <= p <= 1, because the
max-subtraction caps every entry at exp2(0) = 1, so its amax can never
exceed 1. Second, p is non-negative, so block amaxes need no abs.
Third, the NVFP4 scale blocks of 16 run along the key dimension, which
is the contraction axis of P @ V. The caller (attention()) converts the
amax to the global_scale below.
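
The properties described above can be illustrated with a small host-side
sketch in plain Python. It is not the Triton helper itself: the function
names (unnormalized_p, block_amaxes, fake_quant_block) are hypothetical,
the block scale is kept in full precision rather than being encoded
against a global_scale, and rounding is simply to the nearest
non-negative E2M1 value. It only shows why p stays in [0, 1], why the
block amax needs no abs, and how blocks of 16 tile the key axis.

```python
BLOCK = 16  # NVFP4 scale-block size along the key (contraction) axis
# Non-negative values representable in FP4 E2M1.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def unnormalized_p(scores):
    """One query row of p = exp2(s - rowmax); every entry lies in [0, 1]."""
    row_max = max(scores)
    return [2.0 ** (s - row_max) for s in scores]


def block_amaxes(p_row, block=BLOCK):
    """Per-block maxima along the key axis; no abs() because p >= 0."""
    return [max(p_row[i:i + block]) for i in range(0, len(p_row), block)]


def fake_quant_block(values, amax):
    """Quantize-dequantize one block so that amax maps to 6.0 on the grid.

    Assumption: the scale stays in full precision (no FP8 scale encoding).
    """
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / 6.0
    return [
        min(E2M1_GRID, key=lambda g: abs(v / scale - g)) * scale
        for v in values
    ]


# Illustrative scores for one query row with 32 keys (two blocks of 16).
scores = [0.25 * (k % 7) - 0.5 * (k // BLOCK) for k in range(32)]
p = unnormalized_p(scores)
amaxes = block_amaxes(p)
p_qdq = [
    q
    for b, amax in enumerate(amaxes)
    for q in fake_quant_block(p[b * BLOCK:(b + 1) * BLOCK], amax)
]
```

Because the widest gap on the E2M1 grid is 2 (between 4 and 6), the
round-trip error inside a block is at most one scale step, that is
amax / 6, under the simplifications stated above.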