fp8_quant

Composable Triton JIT functions for FP8 (E4M3) fake quantization.

Counterpart of nvfp4_quant.py for per-tensor FP8. Used by the unified flash-attention kernel’s softmax-P quantize-dequantize (QDQ) step (common/attention/triton_fa.py).
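
For reference, a minimal pure-Python sketch of what per-tensor FP8 (E4M3) fake quantization computes. It is not the Triton code in this module. The function names (`round_to_e4m3`, `fake_quant_fp8_per_tensor`) are hypothetical. The sketch assumes the E4M3FN variant (largest finite value 448, no infinities), saturating clamp, round-to-nearest-even, and a scale of amax / 448; the actual kernels may choose the scale or rounding differently.

```python
import math

E4M3_MAX = 448.0          # largest finite E4M3FN value (assumed variant)
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal value, 2**-6
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3 value (ties to even), saturating at +/-448."""
    if v == 0.0 or math.isnan(v):
        return v
    a = min(abs(v), E4M3_MAX)
    # frexp returns a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    _, e = math.frexp(a)
    # Below 2**-6 the format is subnormal: the grid spacing stops shrinking.
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    # Python's round() rounds halves to even; dividing by a power of two is exact.
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, v)


def fake_quant_fp8_per_tensor(xs):
    """Quantize-dequantize a list of floats using one scale for the whole tensor."""
    amax = max((abs(x) for x in xs), default=0.0)
    if amax == 0.0:
        return list(xs)
    scale = amax / E4M3_MAX  # assumed scale choice: map amax onto 448
    return [round_to_e4m3(x / scale) * scale for x in xs]
```

With amax equal to 448 the scale is 1, so `fake_quant_fp8_per_tensor([448.0, 1.0, -17.0])` leaves 448 and 1 unchanged and moves -17 to -16, its nearest E4M3 neighbor under ties-to-even. The output stays in the input's float type; only the values are snapped to the FP8 grid, which is what "fake" quantization means here.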