v_qdq

Value-operand NVFP4 helpers for flash attention.

Functions

fake_quant_v_onwrite

NVFP4-finalize complete block-16 groups in [v_lo, v_hi) in place.

fake_quant_v_onwrite(v_cache, block_table, v_lo, v_hi, *, max_new_tokens, page_size=16, v_qdq_scale=1.0, decode=False)

NVFP4-finalize complete block-16 groups in [v_lo, v_hi) in place.

max_new_tokens is host-side metadata used to size the masked launch grid. In decode mode the grid uses one fixed group per request. In eager prefill the grid is sized, without reading device metadata, to cover every group that the largest query chunk can complete. v_lo and v_hi must describe aligned, completed block-16 boundaries. They are device values and are not validated on the host, so the caller is responsible for meeting this requirement.
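The sizing rule above can be illustrated with a small host-side sketch. The helper names `max_completed_groups` and `completed_groups` are hypothetical and are not part of this module, and the ceiling formula is an assumption derived from the description, not the actual launch-grid code. It shows why a bound computed from max_new_tokens alone is safe for any start position that is known only on the device.

```python
GROUP = 16  # block-16 group size, taken from the description above


def max_completed_groups(max_new_tokens: int, decode: bool = False) -> int:
    """Upper bound on the groups one request can complete in a single write.

    Hypothetical helper: the real grid formula is not documented here.
    """
    if max_new_tokens < 0:
        raise ValueError("max_new_tokens must be non-negative")
    if decode:
        # One new token can close at most one group.
        return 1
    # Worst case: the chunk starts one slot before a group boundary, so
    # n new tokens complete at most ceil(n / GROUP) groups.
    return -(-max_new_tokens // GROUP)


def completed_groups(start: int, n_new: int) -> int:
    """Exact count of groups closed by writing n_new tokens at position start.

    In practice `start` lives on the device; it is a plain int here only
    to check the bound above.
    """
    return (start + n_new) // GROUP - start // GROUP
```

Under these assumptions the bound never undercounts, so threads in the grid that have no completed group to process can simply be masked off.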

Parameters:

v_cache (Tensor)
block_table (Tensor)
v_lo (Tensor)
v_hi (Tensor)
max_new_tokens (int)
page_size (int)
v_qdq_scale (float)
decode (bool)

Return type:

None
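
As a rough illustration of what finalizing complete block-16 groups in [v_lo, v_hi) in place means, the sketch below applies a quantize-dequantize round trip to a flat Python list. It is a simplified reference under stated assumptions, not this function's implementation. It ignores paging (block_table, page_size) and v_qdq_scale, keeps the per-group scale in full precision instead of rounding it to FP8, and breaks rounding ties toward the smaller magnitude. The names `fake_quant_group` and `fake_quant_range` are hypothetical.

```python
import math

GROUP = 16
# Magnitudes representable by a 4-bit E2M1 value.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_group(values):
    """Quantize one group to the E2M1 grid and dequantize it again."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    # Assumption: the group scale maps the largest magnitude onto 6.0.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in values:
        # Nearest grid magnitude; ties go to the smaller one (simplification).
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def fake_quant_range(v, v_lo, v_hi):
    """Round-trip every complete group in [v_lo, v_hi) of list v, in place."""
    if v_lo % GROUP or v_hi % GROUP or v_lo > v_hi:
        raise ValueError("v_lo and v_hi must be aligned block-16 boundaries")
    for start in range(v_lo, v_hi, GROUP):
        v[start:start + GROUP] = fake_quant_group(v[start:start + GROUP])
```

Unlike the documented function, this sketch validates the range on the host, which is only practical because the bounds are plain integers here.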