fp4_kernel
NVFP4 Fake Quantization Triton Implementation.
This module provides high-performance GPU implementations of NVFP4 fake quantization operations using Triton kernels.
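For orientation, the sketch below shows roughly what blockwise FP4 (E2M1) fake quantization does in plain PyTorch: each block is scaled so its maximum magnitude maps onto the largest representable FP4 value (6.0), each element is rounded to the nearest representable magnitude, and the scaling is undone. This is only an illustration of the general scheme, not the kernel's implementation; the NVFP4 format additionally stores per-block scales in FP8 (E4M3) derived from the global amax, which this simplified sketch omits.

```python
import torch

# Magnitudes representable in FP4 E2M1 (sign is handled separately).
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant_reference(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Illustrative blockwise FP4 fake quantization (assumes x.numel() % block_size == 0)."""
    orig_shape, orig_dtype = x.shape, x.dtype
    xf = x.float().reshape(-1, block_size)

    # One scale per block so the block's max magnitude lands on 6.0,
    # the largest representable E2M1 value.
    scale = xf.abs().amax(dim=-1, keepdim=True) / 6.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)

    # Round each scaled magnitude to the nearest representable FP4 value,
    # keep the sign, then undo the scaling ("fake" quantization).
    scaled = xf / scale
    values = FP4_VALUES.to(xf.device)
    nearest = values[(scaled.abs().unsqueeze(-1) - values).abs().argmin(dim=-1)]
    quant = torch.sign(scaled) * nearest

    return (quant * scale).reshape(orig_shape).to(orig_dtype)
```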
Functions
- fp4_fake_quant_block(x, global_amax, block_size=16, tile_size=128)
Applies FP4 fake quantization on the input tensor.
- Parameters:
x (torch.Tensor) – Input tensor of shape (M, N)
global_amax (torch.Tensor) – Global max value of the input tensor. This needs to be a tensor (rather than a Python scalar) to be CUDA-graph compatible.
block_size (int) – Size of the FP4 quantization blocks (default: 16)
tile_size (int) – Size of the processing tiles used by the kernel (default: 128)
- Returns:
Fake-quantized tensor with the same shape as the input
- Return type:
torch.Tensor
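A minimal usage sketch follows, assuming the function is importable from `fp4_kernel` (adjust the import to your package layout) and that the input lives on a CUDA device, as Triton kernels require:

```python
import torch

from fp4_kernel import fp4_fake_quant_block  # hypothetical import path

# Activations on the GPU; the Triton kernel operates on CUDA tensors.
x = torch.randn(512, 1024, device="cuda", dtype=torch.float16)

# Pass the global amax as a tensor (not a Python float) so the value can be
# updated without breaking CUDA-graph capture.
global_amax = x.abs().amax().float()

# Fake-quantize with the default 16-element FP4 blocks and 128-wide tiles.
x_fq = fp4_fake_quant_block(x, global_amax, block_size=16, tile_size=128)

assert x_fq.shape == x.shape
```

Because `global_amax` is a tensor, it can be refreshed in place (e.g., with `global_amax.copy_(new_amax)`) between replays of a captured CUDA graph instead of being baked in as a constant.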