nvfp4_tensor
Implements NVFP4 quantization for efficient tensor storage and computation.
Classes
- class NVFP4QTensor
Bases:
BaseQuantizedTensor
Implements NVFP4 quantization on tensors for more efficient storage and computation.
- quantized_data
The quantized data stored as a packed uint8 tensor.
- Type:
torch.Tensor
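Since `quantized_data` packs two 4-bit codes into each uint8, a minimal pure-Python sketch of the pack/unpack idea may help. The nibble layout chosen here (low nibble holds the even-indexed element) is an assumption for illustration; the library's actual packing layout may differ.

```python
def pack_fp4_pairs(codes):
    """Pack pairs of 4-bit codes (0..15) into single bytes.

    Layout assumption: low nibble = even-indexed element. The
    library's real uint8 layout may differ.
    """
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))

def unpack_fp4_pairs(packed):
    """Recover the 4-bit codes from the packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0xF)   # low nibble first, matching the pack order
        out.append(b >> 4)
    return out

codes = [0, 1, 7, 15, 8, 3]
packed = pack_fp4_pairs(codes)   # 6 codes fit in 3 bytes
assert unpack_fp4_pairs(packed) == codes
```

The 2x storage reduction relative to int8 (and 4x relative to fp16) comes directly from this packing.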
- dequantize(dtype=None, **kwarg)
Dequantizes the NVFP4 packed tensor to a target dtype.
- Parameters:
dtype (torch.dtype) – Target dtype for the dequantized tensor.
- e2m1_values_on_device = {}
- classmethod get_activation_scaling_factor(quantizer)
Returns the activation scaling factor for export.
- classmethod get_e2m1_values(device)
Returns the e2m1 values on the device.
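E2M1 (2 exponent bits, 1 mantissa bit) can represent exactly eight non-negative magnitudes, with the sign handled by a separate bit. The table below follows the standard FP4 E2M1 encoding; the exact tensor returned by `get_e2m1_values` is an implementation detail, and the rounding helper here is a pure-Python sketch, not the library's on-device code.

```python
# The 8 non-negative magnitudes representable in E2M1 (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nearest_e2m1(x):
    """Round x to the nearest representable E2M1 value (sketch).

    Saturates at the E2M1 maximum magnitude of 6.0.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # saturate at the E2M1 max
    return sign * min(E2M1_VALUES, key=lambda v: abs(v - mag))

assert nearest_e2m1(2.4) == 2.0   # 2.4 is closer to 2.0 than to 3.0
assert nearest_e2m1(-7.0) == -6.0 # saturates at the format maximum
```

Caching these values per device (as `e2m1_values_on_device` suggests) avoids re-creating the lookup tensor on every call.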
- classmethod get_weights_scaling_factor(input, block_size, weights_scaling_factor_2=None, keep_high_precision=False)
Returns the quantized per-block weight scaling factor.
- Parameters:
input (Tensor)
block_size (int)
weights_scaling_factor_2 (Tensor | None)
keep_high_precision (bool)
- classmethod get_weights_scaling_factor_2(input)
Returns the per-tensor weight scaling factor.
- Parameters:
input (Tensor)
- classmethod get_weights_scaling_factor_2_from_quantizer(weight_quantizer)
Returns the per-tensor weight scaling factor from the weight_quantizer amax.
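The per-tensor factor is conventionally chosen so that, after both scaling levels, element values fit the E2M1 range (max magnitude 6) while per-block scales remain representable in FP8 E4M3 (max magnitude 448). A hedged sketch assuming that common NVFP4 convention; the library derives `amax` from the quantizer's calibration statistics:

```python
E2M1_MAX = 6.0    # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest FP8 E4M3 magnitude (format used for per-block scales)

def weights_scaling_factor_2(amax):
    """Per-tensor scale sketch: amax / (E2M1_MAX * E4M3_MAX).

    Assumes the usual NVFP4 two-level scaling convention; the library
    computes this from the weight quantizer's recorded amax.
    """
    return amax / (E2M1_MAX * E4M3_MAX)
```

With this choice, a tensor whose amax is exactly 6 * 448 = 2688 gets a per-tensor factor of 1.0.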
- classmethod quantize(input, block_size, weights_scaling_factor=None, weights_scaling_factor_2=None, keep_high_precision=False, try_tensorrt=False)
Converts a tensor to a quantized format based on NVFP4 quantization.
- Parameters:
input (torch.Tensor) – The input tensor to be quantized.
block_size (int) – The size of each block for quantization.
weights_scaling_factor (torch.Tensor) – The per-block scaling factor for the weights.
weights_scaling_factor_2 (torch.Tensor) – The per-tensor scaling factor for the weights.
keep_high_precision (bool) – Whether to keep output scales at high precision.
try_tensorrt (bool)
Returns: tuple – Contains the quantized data, the quantized per-block scaling factor, and the per-tensor scaling factor.
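Putting the pieces together, the quantize/dequantize flow can be sketched in pure Python. Helper names here are illustrative, not the library API, and the sketch keeps block scales in full precision, whereas the real implementation packs codes into uint8 and stores block scales in FP8 E4M3 against the per-tensor factor:

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0

def _nearest(x):
    """Round x to the nearest E2M1 value (sign handled separately)."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(E2M1_VALUES, key=lambda v: abs(v - abs(x)))

def quantize_blocks(values, block_size):
    """Blockwise quantize sketch: per-block scale = block_amax / E2M1_MAX,
    then round each scaled element to the nearest E2M1 value."""
    scales, q = [], []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) / E2M1_MAX or 1.0  # avoid div by 0
        scales.append(scale)
        q.extend(_nearest(v / scale) for v in block)
    return q, scales

def dequantize_blocks(q, scales, block_size):
    """Dequantize sketch: multiply each block back by its scale."""
    return [q[i] * scales[i // block_size] for i in range(len(q))]

# Values already on the scaled E2M1 grid round-trip exactly.
vals = [0.5, 1.0, -3.0, 6.0]
q, scales = quantize_blocks(vals, block_size=4)
assert dequantize_blocks(q, scales, block_size=4) == vals
```

Values that do not land on the scaled E2M1 grid incur rounding error, which is why the block scale tracks each block's amax.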