fp8_tensor

Implements FP8 quantization for efficient tensor storage and computation.

Classes

FP8QTensor

Implements FP8 quantization of tensors for more efficient storage or computation.

class FP8QTensor

Bases: BaseQuantizedTensor

Implements FP8 quantization of tensors for more efficient storage or computation.

quantized_data

The quantized data stored as a packed fp8 tensor.

Type:

torch.Tensor

dequantize(dtype=None, **kwarg)

Dequantize an FP8 packed tensor to a target dtype.

Parameters:

  • dtype (torch.dtype) – The target dtype for the dequantized tensor. Defaults to None.
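A minimal round-trip sketch. The import path below (modelopt.torch.quantization.qtensor.fp8_tensor) is an assumption inferred from the module name above; adjust it to your installation. It produces the FP8QTensor via the quantize classmethod documented below, then dequantizes it:

import torch
from modelopt.torch.quantization.qtensor.fp8_tensor import FP8QTensor  # assumed import path

weight = torch.randn(128, 256)

# Quantize to packed fp8 (E4M3); quantize returns the quantized tensor
# and the scales used.
qt, scales = FP8QTensor.quantize(weight)

# Recover a higher-precision copy. Depending on the installed version,
# scale metadata may also need to be supplied through **kwarg.
restored = qt.dequantize(dtype=torch.float16)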

classmethod quantize(input, scales=None, axis=None, block_sizes=None)

Convert a tensor to a quantized format based on FP8 quantization. Only the E4M3 format is supported.

Parameters:
  • input (torch.Tensor) – The input tensor to be quantized.

  • scales (torch.Tensor) – The scales for quantization.

  • axis (tuple | int | None) – The dimension(s) to reduce over when computing quantization scales; None, an int, or a tuple of ints.

  • block_sizes (dict) – A dictionary specifying the block size for each dimension.

Note: Only one of axis or block_sizes can be provided for FP8 quantization.

Returns:

A tuple of (FP8QTensor, scales).

Return type:

tuple
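Hedged sketches of the scale-granularity options, continuing the example above (same import and weight tensor). The {dimension: block size} reading of block_sizes follows the parameter description; the concrete keys and values shown are assumptions. Note that axis and block_sizes are mutually exclusive:

# Per-axis scales: reduce over dimension 0 when computing scales.
qt_axis, scales_axis = FP8QTensor.quantize(weight, axis=0)

# Block-wise scales: one scale per 64-element block along the last
# dimension (assumed {dim: block_size} mapping).
qt_block, scales_block = FP8QTensor.quantize(weight, block_sizes={-1: 64})

# The packed fp8 payload is exposed via the quantized_data attribute.
print(qt_block.quantized_data.dtype)  # fp8 storage dtype, e.g. torch.float8_e4m3fn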
