quantize

nvtripy.quantize(input: Tensor, scale: Tensor | Number | Sequence[Number] | Sequence[Sequence[Number]], dtype: dtype, dim: int | None = None) → Tensor

Quantizes the input Tensor. The valid quantized data types are nvtripy.int8, nvtripy.int4, nvtripy.float8.

If dtype is nvtripy.int4, the result of this function cannot be printed, because nvtripy.int4 is an internal quantized data type. It must first be dequantized with dequantize() to a higher-precision data type.

If dim is not given, this function will perform “per-tensor” or “block-wise” quantization.

For “per-tensor” quantization, the scale must be a scalar tensor or a single python number.
For “block-wise” quantization, dtype must be nvtripy.int4. The input tensor must have exactly 2 dimensions, e.g. [D0, D1]. The scale must also be a 2-D tensor or a 2-D Python sequence. The first dimension of scale must evenly divide D0, the dimension along which “blocking” is performed. The second dimension of scale must equal D1.
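The block-wise shape rules can be illustrated with a minimal pure-Python sketch that does not call nvtripy. It assumes that each element is divided by its scale, rounded to the nearest integer, and clamped to the signed 4-bit range [-8, 7], and that each row of scale covers D0 / scale.shape[0] consecutive input rows. These semantics are not stated on this page, and the function name quantize_blockwise_int4 is hypothetical.

```python
def quantize_blockwise_int4(values, scale):
    """Reference sketch (not the nvtripy implementation).

    values: 2-D list of floats with shape [D0][D1]
    scale:  2-D list of floats with shape [S0][D1], where S0 divides D0
    """
    d0, d1 = len(values), len(values[0])
    s0 = len(scale)
    if d0 % s0 != 0:
        raise ValueError("scale.shape[0] must evenly divide input.shape[0]")
    if any(len(row) != d1 for row in scale):
        raise ValueError("scale.shape[1] must equal input.shape[1]")

    block = d0 // s0  # consecutive input rows that share one row of scales
    out = []
    for i, row in enumerate(values):
        scale_row = scale[i // block]
        # Assumed semantics: divide, round to nearest, clamp to [-8, 7].
        out.append([max(-8, min(7, round(x / s))) for x, s in zip(row, scale_row)])
    return out


values = [[0.0, 1.0], [2.0, 3.0], [4.0, 6.0], [8.0, 30.0]]
scale = [[1.0, 1.0], [2.0, 2.0]]  # S0 = 2, so each block spans 2 rows
quantized = quantize_blockwise_int4(values, scale)
print(quantized)  # [[0, 1], [2, 3], [2, 3], [4, 7]]
```

The last element shows the clamp: 30.0 / 2.0 = 15 lies outside the int4 range, so it saturates at 7.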

If dim is given, this function will perform “per-channel” quantization. The scale must be a 1-D tensor or a Python sequence, in either case of size input.shape[dim].
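As a rough illustration of how dim selects the scale for each element, here is a pure-Python sketch for 2-D inputs. It does not call nvtripy; the round-then-clamp-to-[-128, 127] behavior is an assumption about int8 quantization rather than something this page specifies, and quantize_per_channel_int8 is a hypothetical helper name.

```python
def quantize_per_channel_int8(values, scale, dim):
    """Reference sketch for a 2-D list of floats (not the nvtripy implementation)."""
    if dim not in (0, 1):
        raise ValueError("this sketch only handles 2-D inputs, so dim must be 0 or 1")
    shape = (len(values), len(values[0]))
    if len(scale) != shape[dim]:
        raise ValueError("scale must have input.shape[dim] elements")

    out = []
    for i, row in enumerate(values):
        quantized_row = []
        for j, x in enumerate(row):
            s = scale[i] if dim == 0 else scale[j]  # one scale per slice along dim
            # Assumed semantics: divide, round to nearest, clamp to int8 range.
            quantized_row.append(max(-128, min(127, round(x / s))))
        out.append(quantized_row)
    return out


values = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
by_row = quantize_per_channel_int8(values, [0.99872, 0.96125], dim=0)
by_col = quantize_per_channel_int8(values, [1.0, 0.5, 0.25], dim=1)
print(by_row)  # [[0, 1, 2], [3, 4, 5]]
print(by_col)  # [[0, 2, 8], [3, 8, 20]]
```

With dim=0 each row has its own scale; with dim=1 each column does.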

Parameters:

input (Tensor) – [dtype=T1] The input tensor.
scale (Tensor | Number | Sequence[Number] | Sequence[Sequence[Number]]) – [dtype=T1] The scale tensor. Must be a constant tensor.
dtype (dtype) – [dtype=T2] The quantization data type. Must be a valid quantized data type (see above).
dim (int | None) – The dimension for per-channel quantization.

Returns:

[dtype=T2] Quantized Tensor.

Return type:

Tensor

DATA TYPE CONSTRAINTS:

T1: float16, bfloat16, float32
T2: float8, int4, int8

Example: Per-tensor quantization

input = tp.reshape(tp.arange(6, tp.float32), (2, 3))
scale = 0.99872
# output = tp.quantize(input, scale, tp.int8)

# assert np.array_equal(cp.from_dlpack(output).get(), expected)

Local Variables

>>> input
tensor(
    [[0, 1, 2],
     [3, 4, 5]], 
    dtype=float32, loc=gpu:0, shape=(2, 3))
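The commented-out assertion in this example refers to a value named expected that the snippet does not define. One plausible way to compute it by hand, assuming int8 quantization divides by the scale, rounds to the nearest integer, and clamps to [-128, 127], is sketched below in plain Python. The helper name quantize_per_tensor_int8 is hypothetical and nvtripy is not called.

```python
def quantize_per_tensor_int8(values, scale):
    """Reference sketch: one scalar scale shared by every element."""
    # Assumed semantics: divide, round to nearest, clamp to int8 range.
    return [[max(-128, min(127, round(x / scale))) for x in row] for row in values]


# Same data as the example: arange(6) reshaped to (2, 3), scale = 0.99872.
values = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
expected = quantize_per_tensor_int8(values, 0.99872)
print(expected)  # [[0, 1, 2], [3, 4, 5]]
```

Because the scale is close to 1, every quotient rounds back to the original integer value.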

Example: Per-channel quantization

input = tp.Tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
scale = [0.99872, 0.96125]
output = tp.quantize(input, scale, tp.int8, dim=0)

Local Variables

>>> input
tensor(
    [[0, 1, 2],
     [3, 4, 5]], 
    dtype=float32, loc=cpu:0, shape=(2, 3))

>>> output
tensor(
    [[0, 1, 2],
     [3, 4, 5]], 
    dtype=int8, loc=gpu:0, shape=(2, 3))

Example: Block-wise quantization

input = tp.Tensor([[0.0, 1.0], [2.0, 3.0]])
scale = [[1.0, 1.0]]
quant = tp.quantize(input, scale, tp.int4)
output = tp.dequantize(quant, scale, tp.float32)

Local Variables

>>> output
tensor(
    [[0, 1],
     [2, 3]], 
    dtype=float32, loc=gpu:0, shape=(2, 2))