quantize

tripy.quantize(input: Tensor, scale: Tensor | Number | Sequence[Number] | Sequence[Sequence[Number]], dtype: dtype, dim: int | Any = None) -> Tensor

Quantizes the input tensor. The valid quantized data types are tripy.int8, tripy.int4, and tripy.float8.

If dtype is tripy.int4, the result of this function cannot be printed, because tripy.int4 is an internal quantized data type. It must first be dequantized to a higher precision with dequantize().
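
To make the quantize/dequantize relationship concrete, here is a plain-Python sketch of per-tensor quantization with a single scale followed by dequantization back to floats. It assumes the common symmetric (zero-point-free) convention q = clamp(round(x / scale), qmin, qmax) with round-half-to-even, and x ≈ q * scale for dequantization. The helper names, the INT_RANGES table, and the signed int8/int4 ranges are illustrative assumptions rather than tripy internals, and float8 is not modeled.

```python
# Hypothetical reference helpers (not part of tripy): a plain-Python model of
# per-tensor quantization, assuming q = clamp(round(x / scale), qmin, qmax).

# Assumed signed ranges for the integer quantized types.
INT_RANGES = {"int8": (-128, 127), "int4": (-8, 7)}


def quantize_per_tensor(values, scale, dtype="int8"):
    """Quantize a flat list of floats with one scalar scale."""
    qmin, qmax = INT_RANGES[dtype]
    # Python's round() rounds half to even, matching the assumed convention.
    return [min(max(round(v / scale), qmin), qmax) for v in values]


def dequantize_per_tensor(qvalues, scale):
    """Map quantized integers back to floats: x ~= q * scale."""
    return [q * scale for q in qvalues]


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q = quantize_per_tensor(x, 0.99872)
print(q)  # [0, 1, 2, 3, 4, 5]

q4 = quantize_per_tensor([-20.0, 0.4, 9.0], 1.0, dtype="int4")
print(q4)  # [-8, 0, 7]
print(dequantize_per_tensor(q4, 1.0))  # [-8.0, 0.0, 7.0]
```

The int4 call shows why the narrow type is lossy: -20.0 saturates to -8 and 9.0 saturates to 7, so dequantizing cannot recover the original values.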

If dim is not given, this function will perform “per-tensor” or “block-wise” quantization.

  • For “per-tensor” quantization, the scale must be a scalar tensor or a single Python number.

  • For “block-wise” quantization, dtype must be tripy.int4 and the input tensor must have exactly 2 dimensions, e.g. [D0, D1]. The scale must also be a 2-D tensor or a 2-D Python sequence. The first dimension of scale must divide D0, which is the dimension along which “blocking” is performed. The second dimension of scale must equal D1.
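
The block-wise shape rule can be modeled in plain Python as follows. This is a sketch under the assumption that row b of the scale applies to D0 // B consecutive rows of the input (B = scale.shape[0]), and that values are rounded half-to-even and clamped to an assumed signed int4 range of [-8, 7]. quantize_blockwise_int4 is a hypothetical helper, not part of tripy.

```python
# Hypothetical reference helper (not part of tripy): plain-Python model of
# block-wise int4 quantization for a 2-D input of shape [D0, D1] with a 2-D
# scale of shape [B, D1], where B divides D0. Assumes each scale row covers
# D0 // B consecutive input rows.

INT4_MIN, INT4_MAX = -8, 7  # assumed signed int4 range


def quantize_blockwise_int4(matrix, scale):
    d0, d1 = len(matrix), len(matrix[0])
    num_blocks = len(scale)
    if d0 % num_blocks != 0:
        raise ValueError("scale.shape[0] must divide D0")
    if any(len(row) != d1 for row in scale):
        raise ValueError("scale.shape[1] must equal D1")
    block_size = d0 // num_blocks
    out = []
    for i, row in enumerate(matrix):
        # All rows in the same block share one row of the scale.
        s_row = scale[i // block_size]
        out.append([
            min(max(round(v / s), INT4_MIN), INT4_MAX)
            for v, s in zip(row, s_row)
        ])
    return out


x = [[0.0, 1.0], [2.0, 3.0], [8.0, 12.0], [16.0, 20.0]]
scale = [[1.0, 1.0], [4.0, 4.0]]  # B = 2 blocks of 2 rows each
print(quantize_blockwise_int4(x, scale))  # [[0, 1], [2, 3], [2, 3], [4, 5]]
```

With a scale of shape [1, D1], as in the block-wise example on this page, a single block covers every row, so the computation reduces to one scale per column.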

If dim is given, this function will perform “per-channel” quantization. The scale must be a 1-D tensor or a Python sequence whose size equals input.shape[dim].
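
For a 2-D input, per-channel quantization can be sketched as picking each element's scale by its index along dim. The hypothetical helper below assumes round-half-to-even followed by clamping to an assumed int8 range of [-128, 127], and it only handles dim 0 or 1; it illustrates the shape rule and is not tripy's implementation.

```python
# Hypothetical reference helper (not part of tripy): plain-Python model of
# per-channel int8 quantization for a 2-D input, assuming
# q[i][j] = clamp(round(x[i][j] / scale[k]), -128, 127), where k is the
# element's index along `dim`.

INT8_MIN, INT8_MAX = -128, 127  # assumed signed int8 range


def quantize_per_channel_int8(matrix, scale, dim):
    if dim not in (0, 1):
        raise ValueError("this sketch only handles 2-D inputs (dim 0 or 1)")
    size = len(matrix) if dim == 0 else len(matrix[0])
    if len(scale) != size:
        raise ValueError("len(scale) must equal input.shape[dim]")
    out = []
    for i, row in enumerate(matrix):
        q_row = []
        for j, v in enumerate(row):
            # The channel index is the element's coordinate along `dim`.
            s = scale[i] if dim == 0 else scale[j]
            q_row.append(min(max(round(v / s), INT8_MIN), INT8_MAX))
        out.append(q_row)
    return out


x = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
print(quantize_per_channel_int8(x, [0.99872, 0.96125], dim=0))
# [[0, 1, 2], [3, 4, 5]]
print(quantize_per_channel_int8(x, [1.0, 0.5, 0.25], dim=1))
# [[0, 2, 8], [3, 8, 20]]
```

With dim=0 and the scale [0.99872, 0.96125], this reproduces the int8 output shown in the per-channel example; for instance, 5.0 / 0.96125 ≈ 5.2 rounds to 5.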

Parameters:
  • input (Tensor) – [dtype=T1] The input tensor.

  • scale (Tensor | Number | Sequence[Number] | Sequence[Sequence[Number]]) – [dtype=T1] The scale tensor. Must be a constant tensor.

  • dtype (dtype) – [dtype=T2] The quantization data type. Must be a valid quantized data type (see above).

  • dim (int | Any) – The dimension for per-channel quantization. If None, per-tensor or block-wise quantization is performed instead.

Returns:

[dtype=T2] Quantized Tensor.

Return type:

Tensor

TYPE CONSTRAINTS:
Example: Per-tensor quantization
input = tp.reshape(tp.arange(6, tp.float32), (2, 3))
scale = 0.99872
output = tp.quantize(input, scale, tp.int8)
>>> input
tensor(
    [[0.0000, 1.0000, 2.0000],
     [3.0000, 4.0000, 5.0000]], 
    dtype=float32, loc=gpu:0, shape=(2, 3))
Example: Per-channel quantization
input = tp.Tensor([[0, 1, 2], [3, 4, 5]], dtype=tp.float32)
scale = [0.99872, 0.96125]
output = tp.quantize(input, scale, tp.int8, dim=0)
>>> input
tensor(
    [[0.0000, 1.0000, 2.0000],
     [3.0000, 4.0000, 5.0000]], 
    dtype=float32, loc=gpu:0, shape=(2, 3))
>>> output
tensor(
    [[0, 1, 2],
     [3, 4, 5]], 
    dtype=int8, loc=gpu:0, shape=(2, 3))
Example: Block-wise quantization
input = tp.Tensor([[0, 1], [2, 3]], dtype=tp.float32)
scale = [[1.0, 1.0]]
quant = tp.quantize(input, scale, tp.int4)
output = tp.dequantize(quant, scale, tp.float32)
>>> output
tensor(
    [[0.0000, 1.0000],
     [2.0000, 3.0000]], 
    dtype=float32, loc=gpu:0, shape=(2, 2))

See also

dequantize()