Int4 Groupwise GEMM
void trt_edgellm::kernel::gemv_forward_cuda_new(
    half *in_feats,
    int8_t *kernel,
    half *scaling_factors,
    half *out_feats,
    int m,
    int n,
    int k,
    int group_size,
    cudaStream_t stream
)
INT4 group-wise quantized GEMV (matrix-vector multiplication).
Optimized for small batch sizes (M = 1 to 4). Computes out = in @ W_dequantized, where W is INT4-quantized with group-wise scaling factors. A minimal launch sketch follows the parameter list below.
- Parameters:
in_feats – Input features [M, K] (primarily optimized for M in [1, 4])
kernel – INT4-quantized weight matrix [N/2, K] stored as int8 (two INT4 values packed per byte)
scaling_factors – Group-wise scales [K/group_size, N]
out_feats – Output features [M, N]
m – Batch size
n – Output dimension
k – Input dimension
group_size – Quantization group size
stream – CUDA stream
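A minimal host-side launch sketch for the GEMV path, assuming the declaration above is reachable from a project header (the standalone declaration, buffer contents, and sizes here are illustrative assumptions; only the signature and tensor shapes come from this page):

```cpp
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Declaration as documented above; the actual header path is an assumption.
namespace trt_edgellm { namespace kernel {
void gemv_forward_cuda_new(half *in_feats, int8_t *kernel, half *scaling_factors,
                           half *out_feats, int m, int n, int k, int group_size,
                           cudaStream_t stream);
}}

int main() {
    const int m = 1, n = 4096, k = 4096, group_size = 128;  // assumed sizes

    half *in_feats, *scaling_factors, *out_feats;
    int8_t *weights;
    cudaMalloc(&in_feats, sizeof(half) * m * k);                       // [M, K] activations
    cudaMalloc(&weights, sizeof(int8_t) * (n / 2) * k);                // [N/2, K] packed INT4
    cudaMalloc(&scaling_factors, sizeof(half) * (k / group_size) * n); // [K/group_size, N]
    cudaMalloc(&out_feats, sizeof(half) * m * n);                      // [M, N]

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Buffers are assumed to be filled with inputs and pre-quantized weights elsewhere.
    trt_edgellm::kernel::gemv_forward_cuda_new(in_feats, weights, scaling_factors,
                                               out_feats, m, n, k, group_size, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(in_feats); cudaFree(weights); cudaFree(scaling_factors); cudaFree(out_feats);
    return 0;
}
```

Note that kernel and scaling_factors are expected to already hold the packed INT4 weights and per-group scales; producing them (the quantization step) is outside this API.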
void trt_edgellm::kernel::gemm_forward_cuda_new(
    half *in_feats,
    int8_t *kernel,
    half *scaling_factors,
    half *out_feats,
    int m,
    int n,
    int k,
    int group_size,
    cudaStream_t stream
)
INT4 group-wise quantized GEMM (matrix-matrix multiplication).
Optimized for batch sizes M > 1. Computes out = in @ W_dequantized, where W is INT4-quantized with group-wise scaling factors; a CPU reference for this group-wise dequantization is sketched after the parameter list.
- Parameters:
in_feats – Input features [M, K]
kernel – INT4-quantized weight matrix [N/2, K] stored as int8 (two INT4 values packed per byte)
scaling_factors – Group-wise scales [K/group_size, N]
out_feats – Output features [M, N]
m – Batch size
n – Output dimension
k – Input dimension
group_size – Quantization group size
stream – CUDA stream
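To make the group-wise scheme concrete, below is a plain C++ CPU reference for the documented math out = in @ W_dequantized. The nibble packing order and the symmetric signed INT4 range are assumptions for illustration (the doc mentions no zero points, so symmetric quantization is assumed); only the tensor shapes and per-group scaling come from the parameter descriptions above:

```cpp
#include <cstdint>
#include <vector>

// CPU reference for the documented math: out = in @ dequantize(W).
// float is used in place of half for readability. The nibble order
// (even output column in the low nibble) and the signed [-8, 7] range
// are assumptions; the actual kernel's packing may differ.
void gemm_int4_reference(const std::vector<float> &in,      // [M, K]
                         const std::vector<int8_t> &packed, // [N/2, K], two INT4 per byte
                         const std::vector<float> &scales,  // [K/group_size, N]
                         std::vector<float> &out,           // [M, N]
                         int M, int N, int K, int group_size) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int kk = 0; kk < K; ++kk) {
                int8_t byte = packed[(j / 2) * K + kk];
                // Assumed packing: even j in the low nibble, odd j in the high nibble.
                int q = (j % 2 == 0) ? (byte & 0x0F) : ((byte >> 4) & 0x0F);
                if (q >= 8) q -= 16;  // assumed signed INT4 in [-8, 7], no zero point
                // Every group_size consecutive K values in column j share one scale.
                float scale = scales[(kk / group_size) * N + j];
                acc += in[i * K + kk] * static_cast<float>(q) * scale;
            }
            out[i * N + j] = acc;
        }
    }
}
```

Each run of group_size consecutive values along K in a given output column j shares one scale, which is why scaling_factors has shape [K/group_size, N].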