INT4 Group-wise GEMM#

void trt_edgellm::kernel::gemv_forward_cuda_new(
half *in_feats,
int8_t *kernel,
half *scaling_factors,
half *out_feats,
int m,
int n,
int k,
int group_size,
cudaStream_t stream
)#

INT4 group-wise quantized GEMV (matrix-vector multiplication)

Optimized for small batch sizes (M = 1 to 4). Computes out = in @ W_dequantized, where W is INT4-quantized with group-wise scaling factors. A hedged call sketch follows the parameter list below.

Parameters:
  • in_feats – Input features [M, K] (primarily optimized for M in [1, 4])

  • kernel – INT4 quantized weight matrix [N/2, K] stored as int8, with two INT4 values packed per byte

  • scaling_factors – Group-wise scales [K/group_size, N]

  • out_feats – Output features [M, N]

  • m – Batch size

  • n – Output dimension

  • k – Input dimension

  • group_size – Quantization group size

  • stream – CUDA stream
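
A minimal host-side call sketch, assuming only the signature and tensor shapes documented above. The header providing the declaration is not named in this reference, so the forward declaration below is an assumption; buffer initialization and error checking are omitted.

```cpp
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdint>

// Forward declaration matching the documented signature (assumption:
// the actual project header should be included instead).
namespace trt_edgellm { namespace kernel {
void gemv_forward_cuda_new(half *in_feats, int8_t *kernel,
                           half *scaling_factors, half *out_feats,
                           int m, int n, int k, int group_size,
                           cudaStream_t stream);
}}  // namespace trt_edgellm::kernel

int main() {
    // Hypothetical problem sizes for illustration.
    const int m = 1, n = 4096, k = 4096, group_size = 128;

    half *in_feats, *scaling_factors, *out_feats;
    int8_t *kernel;

    // Allocations follow the documented shapes:
    // [M, K] activations, [N/2, K] packed INT4 weights,
    // [K/group_size, N] scales, [M, N] output.
    cudaMalloc(&in_feats,        m * k * sizeof(half));
    cudaMalloc(&kernel,          (n / 2) * k * sizeof(int8_t));
    cudaMalloc(&scaling_factors, (k / group_size) * n * sizeof(half));
    cudaMalloc(&out_feats,       m * n * sizeof(half));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... fill in_feats, kernel, and scaling_factors ...

    trt_edgellm::kernel::gemv_forward_cuda_new(
        in_feats, kernel, scaling_factors, out_feats,
        m, n, k, group_size, stream);

    cudaStreamSynchronize(stream);
    // ... free buffers and destroy the stream ...
    return 0;
}
```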

void trt_edgellm::kernel::gemm_forward_cuda_new(
half *in_feats,
int8_t *kernel,
half *scaling_factors,
half *out_feats,
int m,
int n,
int k,
int group_size,
cudaStream_t stream
)#

INT4 group-wise quantized GEMM (matrix-matrix multiplication)

Optimized for larger batch sizes (M > 1). Computes out = in @ W_dequantized, where W is INT4-quantized with group-wise scaling factors. A hedged CPU reference sketch of this computation follows the parameter list below.

Parameters:
  • in_feats – Input features [M, K]

  • kernel – INT4 quantized weight matrix [N/2, K] stored as int8, with two INT4 values packed per byte

  • scaling_factors – Group-wise scales [K/group_size, N]

  • out_feats – Output features [M, N]

  • m – Batch size

  • n – Output dimension

  • k – Input dimension

  • group_size – Quantization group size

  • stream – CUDA stream
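
A CPU reference sketch of the group-wise dequantized matmul both kernels compute. Only the tensor shapes come from the documentation above; the nibble order (low nibble holds the even output row) and the zero-point mapping of the 4-bit code to [-8, 7] are assumptions about the packed format.

```cpp
#include <cuda_fp16.h>
#include <cstdint>

// Reference: out[i][j] = sum_x in[i][x] * q(j, x) * scale[x / group_size][j],
// where q(j, x) is the INT4 weight unpacked from the [N/2, K] int8 buffer.
void gemm_int4_reference(const half *in,         // [M, K]
                         const int8_t *w_packed, // [N/2, K], two INT4 per byte
                         const half *scales,     // [K/group_size, N]
                         half *out,              // [M, N]
                         int M, int N, int K, int group_size) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int x = 0; x < K; ++x) {
                // Unpack the INT4 weight for output row j, input column x.
                int8_t byte = w_packed[(j / 2) * K + x];
                int q = (j % 2 == 0) ? (byte & 0x0F)          // assumed: low nibble
                                     : ((byte >> 4) & 0x0F);  // assumed: high nibble
                q -= 8;  // assumed mapping of the 4-bit code to [-8, 7]
                float s = __half2float(scales[(x / group_size) * N + j]);
                acc += __half2float(in[i * K + x]) * (float)q * s;
            }
            out[i * N + j] = __float2half(acc);
        }
    }
}
```

In practice, gemv_forward_cuda_new is the better fit when M is in [1, 4] (the decode path of LLM inference) and gemm_forward_cuda_new when M is larger (prefill or batched requests); both produce the result the reference above describes.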