INT4 Group-wise GEMM#

void trt_edgellm::kernel::gemv_forward_cuda_new(
half *in_feats,
int8_t *kernel,
half *scaling_factors,
half *out_feats,
int m,
int n,
int k,
int group_size,
cudaStream_t stream
)#

INT4 group-wise quantized GEMV (matrix-vector multiplication)

Optimized for small batch sizes (M = 1 to 4). Computes out = in @ W_dequantized, where W is INT4-quantized with group-wise scaling factors. A hedged call sketch follows the parameter list below.

Parameters:
  • in_feats – Input features [M, K] (primarily optimized for M in [1, 4])

  • kernel – INT4 quantized weight matrix [N/2, K] stored as int8, with two INT4 values packed per byte

  • scaling_factors – Group-wise scales [K/group_size, N]

  • out_feats – Output features [M, N]

  • m – Batch size

  • n – Output dimension

  • k – Input dimension

  • group_size – Quantization group size

  • stream – CUDA stream
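
A minimal host-side call sketch, assuming only the signature and tensor shapes documented above. The header providing the declaration is not named in this reference, so the forward declaration below is an assumption; buffer initialization and error checking are omitted.

```cpp
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdint>

// Forward declaration matching the documented signature (assumption:
// the actual project header should be included instead).
namespace trt_edgellm { namespace kernel {
void gemv_forward_cuda_new(half *in_feats, int8_t *kernel,
                           half *scaling_factors, half *out_feats,
                           int m, int n, int k, int group_size,
                           cudaStream_t stream);
}}  // namespace trt_edgellm::kernel

int main() {
    // Hypothetical problem sizes for illustration.
    const int m = 1, n = 4096, k = 4096, group_size = 128;

    half *in_feats, *scaling_factors, *out_feats;
    int8_t *kernel;

    // Allocations follow the documented shapes:
    // [M, K] activations, [N/2, K] packed INT4 weights,
    // [K/group_size, N] scales, [M, N] output.
    cudaMalloc(&in_feats,        m * k * sizeof(half));
    cudaMalloc(&kernel,          (n / 2) * k * sizeof(int8_t));
    cudaMalloc(&scaling_factors, (k / group_size) * n * sizeof(half));
    cudaMalloc(&out_feats,       m * n * sizeof(half));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... fill in_feats, kernel, and scaling_factors ...

    trt_edgellm::kernel::gemv_forward_cuda_new(
        in_feats, kernel, scaling_factors, out_feats,
        m, n, k, group_size, stream);

    cudaStreamSynchronize(stream);
    // ... free buffers and destroy the stream ...
    return 0;
}
```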

void trt_edgellm::kernel::gemm_forward_cuda_new(
half *in_feats,
int8_t *kernel,
half *scaling_factors,
half *out_feats,
int m,
int n,
int k,
int group_size,
cudaStream_t stream
)#

INT4 group-wise quantized GEMM (matrix-matrix multiplication)

Optimized for larger batch sizes (M > 1). Computes out = in @ W_dequantized, where W is INT4-quantized with group-wise scaling factors. A hedged CPU reference sketch of this computation follows the parameter list below.

Parameters:
  • in_feats – Input features [M, K]

  • kernel – INT4 quantized weight matrix [N/2, K] stored as int8, with two INT4 values packed per byte

  • scaling_factors – Group-wise scales [K/group_size, N]

  • out_feats – Output features [M, N]

  • m – Batch size

  • n – Output dimension

  • k – Input dimension

  • group_size – Quantization group size

  • stream – CUDA stream
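
A CPU reference sketch of the group-wise dequantized matmul both kernels compute. Only the tensor shapes come from the documentation above; the nibble order (low nibble holds the even output row) and the zero-point mapping of the 4-bit code to [-8, 7] are assumptions about the packed format.

```cpp
#include <cuda_fp16.h>
#include <cstdint>

// Reference: out[i][j] = sum_x in[i][x] * q(j, x) * scale[x / group_size][j],
// where q(j, x) is the INT4 weight unpacked from the [N/2, K] int8 buffer.
void gemm_int4_reference(const half *in,         // [M, K]
                         const int8_t *w_packed, // [N/2, K], two INT4 per byte
                         const half *scales,     // [K/group_size, N]
                         half *out,              // [M, N]
                         int M, int N, int K, int group_size) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int x = 0; x < K; ++x) {
                // Unpack the INT4 weight for output row j, input column x.
                int8_t byte = w_packed[(j / 2) * K + x];
                int q = (j % 2 == 0) ? (byte & 0x0F)          // assumed: low nibble
                                     : ((byte >> 4) & 0x0F);  // assumed: high nibble
                q -= 8;  // assumed mapping of the 4-bit code to [-8, 7]
                float s = __half2float(scales[(x / group_size) * N + j]);
                acc += __half2float(in[i * K + x]) * (float)q * s;
            }
            out[i * N + j] = __float2half(acc);
        }
    }
}
```

In practice, gemv_forward_cuda_new is the better fit when M is in [1, 4] (the decode path of LLM inference) and gemm_forward_cuda_new when M is larger (prefill or batched requests); both produce the result the reference above describes.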