Causal Conv1d#

void mamba_ssm::invokeCausalConv1d( trt_edgellm::rt::Tensor const &x, trt_edgellm::rt::Tensor const &weight, trt_edgellm::rt::OptionalInputTensor bias, trt_edgellm::rt::Tensor &out, int32_t stride, int32_t padding, int32_t dilation, trt_edgellm::rt::OptionalInputTensor contextLengths, cudaStream_t stream, )#

Prefill causal depthwise conv1d.

x: [batch, seq_len, dim] weight: [dim, 1, width] bias: [dim] (optional) contextLengths: [batch] INT32, per-batch actual token count (optional, prefill only) out: [batch, out_seq_len, dim]

void mamba_ssm::invokeCaptureConvState( trt_edgellm::rt::Tensor const &x, trt_edgellm::rt::Tensor &convState, trt_edgellm::rt::OptionalInputTensor contextLengths, cudaStream_t stream, )#

Capture conv state from prefill input.

x: [batch, seqLen, dim] contextLengths: [batch] INT32, per-batch actual token count (optional) convState: [batch, dim, width] (output, zero-initialized before call)

void mamba_ssm::invokeCausalConv1dDecode( trt_edgellm::rt::Tensor &convState, trt_edgellm::rt::Tensor const &newCol, trt_edgellm::rt::Tensor const &weight, trt_edgellm::rt::OptionalInputTensor bias, trt_edgellm::rt::Tensor &out, cudaStream_t stream, )#

Decode-mode conv1d: shift conv_state, insert new column, and compute dot product.

convState: [batch, dim, width] (in-place update) newCol: [batch, 1, dim] (new single-token input) weight: [dim, 1, width] bias: [dim] (optional) out: [batch, 1, dim]

void mamba_ssm::invokeCausalConv1dDecodeMTP( trt_edgellm::rt::Tensor &convState, trt_edgellm::rt::Tensor const &newCols, trt_edgellm::rt::Tensor const &weight, trt_edgellm::rt::OptionalInputTensor bias, trt_edgellm::rt::Tensor &out, trt_edgellm::rt::Tensor &intermediateConvStates, int32_t T, cudaStream_t stream, )#

MTP (multi-token) decode: process T draft tokens with per-step state checkpointing.

For each draft token t in [0, T):

Shift conv_state left by 1, insert newCols[:, t, :]
Compute output = dot(conv_state, weight) + bias
Save intermediate conv_state to intermediateConvStates[:, t, :, :]

convState: [batch, dim, width] FP16 (in-place updated to final state) newCols: [batch, T, dim] FP16 (T draft token inputs) weight: [dim, 1, width] FP16 bias: [dim] FP16 (optional) out: [batch, T, dim] FP16 (T outputs) intermediateConvStates: [batch, T, dim, width] FP16 (per-step state cache for rollback) T: number of draft tokens

void mamba_ssm::invokeCausalConv1dDecodeDDTree( trt_edgellm::rt::Tensor const &convState, trt_edgellm::rt::Tensor const &newCols, trt_edgellm::rt::Tensor const &weight, trt_edgellm::rt::OptionalInputTensor bias, trt_edgellm::rt::Tensor &out, trt_edgellm::rt::Tensor &convStateOut, trt_edgellm::rt::Tensor &intermediateConvStates, trt_edgellm::rt::Tensor const &treeParentIds, trt_edgellm::rt::Tensor const &treeDepths, cudaStream_t stream, )#

DDTree decode: compute each tree node from the persistent conv state plus its root-to-node path.

Node 0 is the root verify token. Its output/state are computed by shifting x[:, 0, :] into the persistent convState. Non-root nodes append the full root-to-node path and write one checkpoint per node.

convState: [batch, dim, width] FP16 (persistent committed state, read-only) newCols: [batch, verifySeq, dim] FP16 (tree-node conv inputs) weight: [dim, 1, width] FP16 bias: [dim] FP16 (optional) out: [batch, verifySeq, dim] FP16 (tree-node conv outputs) convStateOut: [batch, dim, width] FP16 (copy of convState) intermediateConvStates: [batch, verifySeq, dim, width] FP16 (per-node states for accepted-node scatter) treeParentIds: [batch, verifySeq] INT32 (root/padding parent is -1) treeDepths: [batch, verifySeq] INT32 (root/padding depth is 0)