model_config
This module defines the model_config format.
This format can be converted from Hugging Face, NeMo, or ModelOpt-quantized models, and the tensorrt_llm engine is built from the context saved in this format.
Classes
- AttentionConfig – The attention layer config.
- ConvConfig – The Conv layer config.
- DecoderLayerConfig – The decoder layer config.
- EmbeddingConfig – The embedding layer config.
- ExpertConfig – The Expert config.
- LayernormConfig – The layernorm layer config.
- LinearActConfig – The linear + activation layer config.
- LinearConfig – The linear layer config.
- MLPConfig – The MLP layer config.
- MOEConfig – The Mixture of Expert layer config.
- MedusaHeadConfig – The Medusa head config.
- ModelConfig – The full LLM model config that includes the full information needed for tensorrt_llm engine building.
- QKVConfig – The QKV layer config.
- RecurrentConfig – The RecurrentBlock from recurrentgemma.
- RgLruConfig – The RG LRU from recurrentgemma.
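These classes are plain dataclasses that nest into one another, with ModelConfig at the top. The following is a minimal sketch of that nesting, using tiny random tensors purely for illustration; the import path and the decoder_type value are assumptions (adjust to your installation), and this is not a real model conversion.

```python
# Minimal, illustrative sketch of how the config dataclasses nest.
# The import path is an assumption; tensor shapes are arbitrary toy values.
import torch

from modelopt.torch.export.model_config import (  # assumed import path
    AttentionConfig,
    DecoderLayerConfig,
    EmbeddingConfig,
    LayernormConfig,
    LinearConfig,
    MLPConfig,
    ModelConfig,
)

hidden = 8  # toy hidden size

embedding = EmbeddingConfig(weight=torch.randn(32, hidden))
attention = AttentionConfig(
    qkv=LinearConfig(weight=torch.randn(3 * hidden, hidden)),  # merged QKV weight
    dense=LinearConfig(weight=torch.randn(hidden, hidden)),
)
mlp = MLPConfig(
    fc=LinearConfig(weight=torch.randn(4 * hidden, hidden)),
    proj=LinearConfig(weight=torch.randn(hidden, 4 * hidden)),
    hidden_act="gelu",
)
layer = DecoderLayerConfig(
    decoder_type="gpt2",  # illustrative value
    input_layernorm=LayernormConfig(weight=torch.ones(hidden)),
    attention=attention,
    post_layernorm=LayernormConfig(weight=torch.ones(hidden)),
    mlp=mlp,
    num_attention_heads=2,
    max_position_embeddings=128,
)
model = ModelConfig(
    dtype="float16",
    vocab_size=32,
    vocab_embedding=embedding,
    layers=[layer],
    ln_f=LayernormConfig(weight=torch.ones(hidden)),
    lm_head=LinearConfig(weight=torch.randn(32, hidden)),
)
print(len(model.layers))  # 1
```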
- class AttentionConfig
Bases:
object
The attention layer config.
- __init__(qkv=None, dense=None, kv_cache_scaling_factor=None, kv_cache_dtype=None, rotary_dim=-inf, clip_qkv=None, rel_attn_table=None)
- Parameters:
qkv (QKVConfig | LinearConfig) –
dense (LinearConfig) –
kv_cache_scaling_factor (Tensor) –
kv_cache_dtype (str | None) –
rotary_dim (int) –
clip_qkv (float) –
rel_attn_table (Tensor) –
- Return type:
None
- clip_qkv: float = None
- dense: LinearConfig = None
- kv_cache_dtype: str | None = None
- kv_cache_scaling_factor: Tensor = None
- qkv: QKVConfig | LinearConfig = None
- rel_attn_table: Tensor = None
- rotary_dim: int = -inf
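The qkv field accepts either a QKVConfig with separate Q, K and V layers or an already-merged LinearConfig. A short sketch of both forms, with toy tensors and an assumed import path:

```python
# Two ways to populate AttentionConfig.qkv (toy tensors; assumed import path).
import torch

from modelopt.torch.export.model_config import (  # assumed import path
    AttentionConfig,
    LinearConfig,
    QKVConfig,
)

hidden = 8

# 1) Keep Q, K and V as separate linear layers.
attn_split = AttentionConfig(
    qkv=QKVConfig(
        q=LinearConfig(weight=torch.randn(hidden, hidden)),
        k=LinearConfig(weight=torch.randn(hidden, hidden)),
        v=LinearConfig(weight=torch.randn(hidden, hidden)),
    ),
    dense=LinearConfig(weight=torch.randn(hidden, hidden)),
)

# 2) Provide an already-merged QKV weight as a single LinearConfig.
attn_merged = AttentionConfig(
    qkv=LinearConfig(weight=torch.randn(3 * hidden, hidden)),
    dense=LinearConfig(weight=torch.randn(hidden, hidden)),
)
```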
- class ConvConfig
Bases:
object
The Conv layer config.
- __init__(quantization=None, weight=None, bias=None)
- Parameters:
quantization (str | None) –
weight (Tensor) –
bias (Tensor) –
- Return type:
None
- bias: Tensor = None
- quantization: str | None = None
- weight: Tensor = None
- class DecoderLayerConfig
Bases:
object
The decoder layer config.
- __init__(quantization=None, decoder_type='', input_layernorm=None, mlp_layernorm=None, attention=None, recurrent=None, post_layernorm=None, pre_feedforward_layernorm=None, post_feedforward_layernorm=None, mlp=None, num_attention_heads=0, attention_head_size=None, num_kv_heads=0, max_position_embeddings=0, rotary_pct=1.0, use_alibi=False, new_decoder_architecture=False, parallel_attention=False, apply_residual_connection_post_layernorm=False, use_cache=True, chatglm_version='', rope_ratio=1.0, seq_length=0, qwen_type='', rotary_base=0, partial_rotary_factor=0, original_max_position_embeddings=0, longrope_scaling_short_factors=None, longrope_scaling_long_factors=None, mup_attn_multiplier=0, mup_embedding_multiplier=0, mup_use_scaling=0, mup_width_multiplier=0, blocksparse_block_size=0, blocksparse_homo_head_pattern=False, blocksparse_num_local_blocks=0, blocksparse_vertical_stride=0, dense_attention_every_n_layers=0, gegelu_limit=0, longrope_short_mscale=0, longrope_long_mscale=0, moe_num_experts=0, moe_top_k=0, moe_tp_mode=0, moe_renorm_mode=0, alibi_bias_max=0, residual_layernorm=None, residual_mlp=None, rnn_hidden_size=0, logits_soft_cap=0, emb_scale_by_sqrt_dim=False, layer_types=<factory>, attn_replacing_linear=None, mlp_replacing_linear=None, block_config=None, final_logit_softcapping=0, attn_logit_softcapping=0, query_pre_attn_scalar=0, clip_qkv=0, cross_attention=None, cross_attention_layernorm=None, self_attention=None, self_attention_layernorm=None, attention_layernorm=None, rel_attn_max_distance=0, rel_attn_num_buckets=0, rope_scaling=None)
- Parameters:
quantization (str | None) –
decoder_type (str) –
input_layernorm (LayernormConfig) –
mlp_layernorm (LayernormConfig) –
attention (AttentionConfig) –
recurrent (RecurrentConfig) –
post_layernorm (LayernormConfig) –
pre_feedforward_layernorm (LayernormConfig) –
post_feedforward_layernorm (LayernormConfig) –
mlp (MLPConfig) –
num_attention_heads (int) –
attention_head_size (int) –
num_kv_heads (int) –
max_position_embeddings (int) –
rotary_pct (float) –
use_alibi (bool) –
new_decoder_architecture (bool) –
parallel_attention (bool) –
apply_residual_connection_post_layernorm (bool) –
use_cache (bool) –
chatglm_version (str) –
rope_ratio (float) –
seq_length (int) –
qwen_type (str) –
rotary_base (int) –
partial_rotary_factor (float) –
original_max_position_embeddings (int) –
longrope_scaling_short_factors (List[float]) –
longrope_scaling_long_factors (List[float]) –
mup_attn_multiplier (float) –
mup_embedding_multiplier (float) –
mup_use_scaling (float) –
mup_width_multiplier (float) –
blocksparse_block_size (int) –
blocksparse_homo_head_pattern (bool) –
blocksparse_num_local_blocks (int) –
blocksparse_vertical_stride (int) –
dense_attention_every_n_layers (int) –
gegelu_limit (float) –
longrope_short_mscale (float) –
longrope_long_mscale (float) –
moe_num_experts (int) –
moe_top_k (int) –
moe_tp_mode (int) –
moe_renorm_mode (int) –
alibi_bias_max (int) –
residual_layernorm (LayernormConfig) –
residual_mlp (MLPConfig) –
rnn_hidden_size (int) –
logits_soft_cap (float) –
emb_scale_by_sqrt_dim (bool) –
layer_types (List[str]) –
attn_replacing_linear (LinearConfig) –
mlp_replacing_linear (LinearConfig) –
block_config (dict) –
final_logit_softcapping (float) –
attn_logit_softcapping (float) –
query_pre_attn_scalar (float) –
clip_qkv (int) –
cross_attention (AttentionConfig) –
cross_attention_layernorm (LayernormConfig) –
self_attention (AttentionConfig) –
self_attention_layernorm (LayernormConfig) –
attention_layernorm (LayernormConfig) –
rel_attn_max_distance (int) –
rel_attn_num_buckets (int) –
rope_scaling (dict) –
- Return type:
None
- alibi_bias_max: int = 0
- apply_residual_connection_post_layernorm: bool = False
- attention: AttentionConfig = None
- attention_head_size: int = None
- attention_layernorm: LayernormConfig = None
- attn_logit_softcapping: float = 0
- attn_replacing_linear: LinearConfig = None
- block_config: dict = None
- blocksparse_block_size: int = 0
- blocksparse_homo_head_pattern: bool = False
- blocksparse_num_local_blocks: int = 0
- blocksparse_vertical_stride: int = 0
- chatglm_version: str = ''
- clip_qkv: int = 0
- cross_attention: AttentionConfig = None
- cross_attention_layernorm: LayernormConfig = None
- decoder_type: str = ''
- dense_attention_every_n_layers: int = 0
- emb_scale_by_sqrt_dim: bool = False
- property ffn_hidden_size_local
Returns the ffn hidden size of the transformer model.
- final_logit_softcapping: float = 0
- gegelu_limit: float = 0
- property hidden_size
Returns the hidden size of the transformer model.
- input_layernorm: LayernormConfig = None
- layer_types: List[str]
- logits_soft_cap: float = 0
- longrope_long_mscale: float = 0
- longrope_scaling_long_factors: List[float] = None
- longrope_scaling_short_factors: List[float] = None
- longrope_short_mscale: float = 0
- max_position_embeddings: int = 0
- mlp: MLPConfig = None
- mlp_layernorm: LayernormConfig = None
- mlp_replacing_linear: LinearConfig = None
- moe_num_experts: int = 0
- moe_renorm_mode: int = 0
- moe_top_k: int = 0
- moe_tp_mode: int = 0
- mup_attn_multiplier: float = 0
- mup_embedding_multiplier: float = 0
- mup_use_scaling: float = 0
- mup_width_multiplier: float = 0
- new_decoder_architecture: bool = False
- num_attention_heads: int = 0
- num_kv_heads: int = 0
- original_max_position_embeddings: int = 0
- parallel_attention: bool = False
- partial_rotary_factor: float = 0
- post_feedforward_layernorm: LayernormConfig = None
- post_layernorm: LayernormConfig = None
- pre_feedforward_layernorm: LayernormConfig = None
- quantization: str | None = None
- query_pre_attn_scalar: float = 0
- qwen_type: str = ''
- recurrent: RecurrentConfig = None
- rel_attn_max_distance: int = 0
- rel_attn_num_buckets: int = 0
- residual_layernorm: LayernormConfig = None
- residual_mlp: MLPConfig = None
- rnn_hidden_size: int = 0
- rope_ratio: float = 1.0
- rope_scaling: dict = None
- rotary_base: int = 0
- rotary_pct: float = 1.0
- self_attention: AttentionConfig = None
- self_attention_layernorm: LayernormConfig = None
- seq_length: int = 0
- use_alibi: bool = False
- use_cache: bool = True
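Most of these fields are only meaningful for specific architectures. The sketch below fills in the fields for a grouped-query-attention, RoPE-based layer; all values are toy placeholders and the import path and decoder_type value are assumptions.

```python
# Illustrative grouped-query-attention layer: fewer KV heads than query heads.
# All values are toy placeholders (assumed import path).
import torch

from modelopt.torch.export.model_config import (  # assumed import path
    AttentionConfig,
    DecoderLayerConfig,
    LinearConfig,
)

hidden, heads, kv_heads, head_size = 64, 8, 2, 8

layer = DecoderLayerConfig(
    decoder_type="llama",  # illustrative value
    attention=AttentionConfig(
        # fused QKV: query projections plus kv_heads K and V projections
        qkv=LinearConfig(weight=torch.randn((heads + 2 * kv_heads) * head_size, hidden)),
        dense=LinearConfig(weight=torch.randn(hidden, heads * head_size)),
    ),
    num_attention_heads=heads,
    num_kv_heads=kv_heads,  # num_kv_heads < num_attention_heads => grouped-query attention
    attention_head_size=head_size,
    max_position_embeddings=4096,
    rotary_base=10000,
)
```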
- class EmbeddingConfig
Bases:
object
The embedding layer config.
- __init__(weight=None, quantization=None)
- Parameters:
weight (Tensor) –
quantization (str | None) –
- Return type:
None
- property hidden_size
Infers the hidden_size from the embedding layer weights shape.
- property local_vocab_size
Infers the vocab_size from the embedding layer weights shape.
- quantization: str | None = None
- weight: Tensor = None
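EmbeddingConfig only stores the weight tensor; its properties infer sizes from that tensor's shape. A minimal sketch, where the [vocab, hidden] weight layout and the import path are assumptions for illustration:

```python
# Sizes are inferred from the weight shape; the [vocab, hidden] layout used
# here is an assumption (assumed import path).
import torch

from modelopt.torch.export.model_config import EmbeddingConfig  # assumed path

emb = EmbeddingConfig(weight=torch.randn(32000, 4096))
print(emb.local_vocab_size)  # vocab size inferred from the weight shape
# emb.hidden_size likewise infers the hidden size from the weight shape.
```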
- class ExpertConfig
Bases:
object
The Expert config.
- __init__(fc=None, proj=None)
- Parameters:
fc (LinearConfig) –
proj (LinearConfig) –
- Return type:
None
- fc: LinearConfig = None
- proj: LinearConfig = None
- class LayernormConfig
Bases:
object
The layernorm layer config.
- __init__(quantization=None, weight=None, bias=None, layernorm_type='', eps=1e-05)
- Parameters:
quantization (str | None) –
weight (Tensor) –
bias (Tensor) –
layernorm_type (str) –
eps (float) –
- Return type:
None
- bias: Tensor = None
- eps: float = 1e-05
- layernorm_type: str = ''
- quantization: str | None = None
- weight: Tensor = None
- class LinearActConfig
Bases:
object
The linear + activation layer config.
- __init__(linear=None, hidden_act='')
- Parameters:
linear (LinearConfig) –
hidden_act (str) –
- Return type:
None
- hidden_act: str = ''
- linear: LinearConfig = None
- class LinearConfig
Bases:
object
The linear layer config.
- __init__(quantization=None, linear_type='column', weight=None, bias=None, activation_scaling_factor=None, weights_scaling_factor=None, weights_scaling_factor_2=None, prequant_scaling_factor=None, awq_block_size=0)
- Parameters:
quantization (str | None) –
linear_type (str) –
weight (Tensor) –
bias (Tensor) –
activation_scaling_factor (Tensor) –
weights_scaling_factor (Tensor) –
weights_scaling_factor_2 (Tensor) –
prequant_scaling_factor (Tensor) –
awq_block_size (int) –
- Return type:
None
- activation_scaling_factor: Tensor = None
- awq_block_size: int = 0
- bias: Tensor = None
- linear_type: str = 'column'
- prequant_scaling_factor: Tensor = None
- quantization: str | None = None
- weight: Tensor = None
- weights_scaling_factor: Tensor = None
- weights_scaling_factor_2: Tensor = None
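For quantized checkpoints, LinearConfig carries the scaling-factor tensors alongside the weight. The sketch below populates FP8-style fields; the quantization string, scaling-factor shapes, and import path are assumptions for illustration.

```python
# Illustrative FP8-style quantized linear layer (the quantization string and
# scaling-factor shapes are assumptions; assumed import path).
import torch

from modelopt.torch.export.model_config import LinearConfig  # assumed path

fc = LinearConfig(
    quantization="fp8",                              # assumed format string
    linear_type="column",
    weight=torch.randn(4096, 4096, dtype=torch.float16),
    activation_scaling_factor=torch.tensor([0.02]),  # per-tensor activation scale (toy)
    weights_scaling_factor=torch.tensor([0.01]),     # per-tensor weight scale (toy)
)
```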
- class MLPConfig
Bases:
object
The MLP layer config.
- __init__(fc=None, gate=None, proj=None, hidden_act='', merged_fc1_gate=False)
- Parameters:
fc (LinearConfig) –
gate (LinearConfig) –
proj (LinearConfig) –
hidden_act (str) –
merged_fc1_gate (bool) –
- Return type:
None
- fc: LinearConfig = None
- gate: LinearConfig = None
- hidden_act: str = ''
- merged_fc1_gate: bool = False
- proj: LinearConfig = None
- property quantization
Returns the merged gate and fc1 quantization.
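MLPConfig covers both plain and gated feed-forward blocks: fc (and optionally gate) project up to the intermediate size and proj projects back down. A toy SwiGLU-style sketch with an assumed import path; the meaning given for merged_fc1_gate in the comment is an interpretation, not something the reference states.

```python
# Illustrative gated MLP: fc and gate project up, proj projects back down
# (toy tensors; assumed import path).
import torch

from modelopt.torch.export.model_config import LinearConfig, MLPConfig  # assumed path

hidden, inter = 64, 256

mlp = MLPConfig(
    fc=LinearConfig(weight=torch.randn(inter, hidden)),
    gate=LinearConfig(weight=torch.randn(inter, hidden)),
    proj=LinearConfig(weight=torch.randn(hidden, inter)),
    hidden_act="silu",
    merged_fc1_gate=False,  # whether fc and gate are stored as one merged weight (interpretation)
)
```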
- class MOEConfig
Bases:
object
The Mixture of Expert layer config.
- __init__(router=None, experts=None, hidden_act='')
- Parameters:
router (LinearConfig) –
experts (ExpertConfig) –
hidden_act (str) –
- Return type:
None
- experts: ExpertConfig = None
- property fc
Return the fc module from experts.
- hidden_act: str = ''
- router: LinearConfig = None
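A Mixture-of-Experts block pairs a router with an ExpertConfig holding the expert weights. In the sketch below, stacking expert weights along a leading num_experts dimension and the import path are assumptions for illustration.

```python
# Illustrative MoE block: a router plus expert weights. The stacked expert
# weight layout is an assumption (toy tensors; assumed import path).
import torch

from modelopt.torch.export.model_config import ExpertConfig, LinearConfig, MOEConfig  # assumed path

hidden, inter, num_experts = 64, 256, 8

moe = MOEConfig(
    router=LinearConfig(weight=torch.randn(num_experts, hidden)),
    experts=ExpertConfig(
        fc=LinearConfig(weight=torch.randn(num_experts, inter, hidden)),
        proj=LinearConfig(weight=torch.randn(num_experts, hidden, inter)),
    ),
    hidden_act="silu",
)
print(moe.fc.weight.shape)  # the fc property returns the fc module from experts
```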
- class MedusaHeadConfig
Bases:
object
The Medusa head config.
- __init__(medusa_layers=None, lm_head=None)
- Parameters:
medusa_layers (List[LinearActConfig]) –
lm_head (LinearConfig) –
- Return type:
None
- lm_head: LinearConfig = None
- medusa_layers: List[LinearActConfig] = None
- class ModelConfig
Bases:
object
The full LLM model config that includes the full information needed for tensorrt_llm engine building.
This class includes all the fields that tensorrt_llm supports, but not all of them are required. pipeline_parallel > 1 is only supported for TensorRT-LLM checkpoints.
- __init__(version=0.0, quantization=None, dtype='float16', vocab_size=0, rank=0, tensor_parallel=1, pipeline_parallel=1, vocab_embedding=None, position_embedding=None, block_embedding=None, ln_embed=None, layers=<factory>, ln_f=None, lm_head=None, share_embedding_table=False, medusa_heads=None, num_medusa_heads=0, num_medusa_layers=0, enc_dec='', encoder_hidden_size=0, encoder_num_heads=0, encoder_head_size=0)
- Parameters:
version (float) –
quantization (str) –
dtype (str) –
vocab_size (int) –
rank (int) –
tensor_parallel (int) –
pipeline_parallel (int) –
vocab_embedding (EmbeddingConfig) –
position_embedding (EmbeddingConfig) –
block_embedding (EmbeddingConfig) –
ln_embed (LayernormConfig) –
layers (List[DecoderLayerConfig]) –
ln_f (LayernormConfig) –
lm_head (LinearConfig) –
share_embedding_table (bool) –
medusa_heads (List[MedusaHeadConfig]) –
num_medusa_heads (int) –
num_medusa_layers (int) –
enc_dec (str) –
encoder_hidden_size (int) –
encoder_num_heads (int) –
encoder_head_size (int) –
- Return type:
None
- block_embedding: EmbeddingConfig = None
- dtype: str = 'float16'
- enc_dec: str = ''
- encoder_head_size: int = 0
- encoder_hidden_size: int = 0
- encoder_num_heads: int = 0
- property hidden_act
Returns the hidden_act of the model.
- property hidden_size
Returns the hidden_size of the model.
- layers: List[DecoderLayerConfig]
- lm_head: LinearConfig = None
- ln_embed: LayernormConfig = None
- ln_f: LayernormConfig = None
- property max_position_embeddings
Returns the max_position_embedding of the model.
- medusa_heads: List[MedusaHeadConfig] = None
- property num_attention_heads
Returns the num_attention_heads of the model.
- property num_kv_heads
Returns the num_key_value_heads of the model.
- num_medusa_heads: int = 0
- num_medusa_layers: int = 0
- pipeline_parallel: int = 1
- position_embedding: EmbeddingConfig = None
- quantization: str = None
- rank: int = 0
- share_embedding_table: bool = False
- tensor_parallel: int = 1
- version: float = 0.0
- vocab_embedding: EmbeddingConfig = None
- vocab_size: int = 0
- property vocab_size_padded
Returns the vocab_size of the model, padded (rounded up) to a multiple of tensor_parallel.
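vocab_size_padded rounds the vocabulary size up for tensor-parallel sharding. The helper below is an illustrative stand-in for that computation (padding to a multiple of tensor_parallel, as documented above), not the library's implementation; the import path is an assumption.

```python
# Illustrative: what the documented vocab padding amounts to. This mirrors the
# vocab_size_padded docstring and is not the library's code (assumed import path).
from modelopt.torch.export.model_config import ModelConfig  # assumed path

model = ModelConfig(dtype="float16", vocab_size=32003, rank=0, tensor_parallel=4)

def pad_vocab(vocab_size: int, tp: int) -> int:
    """Round vocab_size up to the next multiple of tp."""
    return ((vocab_size + tp - 1) // tp) * tp

print(pad_vocab(model.vocab_size, model.tensor_parallel))  # 32004
```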
- class QKVConfig
Bases:
object
The QKV layer config.
- __init__(q=None, k=None, v=None)
- Parameters:
q (LinearConfig) –
k (LinearConfig) –
v (LinearConfig) –
- Return type:
None
- property activation_scaling_factor
Returns the merged activation_scaling_factor across Q, K and V.
The max of the Q, K, V activation scaling factors is returned.
- property awq_block_size
Returns the awq_block_size of this QKV layer.
- property bias
The generated linear layer bias.
The Q, K and V biases are concatenated to fit the TensorRT-LLM QKV linear layer.
- k: LinearConfig = None
- property prequant_scaling_factor
Returns the merged prequant_scaling_factor across Q, K and V.
Prequant scaling factors for Q, K and V should be the same, so one of them is returned.
- q: LinearConfig = None
- property quantization
Returns the quantization format of this QKV layer.
- v: LinearConfig = None
- property weight
The generated linear layer weight.
The Q, K and V weights are concatenated to fit the TensorRT-LLM QKV linear layer.
- property weights_scaling_factor
Returns the merged weights_scaling_factor across Q, K and V.
If the quantization is FP8, the max of the Q, K and V weight scaling factors is returned. If the quantization is INT8_SQ, the concatenated value is returned.
- property weights_scaling_factor_2
Returns the merged weights_scaling_factor_2 across Q, K and V.
weight_scaling_factor_2 is needed for W4A8 AWQ.
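The weight and bias properties expose a single fused tensor built from the separate Q, K and V layers. The sketch below shows what that documented concatenation produces conceptually; concatenating along the output (first) dimension and the import path are assumptions for illustration.

```python
# Conceptual sketch of the documented QKV fusion: separate Q, K, V linear
# weights become one tensor for TensorRT-LLM's fused QKV layer. Concatenating
# along the output (first) dimension is an assumption (assumed import path).
import torch

from modelopt.torch.export.model_config import LinearConfig, QKVConfig  # assumed path

hidden = 8
q = LinearConfig(weight=torch.randn(hidden, hidden))
k = LinearConfig(weight=torch.randn(hidden, hidden))
v = LinearConfig(weight=torch.randn(hidden, hidden))
qkv = QKVConfig(q=q, k=k, v=v)

fused = torch.cat([q.weight, k.weight, v.weight], dim=0)  # what qkv.weight is documented to expose
print(fused.shape)  # torch.Size([24, 8])
```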
- class RecurrentConfig
Bases:
object
The RecurrentBlock from recurrentgemma.
- __init__(linear_y=None, y_bias=None, linear_x=None, linear_out=None, conv1d=None, rg_lru=None)
- Parameters:
linear_y (LinearConfig) –
y_bias (Tensor) –
linear_x (LinearConfig) –
linear_out (LinearConfig) –
conv1d (ConvConfig) –
rg_lru (RgLruConfig) –
- Return type:
None
- conv1d: ConvConfig = None
- linear_out: LinearConfig = None
- linear_x: LinearConfig = None
- linear_y: LinearConfig = None
- rg_lru: RgLruConfig = None
- y_bias: Tensor = None
- class RgLruConfig
Bases:
object
The RG LRU from recurrentgemma.
- __init__(recurrent_param=None, input_gate=None, recurrent_gate=None)
- Parameters:
recurrent_param (Tensor) –
input_gate (LinearConfig) –
recurrent_gate (LinearConfig) –
- Return type:
None
- input_gate: LinearConfig = None
- recurrent_gate: LinearConfig = None
- recurrent_param: Tensor = None