model_config
This module defines the model_config format.
This format can be converted from Hugging Face, NeMo, or ModelOpt-quantized models, and the tensorrt_llm engine is built from the context saved in this format.
Classes
- AttentionConfig – The attention layer config.
- ConvConfig – The Conv layer config.
- DecoderLayerConfig – The decoder layer config.
- EmbeddingConfig – The embedding layer config.
- ExpertConfig – The Expert config.
- LayernormConfig – The layernorm layer config.
- LinearActConfig – The linear + activation layer config.
- LinearConfig – The linear layer config.
- MLPConfig – The MLP layer config.
- MOEConfig – The Mixture of Expert layer config.
- MedusaHeadConfig – The Medusa head config.
- ModelConfig – The full LLM model config that includes the full information needed for tensorrt_llm engine building.
- QKVConfig – The QKV layer config.
- RecurrentConfig – The RecurrentBlock from recurrentgemma.
- RgLruConfig – The RG LRU from recurrentgemma.
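These classes are plain dataclasses that nest into one another, with ModelConfig at the top. The following is a minimal sketch of that nesting, using tiny random tensors purely for illustration; the import path and the decoder_type value are assumptions (adjust to your installation), and this is not a real model conversion.

```python
# Minimal, illustrative sketch of how the config dataclasses nest.
# The import path is an assumption; tensor shapes are arbitrary toy values.
import torch

from modelopt.torch.export.model_config import (  # assumed import path
    AttentionConfig,
    DecoderLayerConfig,
    EmbeddingConfig,
    LayernormConfig,
    LinearConfig,
    MLPConfig,
    ModelConfig,
)

hidden = 8  # toy hidden size

embedding = EmbeddingConfig(weight=torch.randn(32, hidden))
attention = AttentionConfig(
    qkv=LinearConfig(weight=torch.randn(3 * hidden, hidden)),  # merged QKV weight
    dense=LinearConfig(weight=torch.randn(hidden, hidden)),
)
mlp = MLPConfig(
    fc=LinearConfig(weight=torch.randn(4 * hidden, hidden)),
    proj=LinearConfig(weight=torch.randn(hidden, 4 * hidden)),
    hidden_act="gelu",
)
layer = DecoderLayerConfig(
    decoder_type="gpt2",  # illustrative value
    input_layernorm=LayernormConfig(weight=torch.ones(hidden)),
    attention=attention,
    post_layernorm=LayernormConfig(weight=torch.ones(hidden)),
    mlp=mlp,
    num_attention_heads=2,
    max_position_embeddings=128,
)
model = ModelConfig(
    dtype="float16",
    vocab_size=32,
    vocab_embedding=embedding,
    layers=[layer],
    ln_f=LayernormConfig(weight=torch.ones(hidden)),
    lm_head=LinearConfig(weight=torch.randn(32, hidden)),
)
print(len(model.layers))  # 1
```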
- class AttentionConfig
Bases:
object
The attention layer config.
- __init__(qkv=None, dense=None, kv_cache_scaling_factor=None, kv_cache_dtype=None, rotary_dim=-inf, clip_qkv=None, rel_attn_table=None)
- Parameters:
qkv (QKVConfig | LinearConfig) –
dense (LinearConfig) –
kv_cache_scaling_factor (Tensor) –
kv_cache_dtype (str | None) –
rotary_dim (int) –
clip_qkv (float) –
rel_attn_table (Tensor) –
- Return type:
None
- clip_qkv: float = None
- dense: LinearConfig = None
- kv_cache_dtype: str | None = None
- kv_cache_scaling_factor: Tensor = None
- qkv: QKVConfig | LinearConfig = None
- rel_attn_table: Tensor = None
- rotary_dim: int = -inf
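The qkv field accepts either a QKVConfig with separate Q, K and V layers or an already-merged LinearConfig. A short sketch of both forms, with toy tensors and an assumed import path:

```python
# Two ways to populate AttentionConfig.qkv (toy tensors; assumed import path).
import torch

from modelopt.torch.export.model_config import (  # assumed import path
    AttentionConfig,
    LinearConfig,
    QKVConfig,
)

hidden = 8

# 1) Keep Q, K and V as separate linear layers.
attn_split = AttentionConfig(
    qkv=QKVConfig(
        q=LinearConfig(weight=torch.randn(hidden, hidden)),
        k=LinearConfig(weight=torch.randn(hidden, hidden)),
        v=LinearConfig(weight=torch.randn(hidden, hidden)),
    ),
    dense=LinearConfig(weight=torch.randn(hidden, hidden)),
)

# 2) Provide an already-merged QKV weight as a single LinearConfig.
attn_merged = AttentionConfig(
    qkv=LinearConfig(weight=torch.randn(3 * hidden, hidden)),
    dense=LinearConfig(weight=torch.randn(hidden, hidden)),
)
```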
- class ConvConfig
Bases:
object
The Conv layer config.
- __init__(quantization=None, weight=None, bias=None)
- Parameters:
quantization (str | None) –
weight (Tensor) –
bias (Tensor) –
- Return type:
None
- bias: Tensor = None
- quantization: str | None = None
- weight: Tensor = None
- class DecoderLayerConfig
Bases:
object
The decoder layer config.
- __init__(quantization=None, decoder_type='', input_layernorm=None, mlp_layernorm=None, attention=None, recurrent=None, post_layernorm=None, pre_feedforward_layernorm=None, post_feedforward_layernorm=None, mlp=None, num_attention_heads=0, attention_head_size=None, num_kv_heads=0, max_position_embeddings=0, rotary_pct=1.0, use_alibi=False, new_decoder_architecture=False, parallel_attention=False, apply_residual_connection_post_layernorm=False, use_cache=True, chatglm_version='', rope_ratio=1.0, seq_length=0, qwen_type='', rotary_base=0, partial_rotary_factor=0, original_max_position_embeddings=0, longrope_scaling_short_factors=None, longrope_scaling_long_factors=None, mup_attn_multiplier=0, mup_embedding_multiplier=0, mup_use_scaling=0, mup_width_multiplier=0, blocksparse_block_size=0, blocksparse_homo_head_pattern=False, blocksparse_num_local_blocks=0, blocksparse_vertical_stride=0, dense_attention_every_n_layers=0, gegelu_limit=0, longrope_short_mscale=0, longrope_long_mscale=0, moe_num_experts=0, moe_top_k=0, moe_tp_mode=0, moe_renorm_mode=0, alibi_bias_max=0, residual_layernorm=None, residual_mlp=None, rnn_hidden_size=0, logits_soft_cap=0, emb_scale_by_sqrt_dim=False, layer_types=<factory>, attn_replacing_linear=None, mlp_replacing_linear=None, block_config=None, final_logit_softcapping=0, attn_logit_softcapping=0, query_pre_attn_scalar=0, clip_qkv=0, cross_attention=None, cross_attention_layernorm=None, self_attention=None, self_attention_layernorm=None, attention_layernorm=None, rel_attn_max_distance=0, rel_attn_num_buckets=0, rope_scaling=None)
- Parameters:
quantization (str | None) –
decoder_type (str) –
input_layernorm (LayernormConfig) –
mlp_layernorm (LayernormConfig) –
attention (AttentionConfig) –
recurrent (RecurrentConfig) –
post_layernorm (LayernormConfig) –
pre_feedforward_layernorm (LayernormConfig) –
post_feedforward_layernorm (LayernormConfig) –
mlp (MLPConfig) –
num_attention_heads (int) –
attention_head_size (int) –
num_kv_heads (int) –
max_position_embeddings (int) –
rotary_pct (float) –
use_alibi (bool) –
new_decoder_architecture (bool) –
parallel_attention (bool) –
apply_residual_connection_post_layernorm (bool) –
use_cache (bool) –
chatglm_version (str) –
rope_ratio (float) –
seq_length (int) –
qwen_type (str) –
rotary_base (int) –
partial_rotary_factor (float) –
original_max_position_embeddings (int) –
longrope_scaling_short_factors (List[float]) –
longrope_scaling_long_factors (List[float]) –
mup_attn_multiplier (float) –
mup_embedding_multiplier (float) –
mup_use_scaling (float) –
mup_width_multiplier (float) –
blocksparse_block_size (int) –
blocksparse_homo_head_pattern (bool) –
blocksparse_num_local_blocks (int) –
blocksparse_vertical_stride (int) –
dense_attention_every_n_layers (int) –
gegelu_limit (float) –
longrope_short_mscale (float) –
longrope_long_mscale (float) –
moe_num_experts (int) –
moe_top_k (int) –
moe_tp_mode (int) –
moe_renorm_mode (int) –
alibi_bias_max (int) –
residual_layernorm (LayernormConfig) –
residual_mlp (MLPConfig) –
rnn_hidden_size (int) –
logits_soft_cap (float) –
emb_scale_by_sqrt_dim (bool) –
layer_types (List[str]) –
attn_replacing_linear (LinearConfig) –
mlp_replacing_linear (LinearConfig) –
block_config (dict) –
final_logit_softcapping (float) –
attn_logit_softcapping (float) –
query_pre_attn_scalar (float) –
clip_qkv (int) –
cross_attention (AttentionConfig) –
cross_attention_layernorm (LayernormConfig) –
self_attention (AttentionConfig) –
self_attention_layernorm (LayernormConfig) –
attention_layernorm (LayernormConfig) –
rel_attn_max_distance (int) –
rel_attn_num_buckets (int) –
rope_scaling (dict) –
- Return type:
None
- alibi_bias_max: int = 0
- apply_residual_connection_post_layernorm: bool = False
- attention: AttentionConfig = None
- attention_head_size: int = None
- attention_layernorm: LayernormConfig = None
- attn_logit_softcapping: float = 0
- attn_replacing_linear: LinearConfig = None
- block_config: dict = None
- blocksparse_block_size: int = 0
- blocksparse_homo_head_pattern: bool = False
- blocksparse_num_local_blocks: int = 0
- blocksparse_vertical_stride: int = 0
- chatglm_version: str = ''
- clip_qkv: int = 0
- cross_attention: AttentionConfig = None
- cross_attention_layernorm: LayernormConfig = None
- decoder_type: str = ''
- dense_attention_every_n_layers: int = 0
- emb_scale_by_sqrt_dim: bool = False
- property ffn_hidden_size_local
Returns the ffn hidden size of the transformer model.
- final_logit_softcapping: float = 0
- gegelu_limit: float = 0
- property hidden_size
Returns the hidden size of the transformer model.
- input_layernorm: LayernormConfig = None
- layer_types: List[str]
- logits_soft_cap: float = 0
- longrope_long_mscale: float = 0
- longrope_scaling_long_factors: List[float] = None
- longrope_scaling_short_factors: List[float] = None
- longrope_short_mscale: float = 0
- max_position_embeddings: int = 0
- mlp: MLPConfig = None
- mlp_layernorm: LayernormConfig = None
- mlp_replacing_linear: LinearConfig = None
- moe_num_experts: int = 0
- moe_renorm_mode: int = 0
- moe_top_k: int = 0
- moe_tp_mode: int = 0
- mup_attn_multiplier: float = 0
- mup_embedding_multiplier: float = 0
- mup_use_scaling: float = 0
- mup_width_multiplier: float = 0
- new_decoder_architecture: bool = False
- num_attention_heads: int = 0
- num_kv_heads: int = 0
- original_max_position_embeddings: int = 0
- parallel_attention: bool = False
- partial_rotary_factor: float = 0
- post_feedforward_layernorm: LayernormConfig = None
- post_layernorm: LayernormConfig = None
- pre_feedforward_layernorm: LayernormConfig = None
- quantization: str | None = None
- query_pre_attn_scalar: float = 0
- qwen_type: str = ''
- recurrent: RecurrentConfig = None
- rel_attn_max_distance: int = 0
- rel_attn_num_buckets: int = 0
- residual_layernorm: LayernormConfig = None
- residual_mlp: MLPConfig = None
- rnn_hidden_size: int = 0
- rope_ratio: float = 1.0
- rope_scaling: dict = None
- rotary_base: int = 0
- rotary_pct: float = 1.0
- self_attention: AttentionConfig = None
- self_attention_layernorm: LayernormConfig = None
- seq_length: int = 0
- use_alibi: bool = False
- use_cache: bool = True
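Most of these fields are only meaningful for specific architectures. The sketch below fills in the fields for a grouped-query-attention, RoPE-based layer; all values are toy placeholders and the import path and decoder_type value are assumptions.

```python
# Illustrative grouped-query-attention layer: fewer KV heads than query heads.
# All values are toy placeholders (assumed import path).
import torch

from modelopt.torch.export.model_config import (  # assumed import path
    AttentionConfig,
    DecoderLayerConfig,
    LinearConfig,
)

hidden, heads, kv_heads, head_size = 64, 8, 2, 8

layer = DecoderLayerConfig(
    decoder_type="llama",  # illustrative value
    attention=AttentionConfig(
        # fused QKV: query projections plus kv_heads K and V projections
        qkv=LinearConfig(weight=torch.randn((heads + 2 * kv_heads) * head_size, hidden)),
        dense=LinearConfig(weight=torch.randn(hidden, heads * head_size)),
    ),
    num_attention_heads=heads,
    num_kv_heads=kv_heads,  # num_kv_heads < num_attention_heads => grouped-query attention
    attention_head_size=head_size,
    max_position_embeddings=4096,
    rotary_base=10000,
)
```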
- class EmbeddingConfig
Bases:
object
The embedding layer config.
- __init__(weight=None, quantization=None)
- Parameters:
weight (Tensor) –
quantization (str | None) –
- Return type:
None
- property hidden_size
Infers the hidden_size from the embedding layer weights shape.
- property local_vocab_size
Infers the vocab_size from the embedding layer weights shape.
- quantization: str | None = None
- weight: Tensor = None
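EmbeddingConfig only stores the weight tensor; its properties infer sizes from that tensor's shape. A minimal sketch, where the [vocab, hidden] weight layout and the import path are assumptions for illustration:

```python
# Sizes are inferred from the weight shape; the [vocab, hidden] layout used
# here is an assumption (assumed import path).
import torch

from modelopt.torch.export.model_config import EmbeddingConfig  # assumed path

emb = EmbeddingConfig(weight=torch.randn(32000, 4096))
print(emb.local_vocab_size)  # vocab size inferred from the weight shape
# emb.hidden_size likewise infers the hidden size from the weight shape.
```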
- class ExpertConfig
Bases:
object
The Expert config.
- __init__(fc=None, proj=None)
- Parameters:
fc (LinearConfig) –
proj (LinearConfig) –
- Return type:
None
- fc: LinearConfig = None
- proj: LinearConfig = None
- class LayernormConfig
Bases:
object
The layernorm layer config.
- __init__(quantization=None, weight=None, bias=None, layernorm_type='', eps=1e-05)
- Parameters:
quantization (str | None) –
weight (Tensor) –
bias (Tensor) –
layernorm_type (str) –
eps (float) –
- Return type:
None
- bias: Tensor = None
- eps: float = 1e-05
- layernorm_type: str = ''
- quantization: str | None = None
- weight: Tensor = None
- class LinearActConfig
Bases:
object
The linear + activation layer config.
- __init__(linear=None, hidden_act='')
- Parameters:
linear (LinearConfig) –
hidden_act (str) –
- Return type:
None
- hidden_act: str = ''
- linear: LinearConfig = None
- class LinearConfig
Bases:
object
The linear layer config.
- __init__(quantization=None, linear_type='column', weight=None, bias=None, activation_scaling_factor=None, weights_scaling_factor=None, weights_scaling_factor_2=None, prequant_scaling_factor=None, awq_block_size=0)
- Parameters:
quantization (str | None) –
linear_type (str) –
weight (Tensor) –
bias (Tensor) –
activation_scaling_factor (Tensor) –
weights_scaling_factor (Tensor) –
weights_scaling_factor_2 (Tensor) –
prequant_scaling_factor (Tensor) –
awq_block_size (int) –
- Return type:
None
- activation_scaling_factor: Tensor = None
- awq_block_size: int = 0
- bias: Tensor = None
- linear_type: str = 'column'
- prequant_scaling_factor: Tensor = None
- quantization: str | None = None
- weight: Tensor = None
- weights_scaling_factor: Tensor = None
- weights_scaling_factor_2: Tensor = None
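For quantized checkpoints, LinearConfig carries the scaling-factor tensors alongside the weight. The sketch below populates FP8-style fields; the quantization string, scaling-factor shapes, and import path are assumptions for illustration.

```python
# Illustrative FP8-style quantized linear layer (the quantization string and
# scaling-factor shapes are assumptions; assumed import path).
import torch

from modelopt.torch.export.model_config import LinearConfig  # assumed path

fc = LinearConfig(
    quantization="fp8",                              # assumed format string
    linear_type="column",
    weight=torch.randn(4096, 4096, dtype=torch.float16),
    activation_scaling_factor=torch.tensor([0.02]),  # per-tensor activation scale (toy)
    weights_scaling_factor=torch.tensor([0.01]),     # per-tensor weight scale (toy)
)
```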
- class MLPConfig
Bases:
object
The MLP layer config.
- __init__(fc=None, gate=None, proj=None, hidden_act='', merged_fc1_gate=False)
- Parameters:
fc (LinearConfig) –
gate (LinearConfig) –
proj (LinearConfig) –
hidden_act (str) –
merged_fc1_gate (bool) –
- Return type:
None
- fc: LinearConfig = None
- gate: LinearConfig = None
- hidden_act: str = ''
- merged_fc1_gate: bool = False
- proj: LinearConfig = None
- property quantization
Returns the merged gate and fc1 quantization.
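MLPConfig covers both plain and gated feed-forward blocks: fc (and optionally gate) project up to the intermediate size and proj projects back down. A toy SwiGLU-style sketch with an assumed import path; the meaning given for merged_fc1_gate in the comment is an interpretation, not something the reference states.

```python
# Illustrative gated MLP: fc and gate project up, proj projects back down
# (toy tensors; assumed import path).
import torch

from modelopt.torch.export.model_config import LinearConfig, MLPConfig  # assumed path

hidden, inter = 64, 256

mlp = MLPConfig(
    fc=LinearConfig(weight=torch.randn(inter, hidden)),
    gate=LinearConfig(weight=torch.randn(inter, hidden)),
    proj=LinearConfig(weight=torch.randn(hidden, inter)),
    hidden_act="silu",
    merged_fc1_gate=False,  # whether fc and gate are stored as one merged weight (interpretation)
)
```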
- class MOEConfig
Bases:
object
The Mixture of Expert layer config.
- __init__(router=None, experts=None, hidden_act='')
- Parameters:
router (LinearConfig) –
experts (ExpertConfig) –
hidden_act (str) –
- Return type:
None
- experts: ExpertConfig = None
- property fc
Return the fc module from experts.
- hidden_act: str = ''
- router: LinearConfig = None
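A Mixture-of-Experts block pairs a router with an ExpertConfig holding the expert weights. In the sketch below, stacking expert weights along a leading num_experts dimension and the import path are assumptions for illustration.

```python
# Illustrative MoE block: a router plus expert weights. The stacked expert
# weight layout is an assumption (toy tensors; assumed import path).
import torch

from modelopt.torch.export.model_config import ExpertConfig, LinearConfig, MOEConfig  # assumed path

hidden, inter, num_experts = 64, 256, 8

moe = MOEConfig(
    router=LinearConfig(weight=torch.randn(num_experts, hidden)),
    experts=ExpertConfig(
        fc=LinearConfig(weight=torch.randn(num_experts, inter, hidden)),
        proj=LinearConfig(weight=torch.randn(num_experts, hidden, inter)),
    ),
    hidden_act="silu",
)
print(moe.fc.weight.shape)  # the fc property returns the fc module from experts
```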
- class MedusaHeadConfig
Bases:
object
The Medusa head config.
- __init__(medusa_layers=None, lm_head=None)
- Parameters:
medusa_layers (List[LinearActConfig]) –
lm_head (LinearConfig) –
- Return type:
None
- lm_head: LinearConfig = None
- medusa_layers: List[LinearActConfig] = None
- class ModelConfig
Bases:
object
The full LLM model config that includes the full information needed for tensorrt_llm engine building.
This class includes all the fields that tensorrt_llm supports, but not all of them are required. pipeline_parallel > 1 is only supported for TensorRT-LLM checkpoints.
- __init__(version=0.0, quantization=None, dtype='float16', vocab_size=0, rank=0, tensor_parallel=1, pipeline_parallel=1, vocab_embedding=None, position_embedding=None, block_embedding=None, ln_embed=None, layers=<factory>, ln_f=None, lm_head=None, share_embedding_table=False, medusa_heads=None, num_medusa_heads=0, num_medusa_layers=0, enc_dec='', encoder_hidden_size=0, encoder_num_heads=0, encoder_head_size=0)
- Parameters:
version (float) –
quantization (str) –
dtype (str) –
vocab_size (int) –
rank (int) –
tensor_parallel (int) –
pipeline_parallel (int) –
vocab_embedding (EmbeddingConfig) –
position_embedding (EmbeddingConfig) –
block_embedding (EmbeddingConfig) –
ln_embed (LayernormConfig) –
layers (List[DecoderLayerConfig]) –
ln_f (LayernormConfig) –
lm_head (LinearConfig) –
share_embedding_table (bool) –
medusa_heads (List[MedusaHeadConfig]) –
num_medusa_heads (int) –
num_medusa_layers (int) –
enc_dec (str) –
encoder_hidden_size (int) –
encoder_num_heads (int) –
encoder_head_size (int) –
- Return type:
None
- block_embedding: EmbeddingConfig = None
- dtype: str = 'float16'
- enc_dec: str = ''
- encoder_head_size: int = 0
- encoder_hidden_size: int = 0
- encoder_num_heads: int = 0
- property hidden_act
Returns the hidden_act of the model.
- property hidden_size
Returns the hidden_size of the model.
- layers: List[DecoderLayerConfig]
- lm_head: LinearConfig = None
- ln_embed: LayernormConfig = None
- ln_f: LayernormConfig = None
- property max_position_embeddings
Returns the max_position_embedding of the model.
- medusa_heads: List[MedusaHeadConfig] = None
- property num_attention_heads
Returns the num_attention_heads of the model.
- property num_kv_heads
Returns the num_key_value_heads of the model.
- num_medusa_heads: int = 0
- num_medusa_layers: int = 0
- pipeline_parallel: int = 1
- position_embedding: EmbeddingConfig = None
- quantization: str = None
- rank: int = 0
- share_embedding_table: bool = False
- tensor_parallel: int = 1
- version: float = 0.0
- vocab_embedding: EmbeddingConfig = None
- vocab_size: int = 0
- property vocab_size_padded
Returns the vocab_size of the model, padded (rounded up) to a multiple of tensor_parallel.
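vocab_size_padded rounds the vocabulary size up for tensor-parallel sharding. The helper below is an illustrative stand-in for that computation (padding to a multiple of tensor_parallel, as documented above), not the library's implementation; the import path is an assumption.

```python
# Illustrative: what the documented vocab padding amounts to. This mirrors the
# vocab_size_padded docstring and is not the library's code (assumed import path).
from modelopt.torch.export.model_config import ModelConfig  # assumed path

model = ModelConfig(dtype="float16", vocab_size=32003, rank=0, tensor_parallel=4)

def pad_vocab(vocab_size: int, tp: int) -> int:
    """Round vocab_size up to the next multiple of tp."""
    return ((vocab_size + tp - 1) // tp) * tp

print(pad_vocab(model.vocab_size, model.tensor_parallel))  # 32004
```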
- class QKVConfig
Bases:
object
The QKV layer config.
- __init__(q=None, k=None, v=None)
- Parameters:
q (LinearConfig) –
k (LinearConfig) –
v (LinearConfig) –
- Return type:
None
- property activation_scaling_factor
Returns the merged activation_scaling_factor across Q, K and V.
The max of the Q, K, V activation scaling factors is returned.
- property awq_block_size
Returns the awq_block_size of this QKV layer.
- property bias
The generated linear layer bias.
The Q, K and V biases are concatenated to fit the TensorRT-LLM QKV linear layer.
- k: LinearConfig = None
- property prequant_scaling_factor
Returns the merged prequant_scaling_factor across Q, K and V.
Prequant scaling factors for Q, K and V should be the same, so one of them is returned.
- q: LinearConfig = None
- property quantization
Returns the quantization format of this QKV layer.
- v: LinearConfig = None
- property weight
The generated linear layer weight.
The Q, K and V weights are concatenated to fit the TensorRT-LLM QKV linear layer.
- property weights_scaling_factor
Returns the merged weights_scaling_factor across Q, K and V.
If the quantization is FP8, the max of the Q, K and V weight scaling factors is returned. If the quantization is INT8_SQ, the concatenated value is returned.
- property weights_scaling_factor_2
Returns the merged weights_scaling_factor_2 across Q, K and V.
weight_scaling_factor_2 is needed for W4A8 AWQ.
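The weight and bias properties expose a single fused tensor built from the separate Q, K and V layers. The sketch below shows what that documented concatenation produces conceptually; concatenating along the output (first) dimension and the import path are assumptions for illustration.

```python
# Conceptual sketch of the documented QKV fusion: separate Q, K, V linear
# weights become one tensor for TensorRT-LLM's fused QKV layer. Concatenating
# along the output (first) dimension is an assumption (assumed import path).
import torch

from modelopt.torch.export.model_config import LinearConfig, QKVConfig  # assumed path

hidden = 8
q = LinearConfig(weight=torch.randn(hidden, hidden))
k = LinearConfig(weight=torch.randn(hidden, hidden))
v = LinearConfig(weight=torch.randn(hidden, hidden))
qkv = QKVConfig(q=q, k=k, v=v)

fused = torch.cat([q.weight, k.weight, v.weight], dim=0)  # what qkv.weight is documented to expose
print(fused.shape)  # torch.Size([24, 8])
```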
- class RecurrentConfig
Bases:
object
The RecurrentBlock from recurrentgemma.
- __init__(linear_y=None, y_bias=None, linear_x=None, linear_out=None, conv1d=None, rg_lru=None)
- Parameters:
linear_y (LinearConfig) –
y_bias (Tensor) –
linear_x (LinearConfig) –
linear_out (LinearConfig) –
conv1d (ConvConfig) –
rg_lru (RgLruConfig) –
- Return type:
None
- conv1d: ConvConfig = None
- linear_out: LinearConfig = None
- linear_x: LinearConfig = None
- linear_y: LinearConfig = None
- rg_lru: RgLruConfig = None
- y_bias: Tensor = None
- class RgLruConfig
Bases:
object
The RG LRU from recurrentgemma.
- __init__(recurrent_param=None, input_gate=None, recurrent_gate=None)
- Parameters:
recurrent_param (Tensor) –
input_gate (LinearConfig) –
recurrent_gate (LinearConfig) –
- Return type:
None
- input_gate: LinearConfig = None
- recurrent_gate: LinearConfig = None
- recurrent_param: Tensor = None