layer_utils

Utils for model_config export.

Some of the logic in this file is empirical and needs to be updated whenever exceptions occur.

Functions

build_attention_config

Builds the attention config from the module.

build_conv_config

Builds the conv config for this module.

build_decoder_config

Builds the full decoder config from the module.

build_embedding_config

Builds the embedding config from the module.

build_layernorm_config

Builds the layernorm config from the module.

build_linear_config

Builds the linear config for the module.

build_medusa_heads_config

Builds a list of MedusaHeadConfig if Medusa heads exist.

build_mlp_config

Builds the MLP config for the module.

build_moe_config

Builds the MOE config for the module.

build_qkv

Converts the qkv modules to the config.

build_recurrent_config

Builds the recurrent config for this module.

build_stacked_experts

Builds the experts_weight_1 and experts_weight_2 configs for the experts.

check_model_compatibility

Returns whether the list of modules is compatible with the export logic.

get_activation_scaling_factor

Returns the activation scaling factor.

get_kv_cache_dtype

Returns the kv_cache dtype.

get_kv_cache_scaling_factor

Returns the kv_cache scaling factor if output quantizer is set.

get_prequant_scaling_factor

Returns the prequant scaling factor.

get_qkv_and_avg_prequant_scale

Get the qkv and average prequant scaling factor for the module.

get_quantization_format

Gets the quantization string.

get_scaling_factor

Returns scaling factor from the quantizer as torch.Tensor.

get_transformer_layers

Returns the root module of the transformer model.

get_weight_block_size

Returns the weight block size.

get_weight_scaling_factor

Returns the weight scaling factor.

get_weight_scaling_factor_2

Returns the secondary weight scaling factor.

is_attention

Returns whether the module is an attention layer.

is_decoder_list

Returns whether the module is a decoder list.

is_embedding

Returns whether the module is an embedding layer.

is_layernorm

Returns whether the module is a layernorm layer.

is_linear

Returns whether the module is a linear layer.

is_mlp

Returns whether the module is an MLP layer.

is_moe

Returns whether the module is an MOE layer.

is_quantlinear

Returns whether the module is a quantized linear layer.

is_recurrent

Returns whether the module is a recurrent layer.

build_attention_config(module, model_metadata_config, dtype, ext_config=None)

Builds the attention config from the module.

Return type:

AttentionConfig

build_conv_config(module, dtype)

Builds the conv config for this module.

Parameters:
  • module (Module) –

  • dtype (dtype) –

Return type:

ConvConfig

build_decoder_config(module, model_metadata_config, decoder_type, dtype)

Builds the full decoder config from the module.

Parameters:
  • module (Module) –

  • decoder_type (str) –

  • dtype (dtype) –

Return type:

DecoderLayerConfig
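
For illustration, a minimal sketch of exporting a single decoder block, assuming the module is importable as modelopt.torch.export.layer_utils (adjust to your installation), that the Hugging Face Llama-style attribute path model.model.layers applies, that an empty model_metadata_config is acceptable, and that "llama" is the right decoder_type string; none of these are mandated by this page:

    # Sketch: build a DecoderLayerConfig for one transformer block.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import build_decoder_config  # adjust import

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    block = model.model.layers[0]  # model-specific attribute path (assumption)

    layer_config = build_decoder_config(
        block,
        model_metadata_config={},  # assumption: no extra metadata needed here
        decoder_type="llama",      # assumption: model-family identifier string
        dtype=torch.float16,
    )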

build_embedding_config(module, dtype, normalization_constant=1)

Builds the embedding config from the module.

Parameters:
  • module (Module) –

  • dtype (dtype) –

  • normalization_constant (float) –

Return type:

EmbeddingConfig
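
For illustration, a short sketch assuming the same import path as above and the standard Hugging Face accessor model.get_input_embeddings(); normalization_constant is left at its default of 1:

    # Sketch: build an EmbeddingConfig for the token-embedding table.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import build_embedding_config, is_embedding

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    embedding = model.get_input_embeddings()
    assert is_embedding(embedding)

    embedding_config = build_embedding_config(embedding, dtype=torch.float16)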

build_layernorm_config(module, dtype)

Builds the layernorm config from the module.

Parameters:
  • module (Module) –

  • dtype (dtype) –

Return type:

LayernormConfig

build_linear_config(module, linear_type, dtype)

Builds the linear config for the module.

Parameters:
  • module (Module) –

  • linear_type (str) –

  • dtype (dtype) –

Return type:

LinearConfig

build_medusa_heads_config(model, dtype)

Builds a list of MedusaHeadConfig if Medusa heads exist.

Following TensorRT-LLM's Medusa implementation, all Medusa heads (num_medusa_heads) should be placed inside a 'torch.nn.ModuleList' with the attribute name 'medusa_heads'. A Medusa head consists of an additional 'lm_head' (vocab_size, hidden_size) and a list (num_medusa_layers) of Medusa layers (LinearActConfig). The only supported hidden_act for these layers is 'silu'. All Linear layers are column-parallel.

Parameters:
  • model (Module | None) –

  • dtype (dtype) –

Return type:

List[MedusaHeadConfig] | None
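
For illustration, a sketch of a module layout matching the description above. The class definition and the medusa_layers attribute name are assumptions for illustration only (if your Medusa implementation uses different internal names, the builder expects those names instead); medusa_heads and lm_head are the names the description requires:

    # Sketch: a hypothetical Medusa head layout attached to a stand-in root module.
    import torch
    import torch.nn as nn

    from modelopt.torch.export.layer_utils import build_medusa_heads_config  # adjust import

    class MedusaHead(nn.Module):
        def __init__(self, hidden_size: int, vocab_size: int, num_medusa_layers: int):
            super().__init__()
            # Stack of Medusa layers; the attribute name is an assumption.
            self.medusa_layers = nn.ModuleList(
                nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU())
                for _ in range(num_medusa_layers)
            )
            # The additional lm_head described above: (vocab_size, hidden_size).
            self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    model = nn.Module()  # stand-in root module for illustration
    # build_medusa_heads_config looks for a ModuleList named `medusa_heads`.
    model.medusa_heads = nn.ModuleList(
        MedusaHead(hidden_size=2048, vocab_size=32000, num_medusa_layers=1)
        for _ in range(4)  # num_medusa_heads
    )

    medusa_configs = build_medusa_heads_config(model, dtype=torch.float16)  # None if no heads found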

build_mlp_config(module, decoder_type, dtype)

Builds the MLP config for the module.

Parameters:
  • module (Module) –

  • dtype (dtype) –

Return type:

MLPConfig

build_moe_config(module, decoder_type, dtype)

Builds the MOE config for the module.

Parameters:
  • module (Module) –

  • dtype (dtype) –

Return type:

MOEConfig

build_qkv(qkv_modules, model_metadata_config, dtype, ext_config=None)

Converts the qkv modules to the config.

Return type:

QKVConfig
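
For illustration, a sketch that groups the q/k/v projections of one attention module, assuming the import path used above, the Llama-style q_proj/k_proj/v_proj attribute names, and an empty metadata dict:

    # Sketch: convert the separate q/k/v projections into a single QKVConfig.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import build_qkv  # adjust import

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    attn = model.model.layers[0].self_attn  # model-specific attribute path (assumption)

    qkv_config = build_qkv(
        [attn.q_proj, attn.k_proj, attn.v_proj],
        model_metadata_config={},  # assumption: no extra metadata needed here
        dtype=torch.float16,
    )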

build_recurrent_config(module, dtype)

Builds the recurrent config for this module.

Parameters:
  • module (Module) –

  • dtype (dtype) –

build_stacked_experts(experts, dtype, linear_names, num_experts, expert_getter)

Builds the experts_weight_1 and experts_weight_2 configs for the experts.

Parameters:
  • experts (Module) –

  • dtype (dtype) –

  • linear_names (List[str]) –

check_model_compatibility(module_list)

Returns whether the list of modules is compatible with the export logic, and whether a positional embedding and an embedding layernorm exist.

We assume the model is assembled from one or two embedding layers, a ModuleList of transformer decoders, and a final layernorm, with an optional embedding layernorm. Other layouts are not supported.

Parameters:

module_list (List[Module]) –

Return type:

Tuple[bool, bool, bool]
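
For illustration, a sketch that validates the model layout before export, assuming the import path used above; the interpretation of the three returned booleans (compatible, positional embedding present, embedding layernorm present) follows the description above:

    # Sketch: check exportability, then locate the decoder ModuleList.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import (
        check_model_compatibility,
        get_transformer_layers,
        is_decoder_list,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    modules = get_transformer_layers(model)
    compatible, has_pos_embedding, has_embedding_ln = check_model_compatibility(modules)
    if not compatible:
        raise ValueError("Model layout is not supported by the export logic.")

    # The ModuleList of transformer decoders is one of the returned modules.
    decoder_list = next(m for m in modules if is_decoder_list(m))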

get_activation_scaling_factor(module)

Returns the activation scaling factor.

Parameters:

module (Module) –

Return type:

Tensor

get_kv_cache_dtype(modules)

Returns the kv_cache dtype.

If the num_bits of the output_quantizer is (4, 3), returns FP8; if it is 8, returns int8; otherwise returns None.

Parameters:

modules (Union[List[nn.Module], nn.Module]) – The module or list of modules to inspect.

Returns:

The kv_cache dtype.

Return type:

str

get_kv_cache_scaling_factor(qkv_modules)

Returns the kv_cache scaling factor if the output quantizer is set; otherwise returns None.

Parameters:

qkv_modules (List[Module]) –

Return type:

Tensor
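
For illustration, a sketch that inspects the KV-cache settings of one attention module's q/k/v projections, assuming the import path and attribute names used above; on an unquantized model both calls simply report the absence of output quantizers:

    # Sketch: read the KV-cache dtype and scaling factor from the qkv projections.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import (
        get_kv_cache_dtype,
        get_kv_cache_scaling_factor,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    attn = model.model.layers[0].self_attn  # model-specific attribute path (assumption)
    qkv_modules = [attn.q_proj, attn.k_proj, attn.v_proj]

    kv_dtype = get_kv_cache_dtype(qkv_modules)           # "FP8", "int8", or None
    kv_scale = get_kv_cache_scaling_factor(qkv_modules)  # None without an output quantizer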

get_prequant_scaling_factor(module, dtype)

Returns the prequant scaling factor.

Parameters:
  • module (Module) –

  • dtype (dtype) –

Return type:

Tensor

get_qkv_and_avg_prequant_scale(module, dtype)

Get the qkv and average prequant scaling factor for the module.

Parameters:
  • module – The module containing q, k, and v submodules.

  • dtype – The data type for the scaling factors.

Returns:

A tuple containing the average prequant scaling factor and individual scaling factors for q, k, and v.

Return type:

tuple

get_quantization_format(module)

Gets the quantization string.

Gets the quantization string by iterating through the module and its children. The first non-None quantization string is returned.

Return type:

str | None
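
For illustration, a one-call sketch assuming the import path used above; on a model that has not been quantized the result is None:

    # Sketch: report the quantization format of a module tree.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import get_quantization_format

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    qformat = get_quantization_format(model)  # first non-None format found, or None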

get_scaling_factor(quantizer)

Returns scaling factor from the quantizer as torch.Tensor.

Parameters:

quantizer (TensorQuantizer) –

Return type:

Tensor

get_transformer_layers(model)

Returns the root module of the transformer model.

Parameters:

model (Module) –

Return type:

List[Module]

get_weight_block_size(module)

Returns the weight block size.

Parameters:

module (Module) –

Return type:

int

get_weight_scaling_factor(module)

Returns the weight scaling factor.

Parameters:

module (Module) –

Return type:

Tensor

get_weight_scaling_factor_2(module)

Returns the secondary weight scaling factor.

Parameters:

module (Module) –

Return type:

Tensor
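
For illustration, a sketch that walks a model and collects per-layer scaling factors for every quantized linear layer, assuming the import path used above; for a real export the model would be quantized first, and which of these scales are meaningful depends on the quantization format that was applied:

    # Sketch: gather weight/activation scaling factors per quantized linear layer.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import (
        get_activation_scaling_factor,
        get_weight_block_size,
        get_weight_scaling_factor,
        get_weight_scaling_factor_2,
        is_quantlinear,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )

    scales = {}
    for name, submodule in model.named_modules():
        if not is_quantlinear(submodule):
            continue  # skip everything that is not a quantized linear layer
        scales[name] = {
            "activation_scale": get_activation_scaling_factor(submodule),
            "weight_scale": get_weight_scaling_factor(submodule),
            "weight_scale_2": get_weight_scaling_factor_2(submodule),
            "weight_block_size": get_weight_block_size(submodule),
        }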

is_attention(module)

Returns whether the module is an attention layer.

Parameters:

module (Module) –

Return type:

bool

is_decoder_list(module)

Returns whether the module is a decoder list.

Parameters:

module (Module) –

Return type:

bool

is_embedding(module)

Returns whether the module is an embedding layer.

Parameters:

module (Module) –

Return type:

bool

is_layernorm(module)

Returns whether the module is a layernorm layer.

Parameters:

module (Module) –

Return type:

bool

is_linear(module)

Returns whether the module is a linear layer.

Parameters:

module (Module) –

Return type:

bool

is_mlp(module)

Returns whether the module is an MLP layer.

Parameters:

module (Module) –

Return type:

bool

is_moe(module)

Returns whether the module is an MOE layer.

Parameters:

module (Module) –

Return type:

bool

is_quantlinear(module)

Returns whether the module is a quantized linear layer.

Parameters:

module (Module) –

Return type:

bool

is_recurrent(module)

Returns whether the module is a recurrent layer.

Parameters:

module (Module) –

Return type:

bool
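
For illustration, a sketch that uses the is_* predicates to route a decoder block's children to the matching builders, under the same import-path, attribute-path, decoder_type, and metadata assumptions as the earlier sketches; real export code also handles qkv grouping and other details these checks do not cover:

    # Sketch: classify the children of one decoder block and build their configs.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.layer_utils import (
        build_attention_config,
        build_layernorm_config,
        build_mlp_config,
        build_moe_config,
        is_attention,
        is_layernorm,
        is_mlp,
        is_moe,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )
    block = model.model.layers[0]  # model-specific attribute path (assumption)

    for name, child in block.named_children():
        if is_attention(child):
            cfg = build_attention_config(child, {}, torch.float16)
        elif is_moe(child):
            cfg = build_moe_config(child, "llama", torch.float16)
        elif is_mlp(child):
            cfg = build_mlp_config(child, "llama", torch.float16)
        elif is_layernorm(child):
            cfg = build_layernorm_config(child, torch.float16)
        else:
            cfg = None
        print(name, type(cfg).__name__ if cfg is not None else "skipped")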