anymodel

AnyModel: Architecture-agnostic model compression for HuggingFace models.

This module provides a declarative approach to model compression that works with any HuggingFace model without requiring custom modeling code. Instead of duplicating HuggingFace modeling classes, AnyModel uses ModelDescriptors that define:

  1. Which decoder layer class(es) to patch for heterogeneous configs

  2. How to map BlockConfig to layer-specific overrides

  3. Weight name patterns for subblock checkpointing

Example usage:
>>> from modelopt.torch.puzzletron.anymodel import convert_model
>>> convert_model(
...     input_dir="path/to/hf_checkpoint",
...     output_dir="path/to/anymodel_checkpoint",
...     converter="llama",
... )

Supported models (see CLASS_MAPPING below for the registered converters):
  • llama: Llama 2, Llama 3, Llama 3.1, Llama 3.2

  • qwen2, qwen3, qwen3_vl

  • mistral_small

  • nemotron_h, nemotron_h_v2

  • gpt_oss

Classes

Converter

Base class for converting HuggingFace models to Puzzletron/AnyModel format.

ConverterFactory

Factory for registering and retrieving Converter classes.

ModelDescriptor

Base class describing a model architecture: which decoder layer class(es) to patch, how to map BlockConfig to layer overrides, and weight name patterns for subblock checkpointing.

ModelDescriptorFactory

Factory for registering and retrieving ModelDescriptor classes.

MatchingZeros

Module that returns zeros matching the input shape.

Same

Module that returns the input unchanged.

Functions

deci_x_patcher

Context manager that patches decoder layer __init__ for heterogeneous per-layer configs.

return_tuple_of_size

Create a wrapper class that returns a tuple of the given size.

convert_model

Convert a HuggingFace model to AnyModel format.

class Converter

Bases: ABC

Base class for converting HuggingFace models to Puzzletron/AnyModel format.

classmethod convert(descriptor, input_dir, output_dir)

Convert a HuggingFace model to AnyModel format.

Parameters:
  • descriptor (ModelDescriptor) – Model descriptor for the model type.

  • input_dir (Path) – Path to the input HuggingFace checkpoint.

  • output_dir (Path) – Path to the output AnyModel checkpoint.

classmethod convert_configs_in_dirs(input_dir, output_dir, trust_remote_code=False)

Convert config and add block_configs.

Parameters:
  • input_dir (Path)

  • output_dir (Path)

  • trust_remote_code (bool)

classmethod convert_model_weights(input_dir, output_dir, descriptor, num_hidden_layers)

Convert model weights to subblock format.

Parameters:
  • input_dir (Path)

  • output_dir (Path)

  • descriptor (ModelDescriptor)

  • num_hidden_layers (int)

static convert_weight_name(name)

Convert weight names during checkpoint conversion.

This method can be overridden by subclasses to apply model-specific weight name transformations when converting checkpoints from HuggingFace format to Puzzletron format.

Default implementation returns the name unchanged (identity function).

Parameters:

name (str) – Original weight name from HuggingFace checkpoint

Returns:

Converted weight name for Puzzletron format

Return type:

str

Example

For Qwen2.5-VL, this converts:

  • visual.* → model.visual.*

  • model.* → model.language_model.*
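The mapping above could be implemented roughly as follows. This is a sketch: the prefixes follow the Qwen2.5-VL example, and the real converter's edge-case handling may differ:

```python
def convert_weight_name(name: str) -> str:
    """Sketch of a model-specific override following the Qwen2.5-VL mapping:
    visual.* -> model.visual.* and model.* -> model.language_model.*."""
    if name.startswith("visual."):
        return "model." + name
    if name.startswith("model."):
        return "model.language_model." + name[len("model."):]
    return name  # default behavior: identity


assert convert_weight_name("visual.patch_embed.proj.weight") == "model.visual.patch_embed.proj.weight"
assert convert_weight_name("model.layers.0.mlp.down_proj.weight") == "model.language_model.layers.0.mlp.down_proj.weight"
assert convert_weight_name("lm_head.weight") == "lm_head.weight"
```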

static copy_checkpoint_files(input_dir, output_dir)

Copy checkpoint files except model weights (which will be converted).

Parameters:
  • input_dir (Path)

  • output_dir (Path)

abstract static create_block_configs_from_main_config(config)

Create per-layer BlockConfig list from a HuggingFace model config.

This method extracts layer-specific parameters (e.g., intermediate_size, num_key_value_heads) from the main model config and creates a BlockConfig for each layer. These BlockConfigs enable layer-specific pruning and modifications during the compression pipeline.

Parameters:

config (PretrainedConfig) – HuggingFace PretrainedConfig (e.g., LlamaConfig, Qwen2Config)

Returns:

List of BlockConfig, one per hidden layer. Each BlockConfig contains:

  • AttentionConfig: attention settings (no_op, num_key_value_heads)

  • FFNConfig: FFN settings (no_op, intermediate_size)

Return type:

List[BlockConfig]

Example

For a model with uniform layers (e.g., Llama):

>>> return [BlockConfig(…)] * config.num_hidden_layers

For a model with heterogeneous layers (e.g., NemotronH with Mamba/Attention):

>>> return [BlockConfig(…) for layer_idx in range(num_layers)]
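A self-contained sketch of the uniform-layer case. The dataclasses below are illustrative stand-ins, since the real BlockConfig/AttentionConfig/FFNConfig classes and their exact signatures live elsewhere in puzzletron:

```python
from dataclasses import dataclass
from types import SimpleNamespace


# Stand-ins for the puzzletron config classes (illustrative fields only).
@dataclass
class AttentionConfig:
    no_op: bool
    num_key_value_heads: int


@dataclass
class FFNConfig:
    no_op: bool
    intermediate_size: int


@dataclass
class BlockConfig:
    attention: AttentionConfig
    ffn: FFNConfig


def create_block_configs_from_main_config(config):
    """Uniform-layer case: every layer gets the same BlockConfig."""
    block = BlockConfig(
        attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
        ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
    )
    return [block] * config.num_hidden_layers


# A SimpleNamespace stands in for a HuggingFace PretrainedConfig.
cfg = SimpleNamespace(num_hidden_layers=4, num_key_value_heads=8, intermediate_size=14336)
blocks = create_block_configs_from_main_config(cfg)
assert len(blocks) == 4
assert blocks[0].ffn.intermediate_size == 14336
```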

class ConverterFactory

Bases: object

Factory for registering and retrieving Converter classes.

CLASS_MAPPING = {'gpt_oss': <class 'modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_converter.GptOssConverter'>, 'llama': <class 'modelopt.torch.puzzletron.anymodel.models.llama.llama_converter.LlamaConverter'>, 'mistral_small': <class 'modelopt.torch.puzzletron.anymodel.models.mistral_small.mistral_small_converter.MistralSmallConverter'>, 'nemotron_h': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h.nemotron_h_converter.NemotronHConverter'>, 'nemotron_h_v2': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2.nemotron_h_v2_converter.NemotronHV2Converter'>, 'qwen2': <class 'modelopt.torch.puzzletron.anymodel.models.qwen2.qwen2_converter.Qwen2Converter'>, 'qwen3': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_8b.qwen3_8b_converter.Qwen3_8BConverter'>, 'qwen3_vl': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_vl_30b_a3b_instruct.qwen3_vl_30b_a3b_instruct_converter.Qwen3VL30BA3BInstructConverter'>}

classmethod get(value)

Get a registered converter by name or return the converter if already resolved.

Parameters:

value (str | ModelDescriptor)

classmethod register(**entries)

Register converter classes.

Raises:

KeyError – if entry key is already in type_dict and points to a different class.

Parameters:

entries (Type)

classmethod register_decorator(name)

Set up a register decorator.

Parameters:

name (str | None) – If specified, the decorated object will be registered with this name.

Returns:

Decorator that registers the callable.

Return type:

Callable
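The register/get/register_decorator behavior documented above can be sketched as a simplified standalone factory (not the actual modelopt implementation):

```python
class FactorySketch:
    """Minimal register/get factory mirroring the documented behavior."""

    CLASS_MAPPING: dict = {}

    @classmethod
    def register(cls, **entries):
        for name, entry in entries.items():
            existing = cls.CLASS_MAPPING.get(name)
            if existing is not None and existing is not entry:
                # Mirrors the documented KeyError on conflicting registration.
                raise KeyError(f"{name!r} already registered to a different class")
            cls.CLASS_MAPPING[name] = entry

    @classmethod
    def get(cls, value):
        # Look up strings by name; return already-resolved classes as-is.
        if isinstance(value, str):
            return cls.CLASS_MAPPING[value]
        return value

    @classmethod
    def register_decorator(cls, name=None):
        def decorator(obj):
            cls.register(**{name or obj.__name__: obj})
            return obj
        return decorator


@FactorySketch.register_decorator("llama")
class LlamaConverterSketch:  # hypothetical converter class
    pass


assert FactorySketch.get("llama") is LlamaConverterSketch
assert FactorySketch.get(LlamaConverterSketch) is LlamaConverterSketch
```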

class MatchingZeros

Bases: Module

Module that returns zeros matching the input shape.

Used to replace MLP or attention layers with no-ops. Returns zeros because the hidden_states are added to the residuals, so a no-op implementation should leave the residual unchanged.

forward(hidden_states, *args, **kwargs)
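The residual-stream reasoning can be illustrated without torch: if the replaced subblock returns zeros, adding its output back to the residual is an identity. A toy sketch with plain lists standing in for tensors (the real module returns zeros shaped like the input tensor):

```python
class MatchingZerosSketch:
    """Toy stand-in for MatchingZeros: returns zeros shaped like the input."""

    def __call__(self, hidden_states, *args, **kwargs):
        return [0.0] * len(hidden_states)


residual = [0.5, -1.25, 3.0]
subblock_out = MatchingZerosSketch()(residual)

# The decoder layer computes residual + subblock(hidden_states);
# with zeros, the residual passes through unchanged.
new_residual = [r + o for r, o in zip(residual, subblock_out)]
assert new_residual == residual
```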

class ModelDescriptor

Bases: ABC

Base class that describes a model architecture for AnyModel: which decoder layer class(es) to patch, how to map BlockConfig to layer-specific overrides, and how to group weights for subblock checkpointing.

static attn_no_op_post_init(decoder_layer)

Post-init callback to alter a decoder layer so that Attention subblock performs as no-op.

It is recommended to use the utility modules from no_op.py to replace layers with dummy counterparts.

Example for replacing a layernorm layer with identity:

>>> decoder_layer.post_attention_layernorm = Same()

Example for replacing an attention layer with zeros:

>>> decoder_layer.self_attn = MatchingZeros()

If the attention layer returns multiple outputs, i.e., hidden_states, _ = self.self_attn(), use the util method return_tuple_of_size to return trailing None values:

>>> decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()

Parameters:

decoder_layer (Module)

classmethod attn_no_op_supported()

Check whether attn_no_op_post_init is overridden for attention no-op support.

abstract static block_config_to_layer_overrides(block_config)

Map between BlockConfig and layer config overrides.

These overrides are consumed by a specific decoder layer and by the whole model. Usage can be seen in deci_x_patcher under the method _patched_decoder_layer_init.

Example implementation to override the FFN intermediate size of a block:
>>> def block_config_to_layer_overrides(block_config: BlockConfig) -> Dict[str, Any]:
...     return {"intermediate_size": block_config.ffn.intermediate_size}

Parameters:

block_config (BlockConfig)

Return type:

Dict[str, Any]

classmethod create_dummy_block(original_layer, block_index)

Create a dummy block to replace a layer for sharded model initialization.

Parameters:
  • original_layer (Module)

  • block_index (int)

Return type:

Module

abstract static decoder_layer_cls()

Decoder layer class types to patch for heterogeneous config support.

In most cases this class will hold as attributes both FFN & attention layers.

Returns:

nn.Module class type or a list if several class types should be patched.

Return type:

Type[Module] | List[Type[Module]]

abstract static final_norm_name()

Return the name of the final normalization layer.

static get_language_model_config(config)

Get the language model config from a PretrainedConfig.

For regular LM models, returns the config itself. For VL/multimodal models with nested configs, override to return the language model portion (e.g., config.text_config for Qwen-VL).

classmethod get_weight_groups(layer_names, num_hidden_layers)

Group model weights to support the puzzle subblock checkpointing format.

This method uses the abstract method layer_name_predicates by default.

Parameters:
  • layer_names (Iterable[str]) – state_dict layer names of the model.

  • num_hidden_layers (int) – number of decoder layers in the model.

Returns:

Dictionary of group names to list of layer names per group, e.g.

>>> {
...     "embedding": ["model.embed_tokens.weight"],
...     "lm_head": ["lm_head.weight", "model.norm.weight"],
...     "block_0_ffn": ["model.layers.0.mlp.down_proj", ...],
...     "block_0_attention": ["model.layers.0.self_attn.q_proj", ...],
... }

Return type:

Dict[str, List[str]]

abstract static init_rotary_embedding(model, runtime)

Re-initialize the rotary embeddings based on an existing model.

In puzzletron we initialize a sharded model by first creating a meta model, then moving it to the actual device by loading the state_dict with the real weights.

Rotary embedding frequencies are tensor buffers that are created dynamically during __init__ and are not part of the model state_dict, so they cannot be restored after a meta-device initialization.

abstract static input_embedding_name()

Return the name of the input embedding layer.

abstract static layer_block_name(index)

Return the name of the decoder layer at the given index.

Parameters:

index (int)

abstract static layer_name_predicates(num_layers)

Return predicates for grouping model weights to support subblock checkpointing.

For every group name, return a regex predicate that determines whether a layer name belongs to the group.

Returns:

Dictionary of group name to regex pattern predicate.

Parameters:

num_layers (int)

Return type:

Dict[str, Pattern]
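A sketch of layer_name_predicates for Llama-style weight names, together with the default grouping that get_weight_groups performs on top of it. The regex patterns and group names are illustrative, following the get_weight_groups example above:

```python
import re


def layer_name_predicates(num_layers):
    """Sketch: regex predicates for Llama-style weight names."""
    predicates = {
        "embedding": re.compile(r"^model\.embed_tokens\."),
        "lm_head": re.compile(r"^(lm_head\.|model\.norm\.)"),
    }
    for i in range(num_layers):
        predicates[f"block_{i}_attention"] = re.compile(rf"^model\.layers\.{i}\.(self_attn|input_layernorm)\.")
        predicates[f"block_{i}_ffn"] = re.compile(rf"^model\.layers\.{i}\.(mlp|post_attention_layernorm)\.")
    return predicates


def get_weight_groups(layer_names, num_hidden_layers):
    """Default grouping: assign each weight to the first matching predicate."""
    predicates = layer_name_predicates(num_hidden_layers)
    groups = {name: [] for name in predicates}
    for weight in layer_names:
        for group, pattern in predicates.items():
            if pattern.match(weight):
                groups[group].append(weight)
                break
    return groups


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
groups = get_weight_groups(names, num_hidden_layers=1)
assert groups["embedding"] == ["model.embed_tokens.weight"]
assert groups["block_0_attention"] == ["model.layers.0.self_attn.q_proj.weight"]
assert groups["lm_head"] == ["model.norm.weight", "lm_head.weight"]
```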

static mlp_no_op_post_init(decoder_layer)

Post-init callback to alter a decoder layer so that FFN/mlp subblock performs as no-op.

It is recommended to use the utility modules from no_op.py to replace layers with dummy counterparts.

Example for replacing a layernorm layer with identity:

>>> decoder_layer.post_attention_layernorm = Same()

Example for replacing an MLP layer with zeros (zeros because hidden_states are added to the residual, so a no-op implementation leaves the residual unchanged):

>>> decoder_layer.mlp = MatchingZeros()

If the MLP layer to replace returns multiple outputs, i.e., hidden_states, _ = self.mlp(), use the util method return_tuple_of_size to return trailing None values:

>>> decoder_layer.mlp = return_tuple_of_size(MatchingZeros, size=2)()

Parameters:

decoder_layer (Module)

classmethod mlp_no_op_supported()

Check whether mlp_no_op_post_init is overridden for mlp no-op support.

Return type:

bool

abstract static output_embedding_name()

Return the name of the output embedding layer.

static requires_trust_remote_code()

Whether this model descriptor requires trust_remote_code=True for loading.

Models that use custom code (e.g., via auto_map in config) should override this to return True.

Returns:

True if trust_remote_code=True is required, False otherwise.

Return type:

bool

static uses_autocast()

Whether this model supports torch.autocast.

Some models (e.g., Qwen3-VL MoE) have dtype bugs under autocast. Override and return False for models that do not support autocast.

Return type:

bool

class ModelDescriptorFactory

Bases: object

Factory for registering and retrieving ModelDescriptor classes.

CLASS_MAPPING = {'gpt_oss': <class 'modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_model_descriptor.GptOssModelDescriptor'>, 'llama': <class 'modelopt.torch.puzzletron.anymodel.models.llama.llama_model_descriptor.LlamaModelDescriptor'>, 'mistral_small': <class 'modelopt.torch.puzzletron.anymodel.models.mistral_small.mistral_small_model_descriptor.MistralSmallModelDescriptor'>, 'nemotron_h': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h.nemotron_h_model_descriptor.NemotronHModelDescriptor'>, 'nemotron_h_v2': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2.nemotron_h_v2_model_descriptor.NemotronHV2ModelDescriptor'>, 'qwen2': <class 'modelopt.torch.puzzletron.anymodel.models.qwen2.qwen2_model_descriptor.Qwen2ModelDescriptor'>, 'qwen3': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_8b.qwen3_8b_model_descriptor.Qwen3_8BModelDescriptor'>, 'qwen3_vl': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_vl_30b_a3b_instruct.qwen3_vl_30b_a3b_instruct_model_descriptor.Qwen3VL30BA3BInstructModelDescriptor'>}

classmethod get(value)

Get a registered model descriptor by name or return the descriptor if already resolved.

Parameters:

value (str | ModelDescriptor)

classmethod register(**entries)

Register model descriptor classes.

Raises:

KeyError – if entry key is already in type_dict and points to a different class.

Parameters:

entries (Type)

classmethod register_decorator(name)

Set up a register decorator.

Parameters:

name (str | None) – If specified, the decorated object will be registered with this name.

Returns:

Decorator that registers the callable.

Return type:

Callable

class Same

Bases: Module

Module that returns the input unchanged.

Used to replace normalization layers with identity operations.

forward(hidden_states, *args, **kwargs)

property weight

Supports NemotronH with scoring_activations, where the lm_head dtype is accessed as self.lm_head.weight.dtype.

convert_model(input_dir, output_dir, converter)

Convert a HuggingFace model to AnyModel format.

This function converts a HuggingFace checkpoint to the AnyModel format used for compression. The conversion process:

  1. Copies non-weight files (config, tokenizer, etc.)

  2. Creates block_configs for each layer

  3. Reorganizes weights into subblock checkpoints

Parameters:
  • input_dir (str) – Path to the input HuggingFace checkpoint directory.

  • output_dir (str) – Path to the output AnyModel checkpoint directory.

  • converter (Converter | str) – Either a converter name (e.g., “llama”) or a Converter class.

Example

>>> convert_model(
...     input_dir="/path/to/Llama-3.1-8B-Instruct",
...     output_dir="/path/to/output/ckpts/teacher",
...     converter="llama",
... )
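The three steps above can be exercised on a toy checkpoint directory. This is an illustrative sketch, not the real converter: the file names are hypothetical, and the real pipeline also handles safetensors indices and the subblock weight reorganization:

```python
import json
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "hf_checkpoint"
    output_dir = Path(tmp) / "anymodel_checkpoint"
    input_dir.mkdir()
    output_dir.mkdir()
    (input_dir / "config.json").write_text(json.dumps({"num_hidden_layers": 2}))
    (input_dir / "tokenizer.json").write_text("{}")
    (input_dir / "model.safetensors").write_text("fake-weights")

    # Step 1: copy non-weight files (config, tokenizer, ...).
    for f in input_dir.iterdir():
        if f.suffix != ".safetensors":
            shutil.copy(f, output_dir / f.name)

    # Step 2: add per-layer block_configs to the config.
    config = json.loads((output_dir / "config.json").read_text())
    config["block_configs"] = [{"attention": {}, "ffn": {}}] * config["num_hidden_layers"]
    (output_dir / "config.json").write_text(json.dumps(config))

    # Step 3 (not shown): reorganize weights into subblock checkpoints.
    copied = sorted(p.name for p in output_dir.iterdir())

assert copied == ["config.json", "tokenizer.json"]
```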

deci_x_patcher(model_descriptor, block_configs=None)

Context manager that patches decoder layer __init__ for heterogeneous per-layer configs.

This is the core mechanism that enables AnyModel to work with any HuggingFace model. It patches the decoder layer class(es) to read per-layer block_configs and apply layer-specific overrides (e.g., different intermediate_size per layer).

Parameters:
  • model_descriptor (ModelDescriptor) – The model descriptor that defines which classes to patch and how to map block_configs to layer overrides.

  • block_configs (List[BlockConfig | dict] | None) – Optional list of BlockConfig (one per layer). If not provided, will try to read from config.block_configs during model initialization.

Example

>>> with deci_x_patcher(LlamaModelDescriptor, block_configs):
...     model = AutoModelForCausalLM.from_config(config)
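The patching idea can be sketched independently of transformers as a context manager that temporarily wraps a class's __init__ to inject per-layer overrides. This is a simplified stand-in; the real deci_x_patcher reads block_configs and applies block_config_to_layer_overrides:

```python
from contextlib import contextmanager


@contextmanager
def patch_init_sketch(cls, per_layer_overrides):
    """Temporarily wrap cls.__init__ so each new instance picks up the
    overrides for its layer_idx. Simplified stand-in for deci_x_patcher."""
    original_init = cls.__init__

    def patched_init(self, config, layer_idx):
        # Apply the per-layer overrides on a copy of the config fields.
        overrides = per_layer_overrides[layer_idx]
        patched_config = type(config)(**{**vars(config), **overrides})
        original_init(self, patched_config, layer_idx)

    cls.__init__ = patched_init
    try:
        yield
    finally:
        cls.__init__ = original_init  # always restore the original


class Cfg:  # stand-in for a HuggingFace config
    def __init__(self, intermediate_size=14336):
        self.intermediate_size = intermediate_size


class DecoderLayerSketch:  # stand-in for a decoder layer class
    def __init__(self, config, layer_idx):
        self.intermediate_size = config.intermediate_size
        self.layer_idx = layer_idx


overrides = [{"intermediate_size": 14336}, {"intermediate_size": 8192}]
with patch_init_sketch(DecoderLayerSketch, overrides):
    layers = [DecoderLayerSketch(Cfg(), i) for i in range(2)]

assert [layer.intermediate_size for layer in layers] == [14336, 8192]
```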

return_tuple_of_size(cls, size)

Create a wrapper class that returns a tuple of the given size.

Useful for replacing modules that return multiple outputs (e.g., attention layers that return (hidden_states, attn_weights)).

Parameters:
  • cls (type[Module]) – The base module class to wrap.

  • size (int) – The size of the tuple to return.

Returns:

A new class that wraps the base class and returns a tuple of the given size.

Return type:

type[Module]

Example

>>> decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()
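A torch-free sketch of what return_tuple_of_size does, with a plain stand-in module class (the real version wraps nn.Module subclasses):

```python
def return_tuple_of_size_sketch(cls, size):
    """Wrap cls so calling it returns (output, None, ..., None) of length `size`."""

    class TupleWrapper(cls):
        def __call__(self, *args, **kwargs):
            out = super().__call__(*args, **kwargs)
            return (out,) + (None,) * (size - 1)

    TupleWrapper.__name__ = f"{cls.__name__}Tuple{size}"
    return TupleWrapper


class ZerosSketch:
    """Stand-in for MatchingZeros over plain lists."""

    def __call__(self, hidden_states, *args, **kwargs):
        return [0.0] * len(hidden_states)


# Mirrors the doctest above: the wrapped no-op now matches the
# (hidden_states, attn_weights) calling convention of an attention layer.
wrapped = return_tuple_of_size_sketch(ZerosSketch, size=2)()
hidden_states, attn_weights = wrapped([1.0, 2.0, 3.0])
assert hidden_states == [0.0, 0.0, 0.0]
assert attn_weights is None
```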