anymodel

AnyModel: Architecture-agnostic model compression for HuggingFace models.

This module provides a declarative approach to model compression that works with any HuggingFace model without requiring custom modeling code. Instead of duplicating HuggingFace modeling classes, AnyModel uses ModelDescriptors that define:

  1. Which decoder layer class(es) to patch for heterogeneous configs

  2. How to map BlockConfig to layer-specific overrides

  3. Weight name patterns for subblock checkpointing

Example usage:
>>> from modelopt.torch.puzzletron.anymodel import convert_model
>>> convert_model(
...     input_dir="path/to/hf_checkpoint",
...     output_dir="path/to/anymodel_checkpoint",
...     converter="llama",
... )

Supported models (see CLASS_MAPPING below for the registered converters):
  • llama: Llama 2, Llama 3, Llama 3.1, Llama 3.2

  • qwen2, qwen3, qwen3_vl

  • mistral_small

  • nemotron_h, nemotron_h_v2

  • gpt_oss

Classes

Converter

Base class for converting HuggingFace models to Puzzletron/AnyModel format.

ConverterFactory

Factory for registering and retrieving Converter classes.

ModelDescriptor

Base class describing a model architecture: which decoder layer class(es) to patch, how to map BlockConfig to layer overrides, and weight name patterns for subblock checkpointing.

ModelDescriptorFactory

Factory for registering and retrieving ModelDescriptor classes.

MatchingZeros

Module that returns zeros matching the input shape.

Same

Module that returns the input unchanged.

Functions

deci_x_patcher

Context manager that patches decoder layer __init__ for heterogeneous per-layer configs.

return_tuple_of_size

Create a wrapper class that returns a tuple of the given size.

convert_model

Convert a HuggingFace model to AnyModel format.

class Converter

Bases: ABC

Base class for converting HuggingFace models to Puzzletron/AnyModel format.

classmethod convert(descriptor, input_dir, output_dir)

Convert a HuggingFace model to AnyModel format.

Parameters:
  • descriptor (ModelDescriptor) – Model descriptor for the model type.

  • input_dir (Path) – Path to the input HuggingFace checkpoint.

  • output_dir (Path) – Path to the output AnyModel checkpoint.

classmethod convert_configs_in_dirs(input_dir, output_dir, trust_remote_code=False)

Convert config and add block_configs.

Parameters:
  • input_dir (Path)

  • output_dir (Path)

  • trust_remote_code (bool)

classmethod convert_model_weights(input_dir, output_dir, descriptor, num_hidden_layers)

Convert model weights to subblock format.

Parameters:
  • input_dir (Path)

  • output_dir (Path)

  • descriptor (ModelDescriptor)

  • num_hidden_layers (int)

static convert_weight_name(name)

Convert weight names during checkpoint conversion.

This method can be overridden by subclasses to apply model-specific weight name transformations when converting checkpoints from HuggingFace format to Puzzletron format.

Default implementation returns the name unchanged (identity function).

Parameters:

name (str) – Original weight name from HuggingFace checkpoint

Returns:

Converted weight name for Puzzletron format

Return type:

str

Example

For Qwen2.5-VL, this converts:

  • visual.* → model.visual.*

  • model.* → model.language_model.*
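The mapping above could be implemented roughly as follows. This is a sketch: the prefixes follow the Qwen2.5-VL example, and the real converter's edge-case handling may differ:

```python
def convert_weight_name(name: str) -> str:
    """Sketch of a model-specific override following the Qwen2.5-VL mapping:
    visual.* -> model.visual.* and model.* -> model.language_model.*."""
    if name.startswith("visual."):
        return "model." + name
    if name.startswith("model."):
        return "model.language_model." + name[len("model."):]
    return name  # default behavior: identity


assert convert_weight_name("visual.patch_embed.proj.weight") == "model.visual.patch_embed.proj.weight"
assert convert_weight_name("model.layers.0.mlp.down_proj.weight") == "model.language_model.layers.0.mlp.down_proj.weight"
assert convert_weight_name("lm_head.weight") == "lm_head.weight"
```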

static copy_checkpoint_files(input_dir, output_dir)

Copy checkpoint files except model weights (which will be converted).

Parameters:
  • input_dir (Path)

  • output_dir (Path)

abstract static create_block_configs_from_main_config(config)

Create per-layer BlockConfig list from a HuggingFace model config.

This method extracts layer-specific parameters (e.g., intermediate_size, num_key_value_heads) from the main model config and creates a BlockConfig for each layer. These BlockConfigs enable layer-specific pruning and modifications during the compression pipeline.

Parameters:

config (PretrainedConfig) – HuggingFace PretrainedConfig (e.g., LlamaConfig, Qwen2Config)

Returns:

List of BlockConfig, one per hidden layer. Each BlockConfig contains:

  • AttentionConfig: attention settings (no_op, num_key_value_heads)

  • FFNConfig: FFN settings (no_op, intermediate_size)

Return type:

List[BlockConfig]

Example

For a model with uniform layers (e.g., Llama):

>>> return [BlockConfig(…)] * config.num_hidden_layers

For a model with heterogeneous layers (e.g., NemotronH with Mamba/Attention):

>>> return [BlockConfig(…) for layer_idx in range(num_layers)]
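A self-contained sketch of the uniform-layer case. The dataclasses below are illustrative stand-ins, since the real BlockConfig/AttentionConfig/FFNConfig classes and their exact signatures live elsewhere in puzzletron:

```python
from dataclasses import dataclass
from types import SimpleNamespace


# Stand-ins for the puzzletron config classes (illustrative fields only).
@dataclass
class AttentionConfig:
    no_op: bool
    num_key_value_heads: int


@dataclass
class FFNConfig:
    no_op: bool
    intermediate_size: int


@dataclass
class BlockConfig:
    attention: AttentionConfig
    ffn: FFNConfig


def create_block_configs_from_main_config(config):
    """Uniform-layer case: every layer gets the same BlockConfig."""
    block = BlockConfig(
        attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
        ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
    )
    return [block] * config.num_hidden_layers


# A SimpleNamespace stands in for a HuggingFace PretrainedConfig.
cfg = SimpleNamespace(num_hidden_layers=4, num_key_value_heads=8, intermediate_size=14336)
blocks = create_block_configs_from_main_config(cfg)
assert len(blocks) == 4
assert blocks[0].ffn.intermediate_size == 14336
```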

class ConverterFactory

Bases: object

Factory for registering and retrieving Converter classes.

CLASS_MAPPING = {'gpt_oss': <class 'modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_converter.GptOssConverter'>, 'llama': <class 'modelopt.torch.puzzletron.anymodel.models.llama.llama_converter.LlamaConverter'>, 'mistral_small': <class 'modelopt.torch.puzzletron.anymodel.models.mistral_small.mistral_small_converter.MistralSmallConverter'>, 'nemotron_h': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h.nemotron_h_converter.NemotronHConverter'>, 'nemotron_h_v2': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2.nemotron_h_v2_converter.NemotronHV2Converter'>, 'qwen2': <class 'modelopt.torch.puzzletron.anymodel.models.qwen2.qwen2_converter.Qwen2Converter'>, 'qwen3': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_8b.qwen3_8b_converter.Qwen3_8BConverter'>, 'qwen3_vl': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_vl_30b_a3b_instruct.qwen3_vl_30b_a3b_instruct_converter.Qwen3VL30BA3BInstructConverter'>}

classmethod get(value)

Get a registered converter by name or return the converter if already resolved.

Parameters:

value (str | ModelDescriptor)

classmethod register(**entries)

Register converter classes.

Raises:

KeyError – if entry key is already in type_dict and points to a different class.

Parameters:

entries (Type)

classmethod register_decorator(name)

Set up a register decorator.

Parameters:

name (str | None) – If specified, the decorated object will be registered with this name.

Returns:

Decorator that registers the callable.

Return type:

Callable
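The register/get/register_decorator behavior documented above can be sketched as a simplified standalone factory (not the actual modelopt implementation):

```python
class FactorySketch:
    """Minimal register/get factory mirroring the documented behavior."""

    CLASS_MAPPING: dict = {}

    @classmethod
    def register(cls, **entries):
        for name, entry in entries.items():
            existing = cls.CLASS_MAPPING.get(name)
            if existing is not None and existing is not entry:
                # Mirrors the documented KeyError on conflicting registration.
                raise KeyError(f"{name!r} already registered to a different class")
            cls.CLASS_MAPPING[name] = entry

    @classmethod
    def get(cls, value):
        # Look up strings by name; return already-resolved classes as-is.
        if isinstance(value, str):
            return cls.CLASS_MAPPING[value]
        return value

    @classmethod
    def register_decorator(cls, name=None):
        def decorator(obj):
            cls.register(**{name or obj.__name__: obj})
            return obj
        return decorator


@FactorySketch.register_decorator("llama")
class LlamaConverterSketch:  # hypothetical converter class
    pass


assert FactorySketch.get("llama") is LlamaConverterSketch
assert FactorySketch.get(LlamaConverterSketch) is LlamaConverterSketch
```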

class MatchingZeros

Bases: Module

Module that returns zeros matching the input shape.

Used to replace MLP or attention layers with no-ops. Returns zeros because the hidden_states are added to the residuals, so a no-op implementation should leave the residual unchanged.

forward(hidden_states, *args, **kwargs)
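The residual-stream reasoning can be illustrated without torch: if the replaced subblock returns zeros, adding its output back to the residual is an identity. A toy sketch with plain lists standing in for tensors (the real module returns zeros shaped like the input tensor):

```python
class MatchingZerosSketch:
    """Toy stand-in for MatchingZeros: returns zeros shaped like the input."""

    def __call__(self, hidden_states, *args, **kwargs):
        return [0.0] * len(hidden_states)


residual = [0.5, -1.25, 3.0]
subblock_out = MatchingZerosSketch()(residual)

# The decoder layer computes residual + subblock(hidden_states);
# with zeros, the residual passes through unchanged.
new_residual = [r + o for r, o in zip(residual, subblock_out)]
assert new_residual == residual
```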

class ModelDescriptor

Bases: ABC

Base class that describes a model architecture for AnyModel: which decoder layer class(es) to patch, how to map BlockConfig to layer-specific overrides, and how to group weights for subblock checkpointing.

static attn_no_op_post_init(decoder_layer)

Post-init callback to alter a decoder layer so that Attention subblock performs as no-op.

It is recommended to use the utility modules from no_op.py to replace layers with dummy counterparts.

Example for replacing a layernorm layer with identity:

>>> decoder_layer.post_attention_layernorm = Same()

Example for replacing an attention layer with zeros:

>>> decoder_layer.self_attn = MatchingZeros()

If the attention layer returns multiple outputs, i.e., hidden_states, _ = self.self_attn(), use the util method return_tuple_of_size to return trailing None values:

>>> decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()

Parameters:

decoder_layer (Module)

classmethod attn_no_op_supported()

Check whether attn_no_op_post_init is overridden for attention no-op support.

abstract static block_config_to_layer_overrides(block_config)

Map between BlockConfig and layer config overrides.

These overrides are consumed by a specific decoder layer and by the whole model. Usage can be seen in deci_x_patcher under the method _patched_decoder_layer_init.

Example implementation to override the FFN intermediate size of a block:
>>> def block_config_to_layer_overrides(block_config: BlockConfig) -> Dict[str, Any]:
...     return {"intermediate_size": block_config.ffn.intermediate_size}

Parameters:

block_config (BlockConfig)

Return type:

Dict[str, Any]

classmethod create_dummy_block(original_layer, block_index)

Create a dummy block to replace a layer for sharded model initialization.

Parameters:
  • original_layer (Module)

  • block_index (int)

Return type:

Module

abstract static decoder_layer_cls()

Decoder layer class types to patch for heterogeneous config support.

In most cases this class will hold as attributes both FFN & attention layers.

Returns:

nn.Module class type or a list if several class types should be patched.

Return type:

Type[Module] | List[Type[Module]]

abstract static final_norm_name()

Return the name of the final normalization layer.

static get_language_model_config(config)

Get the language model config from a PretrainedConfig.

For regular LM models, returns the config itself. For VL/multimodal models with nested configs, override to return the language model portion (e.g., config.text_config for Qwen-VL).

classmethod get_weight_groups(layer_names, num_hidden_layers)

Group model weights to support the puzzle subblock checkpointing format.

This method uses the abstract method layer_name_predicates by default.

Parameters:
  • layer_names (Iterable[str]) – state_dict layer names of the model.

  • num_hidden_layers (int) – number of decoder layers in the model.

Returns:

Dictionary of group names to list of layer names per group, e.g.

>>> {
...     "embedding": ["model.embed_tokens.weight"],
...     "lm_head": ["lm_head.weight", "model.norm.weight"],
...     "block_0_ffn": ["model.layers.0.mlp.down_proj", ...],
...     "block_0_attention": ["model.layers.0.self_attn.q_proj", ...],
... }

Return type:

Dict[str, List[str]]

abstract static init_rotary_embedding(model, runtime)

Re-initialize the rotary embeddings based on an existing model.

In puzzletron we initialize a sharded model by first creating a meta model, then moving it to the actual device by loading the state_dict with the real weights.

Rotary embedding frequencies are tensor buffers that are created dynamically during __init__ and are not part of the model state_dict, so they cannot be restored after a meta-device initialization.

abstract static input_embedding_name()

Return the name of the input embedding layer.

abstract static layer_block_name(index)

Return the name of the decoder layer at the given index.

Parameters:

index (int)

abstract static layer_name_predicates(num_layers)

Return predicates for grouping model weights to support subblock checkpointing.

For every group name, return a regex predicate that determines whether a layer name belongs to the group.

Returns:

Dictionary of group name to regex pattern predicate.

Parameters:

num_layers (int)

Return type:

Dict[str, Pattern]
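A sketch of layer_name_predicates for Llama-style weight names, together with the default grouping that get_weight_groups performs on top of it. The regex patterns and group names are illustrative, following the get_weight_groups example above:

```python
import re


def layer_name_predicates(num_layers):
    """Sketch: regex predicates for Llama-style weight names."""
    predicates = {
        "embedding": re.compile(r"^model\.embed_tokens\."),
        "lm_head": re.compile(r"^(lm_head\.|model\.norm\.)"),
    }
    for i in range(num_layers):
        predicates[f"block_{i}_attention"] = re.compile(rf"^model\.layers\.{i}\.(self_attn|input_layernorm)\.")
        predicates[f"block_{i}_ffn"] = re.compile(rf"^model\.layers\.{i}\.(mlp|post_attention_layernorm)\.")
    return predicates


def get_weight_groups(layer_names, num_hidden_layers):
    """Default grouping: assign each weight to the first matching predicate."""
    predicates = layer_name_predicates(num_hidden_layers)
    groups = {name: [] for name in predicates}
    for weight in layer_names:
        for group, pattern in predicates.items():
            if pattern.match(weight):
                groups[group].append(weight)
                break
    return groups


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
groups = get_weight_groups(names, num_hidden_layers=1)
assert groups["embedding"] == ["model.embed_tokens.weight"]
assert groups["block_0_attention"] == ["model.layers.0.self_attn.q_proj.weight"]
assert groups["lm_head"] == ["model.norm.weight", "lm_head.weight"]
```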

static mlp_no_op_post_init(decoder_layer)

Post-init callback to alter a decoder layer so that FFN/mlp subblock performs as no-op.

It is recommended to use the utility modules from no_op.py to replace layers with dummy counterparts.

Example for replacing a layernorm layer with identity:

>>> decoder_layer.post_attention_layernorm = Same()

Example for replacing an MLP layer with zeros (zeros because hidden_states are added to the residual, so a no-op implementation leaves the residual unchanged):

>>> decoder_layer.mlp = MatchingZeros()

If the MLP layer to replace returns multiple outputs, i.e., hidden_states, _ = self.mlp(), use the util method return_tuple_of_size to return trailing None values:

>>> decoder_layer.mlp = return_tuple_of_size(MatchingZeros, size=2)()

Parameters:

decoder_layer (Module)

classmethod mlp_no_op_supported()

Check whether mlp_no_op_post_init is overridden for mlp no-op support.

Return type:

bool

abstract static output_embedding_name()

Return the name of the output embedding layer.

static requires_trust_remote_code()

Whether this model descriptor requires trust_remote_code=True for loading.

Models that use custom code (e.g., via auto_map in config) should override this to return True.

Returns:

True if trust_remote_code=True is required, False otherwise.

Return type:

bool

static uses_autocast()

Whether this model supports torch.autocast.

Some models (e.g., Qwen3-VL MoE) have dtype bugs under autocast. Override and return False for models that do not support autocast.

Return type:

bool

class ModelDescriptorFactory

Bases: object

Factory for registering and retrieving ModelDescriptor classes.

CLASS_MAPPING = {'gpt_oss': <class 'modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_model_descriptor.GptOssModelDescriptor'>, 'llama': <class 'modelopt.torch.puzzletron.anymodel.models.llama.llama_model_descriptor.LlamaModelDescriptor'>, 'mistral_small': <class 'modelopt.torch.puzzletron.anymodel.models.mistral_small.mistral_small_model_descriptor.MistralSmallModelDescriptor'>, 'nemotron_h': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h.nemotron_h_model_descriptor.NemotronHModelDescriptor'>, 'nemotron_h_v2': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2.nemotron_h_v2_model_descriptor.NemotronHV2ModelDescriptor'>, 'qwen2': <class 'modelopt.torch.puzzletron.anymodel.models.qwen2.qwen2_model_descriptor.Qwen2ModelDescriptor'>, 'qwen3': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_8b.qwen3_8b_model_descriptor.Qwen3_8BModelDescriptor'>, 'qwen3_vl': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_vl_30b_a3b_instruct.qwen3_vl_30b_a3b_instruct_model_descriptor.Qwen3VL30BA3BInstructModelDescriptor'>}

classmethod get(value)

Get a registered model descriptor by name or return the descriptor if already resolved.

Parameters:

value (str | ModelDescriptor)

classmethod register(**entries)

Register model descriptor classes.

Raises:

KeyError – if entry key is already in type_dict and points to a different class.

Parameters:

entries (Type)

classmethod register_decorator(name)

Set up a register decorator.

Parameters:

name (str | None) – If specified, the decorated object will be registered with this name.

Returns:

Decorator that registers the callable.

Return type:

Callable

class Same

Bases: Module

Module that returns the input unchanged.

Used to replace normalization layers with identity operations.

forward(hidden_states, *args, **kwargs)

property weight

Supports NemotronH with scoring_activations, where the lm_head dtype is accessed as self.lm_head.weight.dtype.

convert_model(input_dir, output_dir, converter)

Convert a HuggingFace model to AnyModel format.

This function converts a HuggingFace checkpoint to the AnyModel format used for compression. The conversion process:

  1. Copies non-weight files (config, tokenizer, etc.)

  2. Creates block_configs for each layer

  3. Reorganizes weights into subblock checkpoints

Parameters:
  • input_dir (str) – Path to the input HuggingFace checkpoint directory.

  • output_dir (str) – Path to the output AnyModel checkpoint directory.

  • converter (Converter | str) – Either a converter name (e.g., “llama”) or a Converter class.

Example

>>> convert_model(
...     input_dir="/path/to/Llama-3.1-8B-Instruct",
...     output_dir="/path/to/output/ckpts/teacher",
...     converter="llama",
... )
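The three steps above can be exercised on a toy checkpoint directory. This is an illustrative sketch, not the real converter: the file names are hypothetical, and the real pipeline also handles safetensors indices and the subblock weight reorganization:

```python
import json
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "hf_checkpoint"
    output_dir = Path(tmp) / "anymodel_checkpoint"
    input_dir.mkdir()
    output_dir.mkdir()
    (input_dir / "config.json").write_text(json.dumps({"num_hidden_layers": 2}))
    (input_dir / "tokenizer.json").write_text("{}")
    (input_dir / "model.safetensors").write_text("fake-weights")

    # Step 1: copy non-weight files (config, tokenizer, ...).
    for f in input_dir.iterdir():
        if f.suffix != ".safetensors":
            shutil.copy(f, output_dir / f.name)

    # Step 2: add per-layer block_configs to the config.
    config = json.loads((output_dir / "config.json").read_text())
    config["block_configs"] = [{"attention": {}, "ffn": {}}] * config["num_hidden_layers"]
    (output_dir / "config.json").write_text(json.dumps(config))

    # Step 3 (not shown): reorganize weights into subblock checkpoints.
    copied = sorted(p.name for p in output_dir.iterdir())

assert copied == ["config.json", "tokenizer.json"]
```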

deci_x_patcher(model_descriptor, block_configs=None)

Context manager that patches decoder layer __init__ for heterogeneous per-layer configs.

This is the core mechanism that enables AnyModel to work with any HuggingFace model. It patches the decoder layer class(es) to read per-layer block_configs and apply layer-specific overrides (e.g., different intermediate_size per layer).

Parameters:
  • model_descriptor (ModelDescriptor) – The model descriptor that defines which classes to patch and how to map block_configs to layer overrides.

  • block_configs (List[BlockConfig | dict] | None) – Optional list of BlockConfig (one per layer). If not provided, will try to read from config.block_configs during model initialization.

Example

>>> with deci_x_patcher(LlamaModelDescriptor, block_configs):
...     model = AutoModelForCausalLM.from_config(config)
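The patching idea can be sketched independently of transformers as a context manager that temporarily wraps a class's __init__ to inject per-layer overrides. This is a simplified stand-in; the real deci_x_patcher reads block_configs and applies block_config_to_layer_overrides:

```python
from contextlib import contextmanager


@contextmanager
def patch_init_sketch(cls, per_layer_overrides):
    """Temporarily wrap cls.__init__ so each new instance picks up the
    overrides for its layer_idx. Simplified stand-in for deci_x_patcher."""
    original_init = cls.__init__

    def patched_init(self, config, layer_idx):
        # Apply the per-layer overrides on a copy of the config fields.
        overrides = per_layer_overrides[layer_idx]
        patched_config = type(config)(**{**vars(config), **overrides})
        original_init(self, patched_config, layer_idx)

    cls.__init__ = patched_init
    try:
        yield
    finally:
        cls.__init__ = original_init  # always restore the original


class Cfg:  # stand-in for a HuggingFace config
    def __init__(self, intermediate_size=14336):
        self.intermediate_size = intermediate_size


class DecoderLayerSketch:  # stand-in for a decoder layer class
    def __init__(self, config, layer_idx):
        self.intermediate_size = config.intermediate_size
        self.layer_idx = layer_idx


overrides = [{"intermediate_size": 14336}, {"intermediate_size": 8192}]
with patch_init_sketch(DecoderLayerSketch, overrides):
    layers = [DecoderLayerSketch(Cfg(), i) for i in range(2)]

assert [layer.intermediate_size for layer in layers] == [14336, 8192]
```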

return_tuple_of_size(cls, size)

Create a wrapper class that returns a tuple of the given size.

Useful for replacing modules that return multiple outputs (e.g., attention layers that return (hidden_states, attn_weights)).

Parameters:
  • cls (type[Module]) – The base module class to wrap.

  • size (int) – The size of the tuple to return.

Returns:

A new class that wraps the base class and returns a tuple of the given size.

Return type:

type[Module]

Example

>>> decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()
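A torch-free sketch of what return_tuple_of_size does, with a plain stand-in module class (the real version wraps nn.Module subclasses):

```python
def return_tuple_of_size_sketch(cls, size):
    """Wrap cls so calling it returns (output, None, ..., None) of length `size`."""

    class TupleWrapper(cls):
        def __call__(self, *args, **kwargs):
            out = super().__call__(*args, **kwargs)
            return (out,) + (None,) * (size - 1)

    TupleWrapper.__name__ = f"{cls.__name__}Tuple{size}"
    return TupleWrapper


class ZerosSketch:
    """Stand-in for MatchingZeros over plain lists."""

    def __call__(self, hidden_states, *args, **kwargs):
        return [0.0] * len(hidden_states)


# Mirrors the doctest above: the wrapped no-op now matches the
# (hidden_states, attn_weights) calling convention of an attention layer.
wrapped = return_tuple_of_size_sketch(ZerosSketch, size=2)()
hidden_states, attn_weights = wrapped([1.0, 2.0, 3.0])
assert hidden_states == [0.0, 0.0, 0.0]
assert attn_weights is None
```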