anymodel
AnyModel: Architecture-agnostic model compression for HuggingFace models.
This module provides a declarative approach to model compression that works with any HuggingFace model without requiring custom modeling code. Instead of duplicating HuggingFace modeling classes, AnyModel uses ModelDescriptors that define:
Which decoder layer class(es) to patch for heterogeneous configs
How to map BlockConfig to layer-specific overrides
Weight name patterns for subblock checkpointing
- Example usage:
>>> from modelopt.torch.puzzletron.anymodel import convert_model
>>> convert_model(
...     input_dir="path/to/hf_checkpoint",
...     output_dir="path/to/anymodel_checkpoint",
...     converter="llama",
... )
- Supported models:
llama: Llama 2, Llama 3, Llama 3.1, Llama 3.2
(more to come: qwen2, mistral_small, etc.)
Classes
- Converter – Base class for converting HuggingFace models to Puzzletron/AnyModel format.
- ConverterFactory – Factory for registering and retrieving Converter classes.
- ModelDescriptorFactory – Factory for registering and retrieving ModelDescriptor classes.
- MatchingZeros – Module that returns zeros matching the input shape.
- Same – Module that returns the input unchanged.
Functions
- deci_x_patcher – Context manager that patches decoder layer __init__ for heterogeneous per-layer configs.
- return_tuple_of_size – Create a wrapper class that returns a tuple of the given size.
- convert_model – Convert a HuggingFace model to AnyModel format.
- class Converter
Bases: ABC

Base class for converting HuggingFace models to Puzzletron/AnyModel format.
- classmethod convert(descriptor, input_dir, output_dir)
Convert a HuggingFace model to AnyModel format.
- Parameters:
descriptor (ModelDescriptor) – Model descriptor for the model type.
input_dir (Path) – Path to the input HuggingFace checkpoint.
output_dir (Path) – Path to the output AnyModel checkpoint.
- classmethod convert_configs_in_dirs(input_dir, output_dir, trust_remote_code=False)
Convert config and add block_configs.
- Parameters:
input_dir (Path)
output_dir (Path)
trust_remote_code (bool)
- classmethod convert_model_weights(input_dir, output_dir, descriptor, num_hidden_layers)
Convert model weights to subblock format.
- Parameters:
input_dir (Path)
output_dir (Path)
descriptor (ModelDescriptor)
num_hidden_layers (int)
- static convert_weight_name(name)
Convert weight names during checkpoint conversion.
This method can be overridden by subclasses to apply model-specific weight name transformations when converting checkpoints from HuggingFace format to Puzzletron format.
Default implementation returns the name unchanged (identity function).
- Parameters:
name (str) – Original weight name from HuggingFace checkpoint
- Returns:
Converted weight name for Puzzletron format
- Return type:
str
Example
For Qwen2.5-VL, this converts:
visual.* → model.visual.*
model.* → model.language_model.*
- static copy_checkpoint_files(input_dir, output_dir)
Copy checkpoint files except model weights (which will be converted).
- Parameters:
input_dir (Path)
output_dir (Path)
- abstract static create_block_configs_from_main_config(config)
Create per-layer BlockConfig list from a HuggingFace model config.
This method extracts layer-specific parameters (e.g., intermediate_size, num_key_value_heads) from the main model config and creates a BlockConfig for each layer. These BlockConfigs enable layer-specific pruning and modifications during the compression pipeline.
- Parameters:
config (PretrainedConfig) – HuggingFace PretrainedConfig (e.g., LlamaConfig, Qwen2Config)
- Returns:
List of BlockConfig, one per hidden layer. Each BlockConfig contains:
AttentionConfig: attention settings (no_op, num_key_value_heads)
FFNConfig: FFN settings (no_op, intermediate_size)
- Return type:
List[BlockConfig]
Example
- For a model with uniform layers (e.g., Llama):
return [BlockConfig(…)] * config.num_hidden_layers
- For a model with heterogeneous layers (e.g., NemotronH with Mamba/Attention):
return [BlockConfig(…) for layer_idx in range(num_layers)]
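A uniform-layer implementation of the pattern above can be sketched with simplified stand-ins for the real config dataclasses. The field names (`no_op`, `num_key_value_heads`, `intermediate_size`) follow this docstring, but these classes are assumptions, not the shipped BlockConfig types:

```python
from dataclasses import dataclass
from typing import Optional


# Simplified stand-ins for the real AttentionConfig/FFNConfig/BlockConfig.
@dataclass
class AttentionConfig:
    no_op: bool = False
    num_key_value_heads: Optional[int] = None


@dataclass
class FFNConfig:
    no_op: bool = False
    intermediate_size: Optional[int] = None


@dataclass
class BlockConfig:
    attention: AttentionConfig
    ffn: FFNConfig


def create_block_configs_from_main_config(config):
    """One BlockConfig per hidden layer, read off a uniform main config."""
    return [
        BlockConfig(
            attention=AttentionConfig(num_key_value_heads=config.num_key_value_heads),
            ffn=FFNConfig(intermediate_size=config.intermediate_size),
        )
        for _ in range(config.num_hidden_layers)
    ]


# Minimal fake config standing in for a LlamaConfig.
class FakeConfig:
    num_hidden_layers = 4
    num_key_value_heads = 8
    intermediate_size = 14336


blocks = create_block_configs_from_main_config(FakeConfig())
print(len(blocks), blocks[0].ffn.intermediate_size)  # 4 14336
```

A heterogeneous model would instead branch on per-layer metadata inside the list comprehension, producing a distinct BlockConfig per layer index.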
- class ConverterFactory
Bases: object

Factory for registering and retrieving Converter classes.
- CLASS_MAPPING = {'gpt_oss': <class 'modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_converter.GptOssConverter'>, 'llama': <class 'modelopt.torch.puzzletron.anymodel.models.llama.llama_converter.LlamaConverter'>, 'mistral_small': <class 'modelopt.torch.puzzletron.anymodel.models.mistral_small.mistral_small_converter.MistralSmallConverter'>, 'nemotron_h': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h.nemotron_h_converter.NemotronHConverter'>, 'nemotron_h_v2': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2.nemotron_h_v2_converter.NemotronHV2Converter'>, 'qwen2': <class 'modelopt.torch.puzzletron.anymodel.models.qwen2.qwen2_converter.Qwen2Converter'>, 'qwen3': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3.qwen3_converter.Qwen3Converter'>, 'qwen3_vl': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_vl.qwen3_vl_converter.Qwen3VLConverter'>}
- classmethod get(value)
Get a registered converter by name or return the converter if already resolved.
- Parameters:
value (str | ModelDescriptor)
- classmethod register(**entries)
Register converter classes.
- Raises:
KeyError – if entry key is already in type_dict and points to a different class.
- Parameters:
entries (Type)
- classmethod register_decorator(name)
Set up a register decorator.
- Parameters:
name (str | None) – If specified, the decorated object will be registered with this name.
- Returns:
Decorator that registers the callable.
- Return type:
Callable
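The registration contract above (CLASS_MAPPING, register raising KeyError on conflicting re-registration, get accepting a name or an already-resolved class, and register_decorator) can be sketched as a minimal re-implementation. This is an illustration of the pattern, not the shipped ConverterFactory:

```python
class FactorySketch:
    """Minimal sketch of the register/get/register_decorator factory pattern."""

    CLASS_MAPPING = {}

    @classmethod
    def register(cls, **entries):
        for name, klass in entries.items():
            existing = cls.CLASS_MAPPING.get(name)
            # Re-registering the same class is a no-op; a different class is an error.
            if existing is not None and existing is not klass:
                raise KeyError(f"{name!r} is already registered to {existing!r}")
            cls.CLASS_MAPPING[name] = klass

    @classmethod
    def get(cls, value):
        # Accept either a registered name or an already-resolved class.
        if isinstance(value, str):
            return cls.CLASS_MAPPING[value]
        return value

    @classmethod
    def register_decorator(cls, name=None):
        def decorator(klass):
            cls.register(**{name or klass.__name__: klass})
            return klass
        return decorator


@FactorySketch.register_decorator("my_model")
class MyConverter:
    pass


print(FactorySketch.get("my_model") is MyConverter)  # True
```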
- class MatchingZeros
Bases: Module

Module that returns zeros matching the input shape.
Used to replace MLP or attention layers with no-ops. Returns zeros because the hidden_states are added to the residuals, so a no-op implementation should leave the residual unchanged.
- forward(hidden_states, *args, **kwargs)
- class ModelDescriptor
Bases: ABC

- static attn_no_op_post_init(decoder_layer)
Post-init callback to alter a decoder layer so that Attention subblock performs as no-op.
It is recommended to use the utility modules from no_op.py to replace layers with dummy counterparts.
Example for replacing a layernorm layer with identity:
>>> decoder_layer.post_attention_layernorm = Same()
Example for replacing an attention layer with zeroes:
>>> decoder_layer.self_attn = MatchingZeros()
In case the attention layer returns multiple outputs, i.e., hidden_states, _ = self.self_attn(), use the util method return_tuple_of_size to return trailing None values:
>>> decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()
- Parameters:
decoder_layer (Module)
- classmethod attn_no_op_supported()
Check whether attn_no_op_post_init is overridden for attention no-op support.
- abstract static block_config_to_layer_overrides(block_config)
Map between BlockConfig and layer config overrides.
These overrides are consumed by a specific decoder layer and by the whole model. Usage can be seen in deci_x_patcher under the method _patched_decoder_layer_init.
- Example implementation to override the FFN intermediate size of a block:
>>> def block_config_to_layer_overrides(block_config: BlockConfig) -> Dict[str, Any]:
...     return {"intermediate_size": block_config.ffn.intermediate_size}
- Parameters:
block_config (BlockConfig)
- Return type:
Dict[str, Any]
- classmethod create_dummy_block(original_layer, block_index)
Create a dummy block to replace a layer for sharded model initialization.
- Parameters:
original_layer (Module)
block_index (int)
- Return type:
Module
- abstract static decoder_layer_cls()
Decoder layer class types to patch for heterogeneous config support.
In most cases this class holds both the FFN and attention layers as attributes.
- Returns:
nn.Module class type or a list if several class types should be patched.
- Return type:
Type[Module] | List[Type[Module]]
- abstract static final_norm_name()
Return the name of the final normalization layer.
- static get_language_model_config(config)
Get the language model config from a PretrainedConfig.
For regular LM models, returns the config itself. For VL/multimodal models with nested configs, override to return the language model portion (e.g., config.text_config for Qwen-VL).
- classmethod get_weight_groups(layer_names, num_hidden_layers)
Group model weights to support the puzzle subblock checkpointing format.
This method uses the abstract method layer_name_predicates by default.
- Parameters:
layer_names (Iterable[str]) – state_dict layer names of the model.
num_hidden_layers (int) – number of decoder layers in the model.
- Returns:
Dictionary of group names to list of layer names per group, e.g.:
>>> {
...     "embedding": ["model.embed_tokens.weight"],
...     "lm_head": ["lm_head.weight", "model.norm.weight"],
...     "block_0_ffn": ["model.layers.0.mlp.down_proj", ...],
...     "block_0_attention": ["model.layers.0.self_attn.q_proj", ...],
... }
- Return type:
Dict[str, List[str]]
- abstract static init_rotary_embedding(model, runtime)
Re-initialize the rotary embeddings based on an existing model.
In Puzzletron, a sharded model is initialized by first creating a meta-device model and then moving it to the actual device by loading the state_dict with the real weights.
Rotary embedding frequencies are tensor buffers that are created dynamically during __init__ and are not part of the model state_dict, so they cannot be restored after a meta-device initialization.
- abstract static input_embedding_name()
Return the name of the input embedding layer.
- abstract static layer_block_name(index)
Return the name of the decoder layer at the given index.
- Parameters:
index (int)
- abstract static layer_name_predicates(num_layers)
Return predicates for grouping model weights to support subblock checkpointing.
For every group name, return a regex predicate that determines whether a layer name belongs to the group.
- Returns:
Dictionary of group name to regex pattern predicate.
- Parameters:
num_layers (int)
- Return type:
Dict[str, Pattern]
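The predicate-based grouping described by layer_name_predicates and get_weight_groups can be sketched with plain regexes. The patterns below assume a Llama-style state_dict layout (they are illustrative, not the shipped implementation):

```python
import re


def layer_name_predicates(num_layers):
    """Map each weight group to a regex predicate (Llama-style names assumed)."""
    predicates = {
        "embedding": re.compile(r"^model\.embed_tokens\."),
        "lm_head": re.compile(r"^(lm_head\.|model\.norm\.)"),
    }
    for i in range(num_layers):
        predicates[f"block_{i}_ffn"] = re.compile(rf"^model\.layers\.{i}\.mlp\.")
        predicates[f"block_{i}_attention"] = re.compile(
            rf"^model\.layers\.{i}\.(self_attn|input_layernorm|post_attention_layernorm)\."
        )
    return predicates


def get_weight_groups(layer_names, num_hidden_layers):
    """Assign each state_dict name to the first group whose predicate matches."""
    predicates = layer_name_predicates(num_hidden_layers)
    groups = {name: [] for name in predicates}
    for layer_name in layer_names:
        for group, pattern in predicates.items():
            if pattern.search(layer_name):
                groups[group].append(layer_name)
                break
    return groups


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "lm_head.weight",
    "model.norm.weight",
]
groups = get_weight_groups(names, num_hidden_layers=1)
print(groups["block_0_attention"])  # ['model.layers.0.self_attn.q_proj.weight']
```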
- static mlp_no_op_post_init(decoder_layer)
Post-init callback to alter a decoder layer so that FFN/mlp subblock performs as no-op.
It is recommended to use the utility modules from no_op.py to replace layers with dummy counterparts.
Example for replacing a layernorm layer with identity:
>>> decoder_layer.post_attention_layernorm = Same()
Example for replacing an MLP layer with zeros (zeros because the hidden_states are added to the residual, so a no-op implementation leaves the residual unchanged):
>>> decoder_layer.mlp = MatchingZeros()
In case the MLP layer to replace returns multiple outputs, i.e., hidden_states, _ = self.mlp(), use the util method return_tuple_of_size to return trailing None values:
>>> decoder_layer.mlp = return_tuple_of_size(MatchingZeros, size=2)()
- Parameters:
decoder_layer (Module)
- classmethod mlp_no_op_supported()
Check whether mlp_no_op_post_init is overridden for mlp no-op support.
- Return type:
bool
- abstract static output_embedding_name()
Return the name of the output embedding layer.
- static requires_trust_remote_code()
Whether this model descriptor requires trust_remote_code=True for loading.
Models that use custom code (e.g., via auto_map in config) should override this to return True.
- Returns:
True if trust_remote_code=True is required, False otherwise.
- Return type:
bool
- static uses_autocast()
Whether this model supports torch.autocast.
Some models (e.g., Qwen3-VL MoE) have dtype bugs under autocast. Override and return False for models that do not support autocast.
- Return type:
bool
- class ModelDescriptorFactory
Bases: object

Factory for registering and retrieving ModelDescriptor classes.
- CLASS_MAPPING = {'gpt_oss': <class 'modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_model_descriptor.GptOssModelDescriptor'>, 'llama': <class 'modelopt.torch.puzzletron.anymodel.models.llama.llama_model_descriptor.LlamaModelDescriptor'>, 'mistral_small': <class 'modelopt.torch.puzzletron.anymodel.models.mistral_small.mistral_small_model_descriptor.MistralSmallModelDescriptor'>, 'nemotron_h': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h.nemotron_h_model_descriptor.NemotronHModelDescriptor'>, 'nemotron_h_v2': <class 'modelopt.torch.puzzletron.anymodel.models.nemotron_h_v2.nemotron_h_v2_model_descriptor.NemotronHV2ModelDescriptor'>, 'qwen2': <class 'modelopt.torch.puzzletron.anymodel.models.qwen2.qwen2_model_descriptor.Qwen2ModelDescriptor'>, 'qwen3': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3.qwen3_model_descriptor.Qwen3ModelDescriptor'>, 'qwen3_vl': <class 'modelopt.torch.puzzletron.anymodel.models.qwen3_vl.qwen3_vl_model_descriptor.Qwen3VLModelDescriptor'>}
- classmethod get(value)
Get a registered model descriptor by name or return the descriptor if already resolved.
- Parameters:
value (str | ModelDescriptor)
- classmethod register(**entries)
Register model descriptor classes.
- Raises:
KeyError – if entry key is already in type_dict and points to a different class.
- Parameters:
entries (Type)
- classmethod register_decorator(name)
Set up a register decorator.
- Parameters:
name (str | None) – If specified, the decorated object will be registered with this name.
- Returns:
Decorator that registers the callable.
- Return type:
Callable
- class Same
Bases: Module

Module that returns the input unchanged.
Used to replace normalization layers with identity operations.
- forward(hidden_states, *args, **kwargs)
- property weight
Supports NemotronH with scoring_activations, where the lm_head weight is accessed as self.lm_head.weight.dtype.
- convert_model(input_dir, output_dir, converter)
Convert a HuggingFace model to AnyModel format.
This function converts a HuggingFace checkpoint to the AnyModel format used for compression. The conversion process:
Copies non-weight files (config, tokenizer, etc.)
Creates block_configs for each layer
Reorganizes weights into subblock checkpoints
- Parameters:
input_dir (str) – Path to the input HuggingFace checkpoint directory.
output_dir (str) – Path to the output AnyModel checkpoint directory.
converter (Converter | str) – Either a converter name (e.g., “llama”) or a Converter class.
Example
>>> convert_model(
...     input_dir="/path/to/Llama-3.1-8B-Instruct",
...     output_dir="/path/to/output/ckpts/teacher",
...     converter="llama",
... )
- deci_x_patcher(model_descriptor, block_configs=None)
Context manager that patches decoder layer __init__ for heterogeneous per-layer configs.
This is the core mechanism that enables AnyModel to work with any HuggingFace model. It patches the decoder layer class(es) to read per-layer block_configs and apply layer-specific overrides (e.g., different intermediate_size per layer).
- Parameters:
model_descriptor (ModelDescriptor) – The model descriptor that defines which classes to patch and how to map block_configs to layer overrides.
block_configs (List[BlockConfig | dict] | None) – Optional list of BlockConfig (one per layer). If not provided, will try to read from config.block_configs during model initialization.
Example
>>> with deci_x_patcher(LlamaModelDescriptor, block_configs):
...     model = AutoModelForCausalLM.from_config(config)
- return_tuple_of_size(cls, size)
Create a wrapper class that returns a tuple of the given size.
Useful for replacing modules that return multiple outputs (e.g., attention layers that return (hidden_states, attn_weights)).
- Parameters:
cls (type[Module]) – The base module class to wrap.
size (int) – The size of the tuple to return.
- Returns:
A new class that wraps the base class and returns a tuple of the given size.
- Return type:
type[Module]
Example
>>> decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()
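One way return_tuple_of_size could work is sketched below: wrap a module-like class so its output becomes the first element of a size-N tuple padded with trailing None values. This is an illustration of the documented behavior, not the shipped utility, and it uses a plain callable instead of nn.Module to stay dependency-free:

```python
def return_tuple_of_size(cls, size):
    """Wrap cls so calling an instance returns (output, None, ..., None) of length size."""

    class TupleWrapper(cls):
        def __call__(self, *args, **kwargs):
            out = super().__call__(*args, **kwargs)
            return (out,) + (None,) * (size - 1)

    TupleWrapper.__name__ = f"{cls.__name__}TupleOf{size}"
    return TupleWrapper


# Stand-in for MatchingZeros: zeroes out a list of floats.
class ZerosLike:
    def __call__(self, hidden_states):
        return [0.0 for _ in hidden_states]


no_op_attn = return_tuple_of_size(ZerosLike, size=2)()
print(no_op_attn([1.0, 2.0, 3.0]))  # ([0.0, 0.0, 0.0], None)
```

This mirrors the documented use case: a decoder layer that unpacks hidden_states, _ = self.self_attn(...) keeps working after the attention module is replaced by a no-op.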