unified_export_hf

Code that exports quantized Hugging Face models for deployment.

Functions

`export_hf_checkpoint`	Export quantized HuggingFace model checkpoint (transformers or diffusers).
`export_speculative_decoding`	Export speculative decoding HuggingFace model checkpoint.

export_hf_checkpoint(model, dtype=None, export_dir='/tmp', save_modelopt_state=False, components=None, extra_state_dict=None, max_shard_size='10GB', **kwargs)

Export quantized HuggingFace model checkpoint (transformers or diffusers).

This function automatically detects whether the model is from transformers or diffusers and applies the appropriate export logic.

Parameters:

model (Any) – The full torch model to export. The actual quantized model may be a submodule. Supports both transformers models (e.g., LlamaForCausalLM) and diffusers models/pipelines (e.g., StableDiffusionPipeline, UNet2DConditionModel).
dtype (dtype | None) – The weight data type used when exporting the unquantized layers. If None, the model's default data type is used.
export_dir (Path | str) – The target export path.
save_modelopt_state (bool) – Whether to save the modelopt state_dict.
components (list[str] | None) – Only used for diffusers pipelines. Optional list of component names to export. If None, all quantized components are exported.
extra_state_dict (dict[str, Tensor] | None) – Extra state dictionary to add to the exported model.
max_shard_size (int | str) – Maximum size of each safetensors shard file. Defaults to '10GB'.
**kwargs – Runtime-specific post-processing options forwarded to _postprocess_safetensors() for diffusion model exports. See its docstring for supported keys.
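The parameters above can be sketched as a single call. This is a minimal illustration, not the library's own usage example. It assumes the module is importable as modelopt.torch.export.unified_export_hf (the package prefix is not stated in this section). The wrapper export_quantized_model and its export_fn argument are hypothetical, added only so the call can be exercised without the library installed.

```python
from pathlib import Path


def export_quantized_model(model, export_dir, *, components=None,
                           max_shard_size="10GB", export_fn=None):
    """Hypothetical wrapper around export_hf_checkpoint (illustration only)."""
    if export_fn is None:
        # Assumed import path; only the module name unified_export_hf is documented.
        from modelopt.torch.export.unified_export_hf import (
            export_hf_checkpoint as export_fn,
        )

    # Pass an explicit directory instead of relying on the '/tmp' default.
    target = Path(export_dir)
    target.mkdir(parents=True, exist_ok=True)

    call_kwargs = {
        "dtype": None,                 # None: keep the model's default data type
        "export_dir": target,
        "save_modelopt_state": False,
        "max_shard_size": max_shard_size,
    }
    # components is documented as used only for diffusers pipelines,
    # so forward it only when the caller supplies it.
    if components is not None:
        call_kwargs["components"] = components

    export_fn(model, **call_kwargs)
    return target
```

With the library installed, export_quantized_model(quantized_model, "exported_ckpt") would forward to export_hf_checkpoint, where quantized_model stands for a model that has already been quantized.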

export_speculative_decoding(model, dtype=None, export_dir='/tmp')

Export speculative decoding HuggingFace model checkpoint.

Parameters:

model (Module)
dtype (dtype | None)
export_dir (Path | str)

Return type:

None
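Because export_speculative_decoding returns None, a caller that wants to confirm what was written has to inspect export_dir afterwards. The sketch below shows one way to do that. The helper export_draft_checkpoint and its export_fn argument are hypothetical, and the import path is an assumption, since this section names only the module unified_export_hf. No particular output file names are assumed.

```python
from pathlib import Path


def export_draft_checkpoint(model, export_dir, export_fn=None):
    """Hypothetical helper: export, then list what landed in export_dir."""
    if export_fn is None:
        # Assumed import path; not confirmed by this section.
        from modelopt.torch.export.unified_export_hf import (
            export_speculative_decoding as export_fn,
        )

    target = Path(export_dir)
    target.mkdir(parents=True, exist_ok=True)

    # The documented return type is None, so the result is not used.
    export_fn(model, dtype=None, export_dir=target)

    # Report the names of the entries found in the export directory.
    return sorted(p.name for p in target.iterdir())
```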