unified_export_hf
Code that exports quantized Hugging Face models for deployment.
Functions
export_hf_checkpoint | Export quantized HuggingFace model checkpoint (transformers or diffusers).
export_speculative_decoding | Export speculative decoding HuggingFace model checkpoint.
- export_hf_checkpoint(model, dtype=None, export_dir='/tmp', save_modelopt_state=False, components=None, extra_state_dict=None, max_shard_size='10GB', **kwargs)
Export quantized HuggingFace model checkpoint (transformers or diffusers).
This function automatically detects whether the model is from transformers or diffusers and applies the appropriate export logic.
- Parameters:
model (Any) – The full torch model to export. The actual quantized model may be a submodule. Supports both transformers models (e.g., LlamaForCausalLM) and diffusers models/pipelines (e.g., StableDiffusionPipeline, UNet2DConditionModel).
dtype (dtype | None) – The weight data type used when exporting the unquantized layers. If None, the model's default data type is used.
export_dir (Path | str) – The target export path.
save_modelopt_state (bool) – Whether to save the modelopt state_dict.
components (list[str] | None) – Only used for diffusers pipelines. Optional list of component names to export. If None, all quantized components are exported.
extra_state_dict (dict[str, Tensor] | None) – Extra state dictionary to add to the exported model.
max_shard_size (int | str) – Maximum size of each safetensors shard file. Defaults to “10GB”.
**kwargs – Runtime-specific post-processing options forwarded to _postprocess_safetensors() for diffusion model exports. See its docstring for supported keys.
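The parameters above can be exercised with a thin wrapper. This is a minimal sketch, not part of the library: export_quantized and its export_fn argument are hypothetical names introduced here, and the import path modelopt.torch.export is an assumption that may differ in your installed version. The defaults mirror the documented signature, except that export_dir must be passed explicitly.

```python
from pathlib import Path


def export_quantized(model, export_dir, export_fn=None, **options):
    """Call export_hf_checkpoint with the documented keyword arguments.

    export_fn is a hypothetical hook that lets a caller substitute the
    exporter (for example in a dry run); it is not a library parameter.
    """
    if export_fn is None:
        # Assumed import path; adjust it to match your installation.
        from modelopt.torch.export import export_hf_checkpoint as export_fn

    # Defaults copied from the documented signature.
    call_kwargs = {
        "dtype": None,                 # None keeps the model's default dtype
        "export_dir": Path(export_dir),
        "save_modelopt_state": False,
        "max_shard_size": "10GB",
    }
    # Caller-supplied options (e.g. components=[...]) override the defaults.
    call_kwargs.update(options)

    export_fn(model, **call_kwargs)
    return call_kwargs
```

With a quantized model in hand, a call such as export_quantized(model, "/tmp/exported", max_shard_size="5GB") forwards the overridden shard size and leaves the other arguments at their documented defaults. For a diffusers pipeline, components=[...] can be passed the same way; the valid component names depend on the pipeline.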
- export_speculative_decoding(model, dtype=None, export_dir='/tmp')
Export speculative decoding HuggingFace model checkpoint.
- Parameters:
model (Module)
dtype (dtype | None)
export_dir (Path | str)
- Return type:
None
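Because the function returns None, its result is only observable on disk. The sketch below assumes the export writes its files under export_dir; the file names it produces are not specified here. The helper list_exported_files is a hypothetical name and only reports what it finds there.

```python
from pathlib import Path


def list_exported_files(export_dir):
    """Return (relative_path, size_in_bytes) for every file under export_dir."""
    root = Path(export_dir)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )
```

Running list_exported_files(export_dir) after an export gives a quick check that the checkpoint was written and how large each file is.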