model_utils

Utility functions for model type detection and classification.

MODEL_NAME_TO_TYPE = {
    'GPT2': 'gpt',
    'Mllama': 'mllama',
    'Llama4': 'llama4',
    'Llama': 'llama',
    'Mistral': 'llama',
    'GPTJ': 'gptj',
    'FalconForCausalLM': 'falcon',
    'RWForCausalLM': 'falcon',
    'baichuan': 'baichuan',
    'MPT': 'mpt',
    'Bloom': 'bloom',
    'ChatGLM': 'chatglm',
    'QWen': 'qwen',
    'RecurrentGemma': 'recurrentgemma',
    'Gemma3': 'gemma3',
    'Gemma2': 'gemma2',
    'Gemma': 'gemma',
    'phi3small': 'phi3small',
    'phi3': 'phi3',
    'PhiMoEForCausalLM': 'phi3',
    'Phi4MMForCausalLM': 'phi4mm',
    'phi': 'phi',
    'TLGv4ForCausalLM': 'phi',
    'MixtralForCausalLM': 'llama',
    'ArcticForCausalLM': 'llama',
    'StarCoder': 'gpt',
    'Dbrx': 'dbrx',
    'T5': 't5',
    'Bart': 'bart',
    'GLM': 'glm',
    'InternLM2ForCausalLM': 'internlm',
    'ExaoneForCausalLM': 'exaone',
    'Nemotron': 'gpt',
    'Deepseek': 'deepseek',
    'Whisper': 'whisper',
    'gptoss': 'gptoss',
}
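
For illustration, the snippet below is a minimal sketch of how a model class name could be resolved to a type with this mapping. The longest-key-first, case-insensitive matching is an assumption for the example, not the module's own lookup code:

def lookup_model_type(model_class_name):
    # Hypothetical helper: match the class name against MODEL_NAME_TO_TYPE
    # keys case-insensitively, trying longer keys first so that, e.g.,
    # 'Llama4' and 'Gemma3' win over the shorter 'Llama' and 'Gemma'.
    name = model_class_name.lower()
    for key in sorted(MODEL_NAME_TO_TYPE, key=len, reverse=True):
        if key.lower() in name:
            return MODEL_NAME_TO_TYPE[key]
    return None

lookup_model_type("LlamaForCausalLM")    # 'llama'
lookup_model_type("Gemma3ForCausalLM")   # 'gemma3'
lookup_model_type("UnknownArch")         # None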

Functions

get_language_model_from_vl

Extract the language model component from a Vision-Language Model (VLM).

get_model_type

Try to get the model type from the model name.

is_multimodal_model

Check if a model is a Vision-Language Model (VLM) or multimodal model.

get_language_model_from_vl(model)

Extract the language model component from a Vision-Language Model (VLM).

This function handles the common patterns for accessing the language model component in various VLM architectures. It checks multiple possible locations where the language model might be stored.

Parameters:

model – The VLM model instance to extract the language model from

Returns:

(language_model, parent_model) where:
  • language_model: The extracted language model component, or None if not found

  • parent_model: The parent model containing the language_model attribute

Return type:

tuple

Examples

>>> # For LLaVA-style models
>>> lang_model, parent = get_language_model_from_vl(vlm_model)
>>> if lang_model is not None:
...     # Work with the language model component
...     quantized_lang_model = quantize(lang_model)
...     # Update the parent model
...     parent.language_model = quantized_lang_model
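
The "multiple possible locations" behavior can be pictured with a minimal sketch. It assumes common VLM layouts such as model.language_model or model.model.language_model and is not the exact implementation:

def find_language_model(vlm):
    # Hypothetical illustration: probe the usual attribute locations where
    # VLM wrappers keep their language model component.
    for parent in (vlm, getattr(vlm, "model", None)):
        if parent is not None and getattr(parent, "language_model", None) is not None:
            return parent.language_model, parent
    return None, None  # mirrors the None result when no language model is found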
get_model_type(model)

Try to get the model type from the model name. If not found, return None.
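
For example (a hedged sketch: the checkpoint name is illustrative, and the return value assumes the class name LlamaForCausalLM matches the 'Llama' entry in MODEL_NAME_TO_TYPE):

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
>>> get_model_type(model)
'llama'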

is_multimodal_model(model)

Check if a model is a Vision-Language Model (VLM) or multimodal model.

This function detects various multimodal model architectures by checking for:
  • Standard vision configurations (vision_config)

  • Language model attributes (language_model)

  • Specific multimodal model types (phi4mm)

  • Vision LoRA configurations

  • Audio processing capabilities

  • Image embedding layers

Parameters:

model – The HuggingFace model instance to check

Returns:

True if the model is detected as multimodal, False otherwise

Return type:

bool

Examples

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
>>> is_multimodal_model(model)
True
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct")
>>> is_multimodal_model(model)
True
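
A rough sketch of the checks listed above follows. The attribute names vision_lora and audio_processor are assumptions about typical configurations, not the library's exact logic:

def looks_multimodal(model):
    # Hypothetical illustration of the detection heuristics described above;
    # a real implementation may probe more attributes (e.g. image embedding layers).
    config = getattr(model, "config", None)
    if config is not None:
        if getattr(config, "vision_config", None) is not None:
            return True  # standard vision configuration
        if getattr(config, "model_type", "") == "phi4mm":
            return True  # specific multimodal model type
        if getattr(config, "vision_lora", None) is not None:
            return True  # vision LoRA configuration (assumed attribute name)
        if getattr(config, "audio_processor", None) is not None:
            return True  # audio capability (assumed attribute name)
    return getattr(model, "language_model", None) is not None  # language_model attribute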