model_utils

Utility functions for model type detection and classification.

MODEL_NAME_TO_TYPE = {
    'GPT2': 'gpt',
    'Mllama': 'mllama',
    'Llama4': 'llama4',
    'Llama': 'llama',
    'Mistral': 'llama',
    'GPTJ': 'gptj',
    'FalconForCausalLM': 'falcon',
    'RWForCausalLM': 'falcon',
    'baichuan': 'baichuan',
    'MPT': 'mpt',
    'Bloom': 'bloom',
    'ChatGLM': 'chatglm',
    'QWen': 'qwen',
    'RecurrentGemma': 'recurrentgemma',
    'Gemma3': 'gemma3',
    'Gemma2': 'gemma2',
    'Gemma': 'gemma',
    'phi3small': 'phi3small',
    'phi3': 'phi3',
    'PhiMoEForCausalLM': 'phi3',
    'Phi4MMForCausalLM': 'phi4mm',
    'phi': 'phi',
    'TLGv4ForCausalLM': 'phi',
    'MixtralForCausalLM': 'llama',
    'ArcticForCausalLM': 'llama',
    'StarCoder': 'gpt',
    'Dbrx': 'dbrx',
    'T5': 't5',
    'Bart': 'bart',
    'GLM': 'glm',
    'InternLM2ForCausalLM': 'internlm',
    'ExaoneForCausalLM': 'exaone',
    'Nemotron': 'gpt',
    'Deepseek': 'deepseek',
    'Whisper': 'whisper',
    'gptoss': 'gptoss',
}
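
For illustration, the snippet below is a minimal sketch of how a model class name could be resolved to a type with this mapping. The longest-key-first, case-insensitive matching is an assumption for the example, not the module's own lookup code:

def lookup_model_type(model_class_name):
    # Hypothetical helper: match the class name against MODEL_NAME_TO_TYPE
    # keys case-insensitively, trying longer keys first so that, e.g.,
    # 'Llama4' and 'Gemma3' win over the shorter 'Llama' and 'Gemma'.
    name = model_class_name.lower()
    for key in sorted(MODEL_NAME_TO_TYPE, key=len, reverse=True):
        if key.lower() in name:
            return MODEL_NAME_TO_TYPE[key]
    return None

lookup_model_type("LlamaForCausalLM")    # 'llama'
lookup_model_type("Gemma3ForCausalLM")   # 'gemma3'
lookup_model_type("UnknownArch")         # None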

Functions

get_language_model_from_vl

Extract the language model component from a Vision-Language Model (VLM).

get_model_type

Try to get the model type from the model name.

is_multimodal_model

Check if a model is a Vision-Language Model (VLM) or multimodal model.

get_language_model_from_vl(model)

Extract the language model component from a Vision-Language Model (VLM).

This function handles the common patterns for accessing the language model component in various VLM architectures. It checks multiple possible locations where the language model might be stored.

Parameters:

model – The VLM model instance to extract the language model from

Returns:

(language_model, parent_model) where:
  • language_model: The extracted language model component, or None if not found

  • parent_model: The parent model containing the language_model attribute

Return type:

tuple

Examples

>>> # For LLaVA-style models
>>> lang_model, parent = get_language_model_from_vl(vlm_model)
>>> if lang_model is not None:
...     # Work with the language model component
...     quantized_lang_model = quantize(lang_model)
...     # Update the parent model
...     parent.language_model = quantized_lang_model
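
The "multiple possible locations" behavior can be pictured with a minimal sketch. It assumes common VLM layouts such as model.language_model or model.model.language_model and is not the exact implementation:

def find_language_model(vlm):
    # Hypothetical illustration: probe the usual attribute locations where
    # VLM wrappers keep their language model component.
    for parent in (vlm, getattr(vlm, "model", None)):
        if parent is not None and getattr(parent, "language_model", None) is not None:
            return parent.language_model, parent
    return None, None  # mirrors the None result when no language model is found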
get_model_type(model)

Try to get the model type from the model name. If not found, return None.
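
For example (a hedged sketch: the checkpoint name is illustrative, and the return value assumes the class name LlamaForCausalLM matches the 'Llama' entry in MODEL_NAME_TO_TYPE):

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
>>> get_model_type(model)
'llama'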

is_multimodal_model(model)

Check if a model is a Vision-Language Model (VLM) or multimodal model.

This function detects various multimodal model architectures by checking for:
  • Standard vision configurations (vision_config)

  • Language model attributes (language_model)

  • Specific multimodal model types (phi4mm)

  • Vision LoRA configurations

  • Audio processing capabilities

  • Image embedding layers

Parameters:

model – The HuggingFace model instance to check

Returns:

True if the model is detected as multimodal, False otherwise

Return type:

bool

Examples

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
>>> is_multimodal_model(model)
True
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct")
>>> is_multimodal_model(model)
True
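
A rough sketch of the checks listed above follows. The attribute names vision_lora and audio_processor are assumptions about typical configurations, not the library's exact logic:

def looks_multimodal(model):
    # Hypothetical illustration of the detection heuristics described above;
    # a real implementation may probe more attributes (e.g. image embedding layers).
    config = getattr(model, "config", None)
    if config is not None:
        if getattr(config, "vision_config", None) is not None:
            return True  # standard vision configuration
        if getattr(config, "model_type", "") == "phi4mm":
            return True  # specific multimodal model type
        if getattr(config, "vision_lora", None) is not None:
            return True  # vision LoRA configuration (assumed attribute name)
        if getattr(config, "audio_processor", None) is not None:
            return True  # audio capability (assumed attribute name)
    return getattr(model, "language_model", None) is not None  # language_model attribute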