validate_model
Provides a function to validate a model. It runs a forward pass over a dataset, calculates the loss, and can optionally register hooks that capture the inputs and outputs of PyTorch modules; these captured activations are used for activation scoring during pruning.
TODO: Consider moving this to a separate module dedicated to scoring.
Uses native HuggingFace models with deci_x_patcher for heterogeneous layer configurations.
Functions
- prepare_dataloader(args, tokenizer=None)
- Parameters:
args (DictConfig)
tokenizer (PreTrainedTokenizerBase | None)
- Return type:
DataLoader
- prepare_model(args, descriptor, model=None)
- Parameters:
args (DictConfig)
descriptor (Type[ModelDescriptor])
model (PreTrainedModel | None)
- Return type:
Module
- validate_model(args, model=None, tokenizer=None, target_hidden_states_per_batch=None, return_hidden_states=False, calculate_full_score_ablations=False, val_dataloader=None)
Validate a language model on a dataset by calculating loss and optionally capturing activations.
- Parameters:
args (DictConfig) –
Configuration object containing the following attributes:
Model Configuration:
- model_name_or_path (str): Path to model checkpoint or HuggingFace model name. Required unless model is passed directly.
- model_dtype (str or torch.dtype): Model data type (e.g., "torch.bfloat16", torch.float16).
- autocast_dtype (str or torch.dtype): Autocast data type for mixed precision.
Dataset Configuration:
- dataset_path (str): Path to the validation dataset.
- tokenizer_name (str, optional): Tokenizer name/path. Uses model_name_or_path if not specified.
- data_column (str): Column name in dataset containing text data.
- block_size (int): Maximum sequence length for tokenization.
- eval_samples (int, optional): Number of samples to evaluate. Uses all if None.
- val_dataset_name (str): Name of validation dataset split.
- source_datasets_to_discard (list[str], optional): List of source datasets to exclude.
- load_dataset_fn (callable, optional): Custom function to load the dataset.
Data Processing:
- micro_batch_size (int): Batch size for evaluation.
- seed (int): Random seed for reproducibility.
- shuffle_seed (int, optional): Seed for shuffling data. Uses seed if None.
- varlen (bool): Enable variable-length sequences.
- bos_rate (float): Rate of adding a BOS token.
- fim_rate (float): Fill-in-the-middle rate for code completion tasks.
- fim_spm_rate (float): SPM-based fill-in-the-middle rate.
Activation Hooks:
- activations_log_dir (str, optional): Directory to log activation scores. If provided, hooks will be registered to capture activations.
- activation_hooks_kwargs (str or dict, optional): Arguments for activation hooks. If a string, comma-separated format: "arg1=val1,arg2=val2".
Execution Options:
- calc_losses_on_cpu (bool): Calculate losses on CPU to avoid OOM. Very slow; not recommended.
- write_results (bool): Write validation results to file.
model (PreTrainedModel | None) – Pre-loaded model. If None, will be loaded from args.model_name_or_path.
tokenizer (PreTrainedTokenizerBase | None) – Pre-loaded tokenizer. If None, will be loaded based on args.
target_hidden_states_per_batch (list[Tensor] | None) – Target hidden states for pipeline parallel evaluation.
return_hidden_states (bool) – Whether to return hidden states from the model.
calculate_full_score_ablations (bool) – Whether to calculate the full suite of teacher-similarity scores. If False, only a small suite is calculated, for efficiency.
val_dataloader (DataLoader | None) – Pre-created validation dataloader. If None, will be created from args.
- Returns:
A tuple (losses, hidden_states_per_batch):
- losses: Dictionary mapping loss names to loss statistics (avg, per_sample).
- hidden_states_per_batch: Hidden states and LM head outputs if return_hidden_states is True, else None.
Returns (None, None) if not on the master rank.
- Return type:
tuple
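When activation_hooks_kwargs is given as a string, it uses the comma-separated "arg1=val1,arg2=val2" format described above. A minimal sketch of how such a string could be turned into a keyword dictionary (the helper name and the exact parsing rules in validate_model are assumptions; values are kept as strings here):

```python
def parse_hooks_kwargs(spec):
    """Parse an activation_hooks_kwargs value into a dict.

    Hypothetical helper: a string of the form "arg1=val1,arg2=val2"
    is split into key/value pairs; a dict passes through unchanged;
    None yields an empty dict.
    """
    if spec is None:
        return {}
    if isinstance(spec, dict):
        return dict(spec)
    kwargs = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        key, _, value = pair.partition("=")
        kwargs[key.strip()] = value.strip()
    return kwargs
```

For example, parse_hooks_kwargs("layer_types=mlp,top_k=8") yields {"layer_types": "mlp", "top_k": "8"}; any type conversion of the values would be up to the hook implementation.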
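The returned losses dictionary maps each loss name to its statistics (avg, per_sample). A sketch of how such a dictionary might be assembled from per-sample loss values, to illustrate the shape of the return value (the loss name "lm_loss" and the aggregation details are assumptions, not the actual implementation):

```python
def summarize_losses(per_sample_losses):
    """Aggregate per-sample loss values into {name: {"avg": ..., "per_sample": [...]}}.

    Illustrative only: shows the (avg, per_sample) statistics shape
    described in the Returns section; the real function may track
    additional statistics.
    """
    summary = {}
    for name, values in per_sample_losses.items():
        summary[name] = {
            "avg": sum(values) / len(values) if values else float("nan"),
            "per_sample": list(values),
        }
    return summary

stats = summarize_losses({"lm_loss": [2.0, 4.0]})
# stats["lm_loss"]["avg"] == 3.0
```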