dataset_utils

Utility functions for loading dataset samples and building forward loop functions for different datasets.

Functions

`create_forward_loop`	Creates and returns a forward loop function configured for a specific model, dataset, and tokenizer.
`download_hf_dataset_as_jsonl`	Download a Hugging Face dataset and save as JSONL files.
`get_dataset_dataloader`	Get a dataloader with the dataset name and tokenizer of the target model.
`get_dataset_samples`	Load a portion of a dataset with the dataset name and a given size.
`get_jsonl_text_samples`	Load up to `num_samples` entries from a JSONL file using the field named by `key` (`text` by default).
`get_max_batch_size`	Get the maximum batch size that can be used for the model.
`get_supported_datasets`	Retrieves the list of supported datasets.

create_forward_loop(model=None, dataset_name='cnn_dailymail', tokenizer=None, batch_size=1, num_samples=512, max_sample_length=512, device=None, include_labels=False, dataloader=None, allowed_non_tensor_keys=None)

Creates and returns a forward loop function configured for a specific model, dataset, and tokenizer.

This function initializes a forward loop function tailored to process batches of data from the specified dataset using the given model and tokenizer. The forward loop function, when called, iterates over the dataset, applies the tokenizer to prepare the input data, feeds it into the model, and returns the model’s predictions.

Parameters:

model (Module | None) – The PyTorch model for inference.
dataset_name (str) – The name of the dataset to be used. Must be one of the datasets in get_supported_datasets().
tokenizer (PreTrainedTokenizerBase | None) – The tokenizer used to preprocess text data into a format suitable for the model.
batch_size (int) – Batch size of the returned dataloader. If 0 is provided, the batch size is determined automatically.
num_samples (int) – Number of samples from the dataset.
max_sample_length (int) – Maximum length of a sample.
device (str | None) – Target device for the returned dataloader.
include_labels (bool) – Whether to include labels in the dataloader.
dataloader (DataLoader | None) – If provided, use the provided dataloader instead.
allowed_non_tensor_keys (set | None) – Set of key names whose batch values may be non-tensor types. Useful when the dataloader yields batches with non-standard fields (e.g., nested model outputs).

Return type:

Callable

Example usage for quantization:

import modelopt.torch.quantization as mtq
from modelopt.torch.utils import create_forward_loop

# Initialize model and tokenizer
# ...

# Create forward loop for calibration
forward_loop = create_forward_loop(
    model=model, dataset_name="cnn_dailymail", tokenizer=tokenizer
)

# Quantize the model with the calibration dataset
mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

Returns:

A forward loop function that can be called with no arguments. When called, this function iterates over the dataset specified by dataset_name.
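The closure pattern described above can be illustrated with a minimal, library-free sketch. The names `make_forward_loop` and `toy_model` are hypothetical and are not part of modelopt; the sketch also assumes each batch is a mapping of keyword arguments and simply discards the model outputs.

```python
from typing import Any, Callable, Iterable, Mapping


def make_forward_loop(
    model: Callable[..., Any],
    batches: Iterable[Mapping[str, Any]],
) -> Callable[[], None]:
    """Hypothetical helper: bind a model and its batches into a no-argument callable."""

    def forward_loop() -> None:
        # Run one forward pass per batch. This sketch ignores the outputs.
        for batch in batches:
            model(**batch)

    return forward_loop


# Stand-in "model" that records the inputs it receives.
seen = []


def toy_model(input_ids):
    seen.append(input_ids)


loop = make_forward_loop(toy_model, [{"input_ids": [1, 2]}, {"input_ids": [3]}])
loop()  # iterates over both batches
```

Because the model and data are captured when the closure is built, the caller only needs to invoke the returned function.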

download_hf_dataset_as_jsonl(dataset_name, output_dir, json_keys=['text'], name=None, split=None, max_samples_per_split=None, num_proc=None)

Download a Hugging Face dataset and save as JSONL files.

Parameters:

dataset_name (str) – Name or HuggingFace path of the dataset to download
output_dir (str | Path) – Directory to save the JSONL files
json_keys (str | list[str]) – Key or list of keys to extract from the dataset. Defaults to ["text"].
name (str | None) – Name of the subset to download
split (str | None) – Split of the dataset to download. Defaults to None (all splits).
max_samples_per_split (int | None) – Maximum number of samples to download per split. Defaults to None.
num_proc (int | None) – Number of processes to use for parallel processing. Defaults to None.

Returns:

List of paths to downloaded JSONL files.

Return type:

list[str]
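To make the role of json_keys concrete, here is a standard-library sketch of writing records to a JSONL file while keeping only selected keys. The helper `write_jsonl` and the file name `train.jsonl` are hypothetical. This is not the library's implementation, and it does not download anything.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, output_path, json_keys="text"):
    """Hypothetical helper: write the selected keys of each record as one JSON line."""
    # Accept either a single key or a list of keys, mirroring `str | list[str]`.
    keys = [json_keys] if isinstance(json_keys, str) else list(json_keys)
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        for record in records:
            row = {k: record[k] for k in keys}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return str(output_path)


records = [{"text": "hello", "id": 1}, {"text": "world", "id": 2}]
with tempfile.TemporaryDirectory() as tmp:
    path = write_jsonl(records, Path(tmp) / "train.jsonl", json_keys=["text"])
    lines = Path(path).read_text(encoding="utf-8").splitlines()
```

Each output line holds one JSON object, so the file can be read back one line at a time.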

get_dataset_dataloader(dataset_name='cnn_dailymail', tokenizer=None, batch_size=1, num_samples=512, max_sample_length=512, device=None, include_labels=False, apply_chat_template=False)

Get a dataloader with the dataset name and tokenizer of the target model.

Parameters:

dataset_name (str | list[str]) – Name of the dataset to load, a path to a .jsonl file, or a list mixing the two. Each entry is loaded via get_dataset_samples() and the resulting samples are concatenated before tokenization. num_samples may be an int (applied to a single source) or a list aligned with dataset_name.
tokenizer (PreTrainedTokenizerBase | None) – Instance of HuggingFace tokenizer.
batch_size (int) – Batch size of the returned dataloader.
num_samples (int | list[int]) – Number of samples from the dataset.
max_sample_length (int) – Maximum length of a sample.
device (device | None) – Target device for the returned dataloader.
include_labels (bool) – Whether to include labels in the dataloader.
apply_chat_template (bool) – Whether to apply the chat template to the samples (if supported by the dataset).

Returns:

An instance of dataloader.

Return type:

DataLoader
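The pairing of dataset_name entries with num_samples entries can be sketched as follows. `align_sources` is a hypothetical helper and `calib.jsonl` is a made-up file name. Raising on a length mismatch is an assumption of this sketch, not documented library behavior.

```python
def align_sources(dataset_name, num_samples):
    """Hypothetical helper: pair each data source with its sample count."""
    names = [dataset_name] if isinstance(dataset_name, str) else list(dataset_name)
    counts = [num_samples] if isinstance(num_samples, int) else list(num_samples)
    if len(names) != len(counts):
        # Assumption: mismatched lengths are treated as an error in this sketch.
        raise ValueError(
            f"Got {len(names)} data sources but {len(counts)} sample counts."
        )
    return list(zip(names, counts))


# A registered dataset name mixed with a local .jsonl path.
pairs = align_sources(["cnn_dailymail", "calib.jsonl"], [256, 128])
single = align_sources("cnn_dailymail", 512)
```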

get_dataset_samples(dataset_name, num_samples, *, apply_chat_template=False, tokenizer=None, split=None)

Load a portion of a dataset with the dataset name and a given size.

Supports both registered datasets (in SUPPORTED_DATASET_CONFIG) and arbitrary HuggingFace datasets. Unregistered datasets are auto-detected by column names: messages/conversations (chat), prompt, text, or input.

Parameters:

dataset_name (str) – Name or HuggingFace path of the dataset to load, a local directory path, or a path to a .jsonl file. For local directory paths, the predefined config from SUPPORTED_DATASET_CONFIG is matched if the base folder name matches a registered key (e.g. /hf-local/abisee/cnn_dailymail matches cnn_dailymail key). For .jsonl paths, the file is first loaded via HuggingFace’s json builder and routed through the same auto-preprocess path as unregistered HF datasets so chat / prompt / text columns are handled consistently with live HF datasets. If that path fails on JSON parsing or PyArrow schema unification, it falls back to a line-by-line reader that extracts the legacy text field for backward compatibility. The fallback is also used when the optional datasets package isn’t installed, preserving legacy plain-.jsonl workflows in base installations. Local JSONL files only expose the train split; passing any other split raises.
num_samples (int) – Number of samples to load from the dataset.
apply_chat_template (bool) – Whether to apply the chat template to the samples (if supported by the dataset). For unregistered datasets with a messages column, chat template is always applied regardless of this flag.
tokenizer (PreTrainedTokenizerBase | None) – Tokenizer used to apply the chat template to the samples. The samples are not tokenized; plain text is still returned.
split (str | list[str] | None) – Override the split(s) to load. Accepts a single split name or a list. If None, uses the splits defined in SUPPORTED_DATASET_CONFIG for registered datasets, or ["train"] for unregistered datasets.

Returns:

The list of samples.

Return type:

Samples
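The column-based auto-detection for unregistered datasets can be pictured with a small sketch. `detect_sample_format` is a hypothetical function. The priority order (chat columns first, then prompt, text, input) follows the order in which the columns are listed above and is an assumption. The actual library logic may differ.

```python
def detect_sample_format(column_names):
    """Hypothetical helper: choose a preprocessing route from dataset column names."""
    columns = set(column_names)
    # Chat-style datasets are recognized by either of two column names.
    if "messages" in columns or "conversations" in columns:
        return "chat"
    # Assumed priority order for plain-text style columns.
    for name in ("prompt", "text", "input"):
        if name in columns:
            return name
    raise ValueError(f"No recognized column in {sorted(columns)}")


chat_route = detect_sample_format(["id", "messages"])
text_route = detect_sample_format(["text", "url"])
prompt_route = detect_sample_format(["prompt", "text"])
```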

get_jsonl_text_samples(jsonl_path, num_samples, key='text')

Load up to num_samples entries from a JSONL file using the field named by key (text by default).

Each non-empty line must be a JSON object containing that field.

Parameters:

jsonl_path (str)
num_samples (int)
key (str)

Return type:

list[str]
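A line-by-line reader of this kind can be sketched with the standard library alone. `read_jsonl_field` is a hypothetical name, and the code illustrates the described behavior rather than reproducing the library's implementation.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl_field(jsonl_path, num_samples, key="text"):
    """Hypothetical helper: collect up to `num_samples` values of `key` from a JSONL file."""
    samples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if len(samples) >= num_samples:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            samples.append(record[key])  # raises KeyError if the field is missing
    return samples


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "samples.jsonl"
    path.write_text(
        '{"text": "first"}\n\n{"text": "second"}\n{"text": "third"}\n',
        encoding="utf-8",
    )
    first_two = read_jsonl_field(path, num_samples=2)
    everything = read_jsonl_field(path, num_samples=10)
```

Asking for more samples than the file contains returns every available entry, which matches the "up to num_samples" wording.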

get_max_batch_size(model, max_sample_length=512, sample_memory_usage_ratio=1.0, sample_input_single_batch=None, enable_grad=False)

Get the maximum batch size that can be used for the model.

Parameters:

model (Module)
max_sample_length (int)
sample_memory_usage_ratio (float)
sample_input_single_batch (Tensor | None)
enable_grad (bool)

get_supported_datasets()

Retrieves the list of supported datasets.

Returns:

A list of strings, where each string is the name of a supported dataset.

Return type:

list[str]

Example usage:

from modelopt.torch.utils import get_supported_datasets

print("Supported datasets:", get_supported_datasets())