vlm_dataset_utils

Utility functions for building samples and dataloaders from different VLM calibration datasets.

This module supports both:

  • Small non-streaming VLM datasets (e.g., ScienceQA)

  • Large streaming VLM datasets (e.g., Nemotron-VLM-Dataset-v2), where downloading the entire dataset should be avoided.

Functions

get_supported_vlm_datasets

Retrieves the list of supported VLM datasets.

get_vlm_dataset_dataloader

Builds a dataloader for the named dataset using the target model's processor.

get_supported_vlm_datasets()

Retrieves the list of supported VLM datasets.

Returns:

A list of strings, where each string is the name of a supported dataset (e.g., "scienceqa").

Return type:

list[str]

Example usage:

from modelopt.torch.utils import get_supported_vlm_datasets

print("Supported datasets:", get_supported_vlm_datasets())

get_vlm_dataset_dataloader(dataset_name='scienceqa', processor=None, batch_size=1, num_samples=512, device=None, max_length=None, require_image=True, subsets=None, shuffle_buffer_size=10000, seed=42, image_root=None, use_media_shards=True, max_shards=None)

Builds a dataloader for the named dataset using the target model's processor.

Parameters:
  • dataset_name (str) – Name of the dataset to load.

  • processor (Any) – Processor used for encoding images and text data.

  • batch_size (int) – Batch size of the returned dataloader.

  • num_samples (int) – Number of samples to draw from the dataset.

  • device (str | device | None) – Device to move returned tensors to. If None, keep on CPU.

  • max_length (int | None) – Optional max length for text tokenization (if supported by the processor).

  • require_image (bool) – If True, keep only samples that have an image field.

  • subsets (list[str] | None) – Optional subset names to restrict loading to, for datasets split into multiple subsets.

  • shuffle_buffer_size (int) – Buffer size used when shuffling streaming datasets.

  • seed (int) – Random seed used for shuffling.

  • image_root (str | Path | None) – Optional root directory for resolving relative image paths.

  • use_media_shards (bool) – Whether to read images from media shards for sharded streaming datasets.

  • max_shards (int | None) – Optional cap on the number of shards to download for streaming datasets.

Returns:

A DataLoader that yields processed calibration batches.

Return type:

DataLoader
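Example usage (a minimal sketch: the processor checkpoint name and batch handling below are illustrative assumptions, and running it requires modelopt, transformers, and a downloaded model):

```python
from transformers import AutoProcessor

from modelopt.torch.utils import get_vlm_dataset_dataloader

# Illustrative checkpoint; substitute the processor of your target VLM.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

dataloader = get_vlm_dataset_dataloader(
    dataset_name="scienceqa",
    processor=processor,
    batch_size=2,
    num_samples=64,
    device="cuda",
)

for batch in dataloader:
    # Feed each processed batch to the model's forward pass for calibration.
    ...
```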