nemotron_vlm_dataset_utils

Nemotron VLM dataset utilities.

This module contains the Nemotron-VLM-Dataset-v2 specific logic:
  • Subsets can store images in media/shard_*.tar (images only).

  • Prompts/messages live in <subset>/<subset>.jsonl and reference the image filename (e.g. 292180.png).

We join the tar images with the JSONL messages by the shared filename and yield samples compatible with our VLM calibration pipeline.
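The join described above can be sketched as follows. This is only an illustration, not the module's implementation: the function name `join_tar_with_jsonl`, the `get_image_name` callback, and the output keys `id`, `image_bytes`, and `messages` are all hypothetical, and the sketch assumes each JSONL record carries its messages under a `messages` key.

```python
import io
import json
import posixpath
import tarfile
from typing import Any, Callable, Iterator


def join_tar_with_jsonl(
    tar_bytes: bytes,
    jsonl_text: str,
    get_image_name: Callable[[dict], Any],
) -> Iterator[dict]:
    """Yield one sample per tar image that has a matching JSONL record.

    `get_image_name` is a hypothetical callback that returns the image
    filename referenced by a record (the real record layout is not
    documented here, so it is left to the caller).
    """
    # Index JSONL records by the bare image filename they reference.
    records: dict = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        name = get_image_name(record)
        if isinstance(name, str) and name:
            records[posixpath.basename(name)] = record

    # Walk the tar once and emit a sample for every image with a record.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key = posixpath.basename(member.name)
            record = records.get(key)
            if record is None:
                continue  # image without messages: skip
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            yield {
                "id": key,
                "image_bytes": extracted.read(),
                "messages": record.get("messages"),  # assumed key
            }
```

Indexing the JSONL side first keeps the tar pass sequential, which suits tar archives because they have no central index for random access.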

Classes

NemotronTarPlusJsonlIterable

Join Nemotron VLM media/shard_*.tar (images-only) with <subset>/<subset>.jsonl (messages).

Functions

extract_first_image_from_messages

Best-effort extraction of an image reference from Nemotron-style messages.

list_repo_files_cached

List files in a HuggingFace repo (cached).

class NemotronTarPlusJsonlIterable

Bases: IterableDataset

Join Nemotron VLM media/shard_*.tar (images-only) with <subset>/<subset>.jsonl (messages).

__init__(repo_id, subsets, shard_paths, num_samples, seed, shuffle_buffer_size, max_shards)

Create an iterable dataset for Nemotron-VLM-Dataset-v2.

Parameters:
  • repo_id (str) – Dataset repo id, e.g. “nvidia/Nemotron-VLM-Dataset-v2”.

  • subsets (list[str]) – Subset names to draw from (e.g., “sparsetables”).

  • shard_paths (list[str]) – Tar shard paths under <subset>/media/.

  • num_samples (int) – Total number of samples to yield.

  • seed (int) – RNG seed for sampling.

  • shuffle_buffer_size (int) – Unused for now (kept for API compatibility).

  • max_shards (int | None) – Max number of shards to use per subset (limits downloads).
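As an illustration of how shard_paths and max_shards might interact, the sketch below picks tar shards per subset from a flat list of repo files and caps the count. The helper name `select_shards` is hypothetical, and taking the first shards in sorted order is an assumption; the actual class may choose shards differently (for example, using seed).

```python
from fnmatch import fnmatchcase
from typing import Optional


def select_shards(
    repo_files: list,
    subsets: list,
    max_shards: Optional[int] = None,
) -> dict:
    """Map each subset to at most `max_shards` tar shard paths."""
    selected = {}
    for subset in subsets:
        # Shards live under <subset>/media/ and are named shard_*.tar.
        pattern = f"{subset}/media/shard_*.tar"
        shards = sorted(p for p in repo_files if fnmatchcase(p, pattern))
        if max_shards is not None:
            # Assumption: keep the first N in sorted order to limit downloads.
            shards = shards[:max_shards]
        selected[subset] = shards
    return selected
```

Capping per subset rather than globally keeps every requested subset represented even when max_shards is small.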

extract_first_image_from_messages(messages)

Best-effort extraction of an image reference from Nemotron-style messages.

Parameters:
  • messages (Any)

Return type:

Any
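A best-effort extraction of this kind could look like the sketch below. The message layout is not documented on this page, so the schema is an assumption: a list of message dicts whose `content` is a list of parts, with image parts shaped like `{"type": "image", "image": "<filename>"}`. The function name `first_image_ref` is hypothetical and is not the module's implementation.

```python
from typing import Any


def first_image_ref(messages: Any) -> Any:
    """Return the first image reference found, or None if there is none.

    "Best-effort" here means unexpected shapes are skipped, never raised on.
    """
    if not isinstance(messages, list):
        return None
    for message in messages:
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image part
        for part in content:
            # Assumed part shape: {"type": "image", "image": "<filename>"}
            if isinstance(part, dict) and part.get("type") == "image":
                ref = part.get("image")
                if ref:
                    return ref
    return None
```

Returning None instead of raising lets a caller skip malformed records and keep iterating.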

list_repo_files_cached(repo_id, repo_type='dataset')

List files in a HuggingFace repo (cached).

Parameters:
  • repo_id (str) – HF repo id (e.g., a dataset repo).

  • repo_type (str) – HF repo type, usually “dataset” here.

Return type:

list[str]
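One way to cache a repo listing is sketched below. The names `make_cached_lister` and `hub_fetch` are hypothetical; only `huggingface_hub.list_repo_files` is a real library call, and whether this module wraps it the same way is an assumption. The fetcher is injectable so the caching can be exercised without network access.

```python
from functools import lru_cache
from typing import Callable


def make_cached_lister(fetch: Callable[[str, str], list]) -> Callable[..., list]:
    """Wrap `fetch(repo_id, repo_type)` so each repo is listed only once."""

    @lru_cache(maxsize=None)
    def _cached(repo_id: str, repo_type: str) -> tuple:
        # Store an immutable tuple so callers cannot mutate the cached value.
        return tuple(fetch(repo_id, repo_type))

    def list_files(repo_id: str, repo_type: str = "dataset") -> list:
        # Return a fresh list on every call.
        return list(_cached(repo_id, repo_type))

    return list_files


def hub_fetch(repo_id: str, repo_type: str) -> list:
    """Fetch the listing from the Hub (requires huggingface_hub and network)."""
    from huggingface_hub import list_repo_files  # imported lazily

    return list_repo_files(repo_id, repo_type=repo_type)
```

A caller would build the lister once, for example `list_files = make_cached_lister(hub_fetch)`, and repeated calls with the same repo_id and repo_type would then be served from memory.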