nemotron_vlm_dataset_utils
Nemotron VLM dataset utilities.
This module contains the logic specific to Nemotron-VLM-Dataset-v2:
- Subsets can store images in media/shard_*.tar (images only).
- Prompts/messages live in <subset>/<subset>.jsonl and reference the image by filename (e.g. 292180.png).
We join the tar images with the JSONL messages by the shared filename and yield samples compatible with our VLM calibration pipeline.
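The join described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the helper names `index_tar_images` and `join_by_filename`, the `image` record key, and the shape of the yielded sample are all assumptions made for the example.

```python
import io
import json
import tarfile
from pathlib import PurePosixPath
from typing import Any, Iterator


def index_tar_images(tar_bytes: bytes) -> dict[str, bytes]:
    """Map each regular file's basename (e.g. '292180.png') to its raw bytes."""
    images: dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            fileobj = tar.extractfile(member)
            if fileobj is not None:
                images[PurePosixPath(member.name).name] = fileobj.read()
    return images


def join_by_filename(
    images: dict[str, bytes],
    jsonl_text: str,
    image_key: str = "image",  # hypothetical key; the real JSONL schema may differ
) -> Iterator[dict[str, Any]]:
    """Yield one sample per JSONL record whose image filename exists in the tar index."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        name = record.get(image_key)
        if isinstance(name, str) and name in images:
            yield {"image_bytes": images[name], "messages": record.get("messages")}
```

Records whose image is not present in the indexed shard are skipped here; whether the real iterable skips, defers, or reports such records is not stated in this page.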
Classes
NemotronTarPlusJsonlIterable | Join Nemotron VLM media/shard_*.tar (images-only) with <subset>/<subset>.jsonl (messages).
Functions
extract_first_image_from_messages | Best-effort extraction of an image reference from Nemotron-style messages.
list_repo_files_cached | List files in a HuggingFace repo (cached).
- class NemotronTarPlusJsonlIterable
Bases: IterableDataset
Join Nemotron VLM media/shard_*.tar (images-only) with <subset>/<subset>.jsonl (messages).
- __init__(repo_id, subsets, shard_paths, num_samples, seed, shuffle_buffer_size, max_shards)
Create an iterable dataset for Nemotron-VLM-Dataset-v2.
- Parameters:
repo_id (str) – Dataset repo id, e.g. “nvidia/Nemotron-VLM-Dataset-v2”.
subsets (list[str]) – Subset names to draw from (e.g., “sparsetables”).
shard_paths (list[str]) – Tar shard paths under <subset>/media/.
num_samples (int) – Total number of samples to yield.
seed (int) – RNG seed for sampling.
shuffle_buffer_size (int) – Unused for now (kept for API compatibility).
max_shards (int | None) – Max number of shards to use per subset (limits downloads).
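To illustrate what a per-subset shard limit could look like, the sketch below groups shard paths by their leading `<subset>/` component and keeps at most `max_shards` from each group. The function name `limit_shards_per_subset` and the choice of lexicographic ordering are assumptions for the example; the class may select shards differently.

```python
from collections import defaultdict
from typing import Optional


def limit_shards_per_subset(
    shard_paths: list[str], max_shards: Optional[int]
) -> list[str]:
    """Keep at most `max_shards` shard paths per subset (all of them if None)."""
    by_subset: dict[str, list[str]] = defaultdict(list)
    for path in sorted(shard_paths):
        # Assumes paths look like '<subset>/media/shard_*.tar'.
        subset = path.split("/", 1)[0]
        by_subset[subset].append(path)

    limited: list[str] = []
    for subset in sorted(by_subset):
        shards = by_subset[subset]
        limited.extend(shards if max_shards is None else shards[:max_shards])
    return limited
```

Capping shards per subset bounds how many tar files need to be downloaded, which matches the stated purpose of the parameter.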
- extract_first_image_from_messages(messages)
Best-effort extraction of an image reference from Nemotron-style messages.
- Parameters:
messages (Any)
- Return type:
Any
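Since the message schema is not spelled out here, the following is only a sketch of what a best-effort lookup might do. It assumes, hypothetically, that messages form a list of dicts whose `content` is a list of parts, and that an image part looks like `{"type": "image", "image": "<filename>"}`. The name `first_image_ref_sketch` is illustrative and is not the library function.

```python
from typing import Any


def first_image_ref_sketch(messages: Any) -> Any:
    """Return the first image reference found, or None if nothing matches."""
    if not isinstance(messages, list):
        return None
    for message in messages:
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image part in this assumed schema
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image":
                ref = part.get("image")
                if ref is not None:
                    return ref
    return None
```

Returning `None` instead of raising on unexpected shapes is one way to read "best-effort"; the actual function's behavior on malformed input is not documented on this page.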
- list_repo_files_cached(repo_id, repo_type='dataset')
List files in a HuggingFace repo (cached).
- Parameters:
repo_id (str) – HF repo id (e.g., a dataset repo).
repo_type (str) – HF repo type, usually “dataset” here.
- Return type:
list[str]
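A cached listing of this kind can be sketched with `functools.lru_cache`. In the example below the fetch function is injected so the snippet runs without network access; in practice it could wrap a call such as `huggingface_hub.list_repo_files`, but that wiring is an assumption and not confirmed by this page. The name `make_cached_lister` is hypothetical.

```python
from functools import lru_cache
from typing import Callable, Iterable


def make_cached_lister(
    fetch: Callable[[str, str], Iterable[str]],
) -> Callable[..., list[str]]:
    """Wrap `fetch(repo_id, repo_type)` so repeated lookups reuse the first result."""

    @lru_cache(maxsize=None)
    def _cached(repo_id: str, repo_type: str) -> tuple[str, ...]:
        # Store an immutable tuple so callers cannot mutate the cached value.
        return tuple(fetch(repo_id, repo_type))

    def list_files(repo_id: str, repo_type: str = "dataset") -> list[str]:
        # Return a fresh list on every call.
        return list(_cached(repo_id, repo_type))

    return list_files
```

Caching matters here because the file listing for a large dataset repo is typically requested once per subset, and each uncached request would be a separate remote call.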