nvalchemi.data.Dataset
- class nvalchemi.data.Dataset(reader, *, device=None, num_workers=2)
AtomicData-native dataset that bypasses TensorDict conversion.
Wraps a Reader and returns AtomicData objects directly, with CUDA-stream prefetching support.
- Parameters:
reader (Reader | ReaderProtocol) – Reader providing raw tensor dicts from a data source.
device (str | torch.device | None, default=None) – Target device. "auto" picks CUDA if available, otherwise CPU.
num_workers (int, default=2) – Thread pool size for async prefetch.
- target_device
Resolved target device for data transfer.
- Type:
torch.device | None
- num_workers
Number of worker threads for prefetching.
- Type:
int
Examples
>>> from nvalchemi.data.datapipes.dataset import Dataset
>>> from nvalchemi.data.datapipes.backends.base import Reader
>>> # Assuming a concrete Reader implementation exists:
>>> # reader = MyReader("dataset.zarr")
>>> # ds = Dataset(reader, device="cpu")
>>> # atomic_data, meta = ds[0]
- cancel_prefetch(index=None)
Cancel pending prefetch operations.
- Parameters:
index (int | None, default=None) – Specific index to cancel, or None to cancel all.
- Return type:
None
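The cancellation semantics can be sketched with a plain index-to-future table. This is an illustrative stdlib model of the documented behavior, not the library's internals; the names `PrefetchTable` and `pending` are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import time

class PrefetchTable:
    """Minimal sketch of index -> Future bookkeeping for cancellation."""

    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=2)
        self.pending = {}  # index -> Future

    def prefetch(self, index):
        if index in self.pending:  # already in flight: no-op
            return
        self.pending[index] = self.executor.submit(time.sleep, 0.2)

    def cancel_prefetch(self, index=None):
        """Cancel one pending index, or all pending indices when None."""
        targets = list(self.pending) if index is None else [index]
        for i in targets:
            fut = self.pending.pop(i, None)
            if fut is not None:
                fut.cancel()  # best-effort: already-running tasks complete

table = PrefetchTable()
for i in range(4):
    table.prefetch(i)
table.cancel_prefetch(3)     # cancel a single index
table.cancel_prefetch(None)  # cancel everything still pending
print(len(table.pending))    # -> 0
table.executor.shutdown()
```

Note that `Future.cancel()` is best-effort: a task that has already started running cannot be stopped, only ignored.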
- close()
Release resources held by the dataset.
Drains pending prefetch futures, shuts down the thread pool executor, and closes the underlying reader.
- Return type:
None
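The teardown order described above (drain futures, stop workers, then close the reader) matters: a worker thread must never touch the reader after it has been closed. A minimal stdlib sketch of that ordering, with hypothetical names (`ClosableDataset`, `FakeReader`) standing in for the real classes:

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time

class FakeReader:
    """Stand-in reader that only tracks whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class ClosableDataset:
    """Sketch of the teardown order: drain, shut down workers, close reader."""

    def __init__(self, reader):
        self.reader = reader
        self.executor = ThreadPoolExecutor(max_workers=2)
        self.pending = {}

    def prefetch(self, index):
        self.pending.setdefault(index, self.executor.submit(time.sleep, 0.05))

    def close(self):
        wait(list(self.pending.values()))  # 1) drain in-flight prefetches
        self.pending.clear()
        self.executor.shutdown(wait=True)  # 2) join worker threads
        self.reader.close()                # 3) release reader resources last

ds = ClosableDataset(FakeReader())
ds.prefetch(0)
ds.close()
print(ds.reader.closed, len(ds.pending))  # -> True 0
```

In application code, calling `close()` in a `try`/`finally` block (or wrapping the dataset's lifetime in a context manager) ensures the reader's file handles are released even on error.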
- get_metadata(index)
Return lightweight metadata for a sample without full construction.
Loads the raw tensor dictionary from the reader and extracts shape information for atom and edge counts, avoiding the overhead of full AtomicData construction and validation.
- Parameters:
index (int) – Sample index.
- Returns:
(num_atoms, num_edges) for the sample.
- Return type:
tuple[int, int]
- Raises:
IndexError – If index is out of range.
KeyError – If the sample dict does not contain "atomic_numbers".
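Extracting counts from raw shapes alone can be sketched as follows. The function name and the `"edge_index"` key are assumptions for illustration; only the `"atomic_numbers"` key and the KeyError behavior come from the documentation above:

```python
def get_metadata_from_raw(sample):
    """Derive (num_atoms, num_edges) from raw tensor shapes alone.

    No full AtomicData construction or validation is performed.
    Raises KeyError if "atomic_numbers" is missing, as documented.
    """
    num_atoms = len(sample["atomic_numbers"])  # KeyError if absent
    edge_index = sample.get("edge_index")      # assumed field name
    num_edges = len(edge_index[0]) if edge_index else 0
    return num_atoms, num_edges

# Water-like toy sample: 3 atoms, 3 directed edges.
raw = {"atomic_numbers": [1, 1, 8], "edge_index": [[0, 1, 2], [1, 2, 0]]}
print(get_metadata_from_raw(raw))  # -> (3, 3)
```

This kind of cheap metadata lookup is what makes length-based bucketing or batch planning feasible without materializing every sample.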
- prefetch(index, stream=None)
Submit a sample for async prefetching.
If the sample is already being prefetched, this is a no-op.
- Parameters:
index (int) – Sample index.
stream (torch.cuda.Stream | None, default=None) – CUDA stream for GPU operations.
- Return type:
None
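A typical use of `prefetch` is a look-ahead loop that requests sample `i + 1` while sample `i` is being consumed. The sketch below uses a stand-in class (`FakeDataset`, `iterate_with_lookahead` are hypothetical names) that mirrors the documented no-op-if-already-prefetching semantics:

```python
class FakeDataset:
    """Stand-in with the same prefetch/__getitem__ surface as Dataset."""

    def __init__(self, data):
        self.data = data
        self.prefetched = []

    def prefetch(self, index, stream=None):
        if index not in self.prefetched:  # mirror the no-op semantics
            self.prefetched.append(index)

    def __getitem__(self, index):
        return self.data[index]

def iterate_with_lookahead(ds, n):
    ds.prefetch(0)  # warm up the first sample
    out = []
    for i in range(n):
        if i + 1 < n:
            ds.prefetch(i + 1)  # overlap fetching i+1 with consuming i
        out.append(ds[i])
    return out

ds = FakeDataset(["a", "b", "c"])
print(iterate_with_lookahead(ds, 3))  # -> ['a', 'b', 'c']
```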
- prefetch_batch(indices, streams=None)
Prefetch multiple samples asynchronously.
- Parameters:
indices (Sequence[int]) – Sample indices to prefetch.
streams (Sequence[torch.cuda.Stream] | None, default=None) – CUDA streams to distribute across. Streams are assigned round-robin to the indices.
- Return type:
None
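The round-robin assignment of streams to indices can be sketched in isolation. Strings stand in for `torch.cuda.Stream` objects here, and the function name is illustrative, not part of the library's API:

```python
def assign_round_robin(indices, streams):
    """Pair each index with a stream, cycling through streams in order."""
    if not streams:
        # No streams supplied: every index transfers on the default stream.
        return [(i, None) for i in indices]
    return [(idx, streams[k % len(streams)]) for k, idx in enumerate(indices)]

pairs = assign_round_robin([10, 11, 12, 13, 14], ["s0", "s1"])
print(pairs)  # -> [(10, 's0'), (11, 's1'), (12, 's0'), (13, 's1'), (14, 's0')]
```

Distributing host-to-device copies across several CUDA streams this way lets independent transfers overlap rather than serialize on a single stream.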