nvalchemi.data.Dataset#

class nvalchemi.data.Dataset(reader, *, device=None, num_workers=2)[source]#

AtomicData-native dataset that bypasses TensorDict conversion.

Wraps a Reader and returns AtomicData objects directly, with CUDA-stream prefetching support.

Parameters:
  • reader (Reader | ReaderProtocol) – Reader providing raw tensor dicts from a data source.

  • device (str | torch.device | None, default=None) – Target device. "auto" picks CUDA if available, otherwise CPU.

  • num_workers (int, default=2) – Thread pool size for async prefetch.

reader#

The underlying data reader.

Type:

Reader | ReaderProtocol

target_device#

Resolved target device for data transfer.

Type:

torch.device | None

num_workers#

Number of worker threads for prefetching.

Type:

int

Examples

>>> from nvalchemi.data.datapipes.dataset import Dataset
>>> from nvalchemi.data.datapipes.backends.base import Reader
>>> # Assuming a concrete Reader implementation exists:
>>> # reader = MyReader("dataset.zarr")
>>> # ds = Dataset(reader, device="cpu")
>>> # atomic_data, meta = ds[0]
cancel_prefetch(index=None)[source]#

Cancel pending prefetch operations.

Parameters:

index (int | None, default=None) – Specific index to cancel, or None to cancel all.

Return type:

None
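The cancellation behavior can be illustrated with a small sketch, assuming the common pattern of keeping in-flight futures in a dict keyed by sample index (the dict, the stub tasks, and the helper below are hypothetical, not the library's code):

```python
from concurrent.futures import Future, ThreadPoolExecutor
import time

# Hypothetical sketch: in-flight prefetch futures keyed by sample index.
executor = ThreadPoolExecutor(max_workers=2)
pending: dict[int, Future] = {
    i: executor.submit(time.sleep, 0.2) for i in range(4)
}

def cancel_prefetch(index=None):
    """Cancel one pending future, or all of them when index is None."""
    keys = list(pending) if index is None else [index]
    for k in keys:
        fut = pending.pop(k, None)
        if fut is not None:
            fut.cancel()  # has no effect if the task already started running

cancel_prefetch(2)     # only index 2 is dropped
cancel_prefetch(None)  # remaining entries are dropped
executor.shutdown(wait=False)
```

Note that `Future.cancel()` cannot stop work that has already begun executing; cancellation is only guaranteed for tasks still queued.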

close()[source]#

Release resources held by the dataset.

Drains pending prefetch futures, shuts down the thread pool executor, and closes the underlying reader.

Return type:

None
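The teardown order described above (drain, shut down, close the reader) can be sketched as follows; the `_StubReader` class stands in for a real `Reader` and is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, wait

class _StubReader:
    """Stand-in for a real Reader; only tracks whether it was closed."""
    closed = False
    def close(self):
        self.closed = True

executor = ThreadPoolExecutor(max_workers=2)
pending = [executor.submit(pow, 2, 10) for _ in range(3)]
reader = _StubReader()

def close():
    wait(pending)                 # drain: let in-flight prefetches finish
    pending.clear()
    executor.shutdown(wait=True)  # stop the thread pool
    reader.close()                # release the data source last

close()
```

Closing the reader last ensures no worker thread is still reading from the data source when it is released.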

get_metadata(index)[source]#

Return lightweight metadata for a sample without full construction.

Loads the raw tensor dictionary from the reader and extracts shape information for atom and edge counts, avoiding the overhead of full AtomicData construction and validation.

Parameters:

index (int) – Sample index.

Returns:

(num_atoms, num_edges) for the sample.

Return type:

tuple[int, int]

Raises:
  • IndexError – If index is out of range.

  • KeyError – If the sample dict does not contain "atomic_numbers".
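A minimal sketch of the shape-only extraction this method describes, using plain lists in place of tensors. The `"atomic_numbers"` key is documented above; the `"edge_index"` layout (2 × num_edges) is an assumption for illustration:

```python
# Hypothetical raw tensor dict, as a Reader might return it.
raw = {
    "atomic_numbers": [8, 1, 1],                 # 3 atoms (water)
    "edge_index": [[0, 0, 1, 2], [1, 2, 0, 0]],  # 4 directed edges
}

def get_metadata(sample):
    """Return (num_atoms, num_edges) without full object construction."""
    if "atomic_numbers" not in sample:
        raise KeyError("atomic_numbers")
    num_atoms = len(sample["atomic_numbers"])
    num_edges = len(sample["edge_index"][0])
    return num_atoms, num_edges

meta = get_metadata(raw)
```

Because only lengths are read, no validation or full `AtomicData` construction is triggered, which is what makes this path cheap for batch planning.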

prefetch(index, stream=None)[source]#

Submit a sample for async prefetching.

If the sample is already being prefetched, this is a no-op.

Parameters:
  • index (int) – Sample index.

  • stream (torch.cuda.Stream | None, default=None) – CUDA stream for GPU operations.

Return type:

None
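The no-op behavior for duplicate requests can be sketched with a pending-index check before submission; the `load_sample` function and the `submissions` log below are hypothetical stand-ins for the real prefetch work:

```python
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
pending: dict[int, Future] = {}
submissions = []  # records actual work submissions, for illustration

def load_sample(index):
    submissions.append(index)
    return index * 10  # stand-in for reading and transferring a sample

def prefetch(index, stream=None):
    if index in pending:  # already being prefetched: no-op
        return
    pending[index] = executor.submit(load_sample, index)

prefetch(3)
prefetch(3)  # duplicate request is ignored
result = pending[3].result()
executor.shutdown()
```

The guard makes repeated prefetch calls from a dataloader loop safe: each sample is read and transferred at most once per in-flight window.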

prefetch_batch(indices, streams=None)[source]#

Prefetch multiple samples asynchronously.

Parameters:
  • indices (Sequence[int]) – Sample indices to prefetch.

  • streams (Sequence[torch.cuda.Stream] | None, default=None) – CUDA streams to distribute across. Streams are assigned round-robin to the indices.

Return type:

None
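The round-robin stream assignment described above can be sketched as follows; strings stand in for `torch.cuda.Stream` objects, and the helper name is illustrative:

```python
def assign_streams(indices, streams=None):
    """Pair each index with a stream, cycling through the stream list."""
    if not streams:
        return {idx: None for idx in indices}
    return {idx: streams[pos % len(streams)]
            for pos, idx in enumerate(indices)}

# Two streams distributed over five sample indices.
plan = assign_streams([10, 11, 12, 13, 14], streams=["s0", "s1"])
```

Distributing prefetches across multiple CUDA streams lets host-to-device copies for different samples overlap instead of serializing on a single stream.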