Data Loading Pipeline#
The toolkit ships a data loading pipeline designed for GPU-accelerated atomistic
workloads. It is built from four composable pieces: a Reader that pulls raw
tensors from storage, a Dataset that validates them into
nvalchemi.data.AtomicData objects, a DataLoader that batches them
into nvalchemi.data.Batch objects, and an optional Sampler that
controls batching strategy. Each layer adds exactly one concern, and you can swap
any of them independently.
Note
The datapipes abstraction is shared with physicsnemo: nvalchemi carries some
specializations for CSR-type data, but the implementations will be merged in
the near term.
Reader: raw tensor I/O#
A Reader is the simplest
abstraction in the pipeline. It knows how to load a single sample from storage and
return it as a plain dict[str, torch.Tensor] — no validation, no device
transfers, no threading. Readers are intentionally minimal so that adding a new
storage backend only requires implementing two methods:
- _load_sample(index) -> dict[str, torch.Tensor]: Reads one sample into CPU tensors.
- __len__() -> int: Returns the total number of available samples.
The built-in reader is
AtomicDataZarrReader, which
reads from the structured Zarr stores produced by the toolkit’s
AtomicDataZarrWriter. The Zarr
layout uses separate groups for core fields, metadata, and custom attributes, and
supports soft-deletes via a validity mask.
If your data lives in a different format (HDF5, LMDB, a collection of files), you
can subclass Reader and implement _load_sample and __len__. Everything
downstream — Dataset, DataLoader, Sampler — will work without changes.
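The two-method contract is small enough to sketch end to end. The class below is a toy stand-in, not the toolkit's base class: plain Python lists stand in for torch tensors so the example is self-contained, and all names are illustrative.

```python
from typing import Dict, List

class InMemoryReader:
    """Toy Reader over pre-built sample dicts; a real subclass would pull
    CPU tensors from HDF5, LMDB, or files in _load_sample instead."""

    def __init__(self, samples: List[Dict[str, list]]):
        self._samples = samples

    def _load_sample(self, index: int) -> Dict[str, list]:
        # Reads exactly one sample; no validation, transfers, or threading.
        return self._samples[index]

    def __len__(self) -> int:
        # Total number of available samples.
        return len(self._samples)

samples = [
    {"positions": [[0.0, 0.0, 0.0]], "atomic_numbers": [1]},
    {"positions": [[0.0, 0.0, 0.0], [0.7, 0.0, 0.0]], "atomic_numbers": [1, 1]},
]
reader = InMemoryReader(samples)
print(len(reader))                                # 2
print(reader._load_sample(1)["atomic_numbers"])   # [1, 1]
```

Because everything downstream talks only to these two methods, a reader like this could be handed to the Dataset unchanged.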
Dataset: validation and prefetching#
Dataset wraps a Reader and adds two
responsibilities:
- Validation: Raw dictionaries are validated into nvalchemi.data.AtomicData objects, catching schema issues early.
- Async prefetching: A background ThreadPoolExecutor loads and transfers samples to the target device ahead of time, so the GPU is never starved.
```python
from nvalchemi.data.datapipes.dataset import Dataset
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrReader

reader = AtomicDataZarrReader("/path/to/store.zarr")
dataset = Dataset(reader=reader, device="cuda:0", num_workers=4)

# Fetch a single sample (AtomicData on GPU)
data, metadata = dataset[0]
```
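The prefetching mechanic itself is ordinary concurrent I/O. A simplified, torch-free sketch of the idea follows; the class and helper names here are illustrative, not the toolkit's API, and a `sleep` stands in for storage reads and device transfers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_load(index: int) -> dict:
    # Stands in for Reader._load_sample plus a host-to-device transfer.
    time.sleep(0.01)
    return {"index": index}

class PrefetchingDataset:
    """Submits loads to a thread pool so results are ready when requested."""

    def __init__(self, num_samples: int, num_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=num_workers)
        self._futures = {}
        self._num_samples = num_samples

    def __len__(self) -> int:
        return self._num_samples

    def prefetch(self, index: int) -> None:
        # Kick off the load in the background; __getitem__ collects it later.
        if index not in self._futures:
            self._futures[index] = self._pool.submit(slow_load, index)

    def __getitem__(self, index: int) -> dict:
        self.prefetch(index)          # no-op if already in flight
        return self._futures.pop(index).result()

dataset = PrefetchingDataset(num_samples=8)
for i in range(1, 8):
    dataset.prefetch(i)               # warm the pipeline ahead of time
print(dataset[0]["index"])            # 0
```

By the time `dataset[0]` returns, samples 1 through 7 are already loading on worker threads, which is the same overlap the real Dataset exploits.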
CUDA stream prefetching#
When called by a DataLoader, the Dataset uses
prefetch() to overlap
host-to-device data transfers with compute. The DataLoader issues prefetch calls on
non-default CUDA streams; the Dataset records the transfer and synchronises the
stream before returning the data. This means the next batch is already on the GPU
while the model is processing the current one.
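Stripped of CUDA specifics, the overlap pattern is a bounded producer/consumer pipeline: transfers fill a small queue while compute drains it. The sketch below mimics that shape with threads; every name is illustrative, `sleep` calls stand in for copies and kernels, and the queue depth plays the role of prefetch_factor.

```python
import threading
import queue
import time

def run_pipeline(num_batches: int, depth: int = 2) -> list:
    """Overlap 'transfers' with 'compute' via a bounded queue, mimicking
    how prefetch streams keep the next batch ready during the current step."""
    ready = queue.Queue(maxsize=depth)   # like prefetch_factor

    def transfer_worker():
        for i in range(num_batches):
            time.sleep(0.005)            # stands in for a host-to-device copy
            ready.put(i)                 # "stream synchronised": batch ready
        ready.put(None)                  # sentinel: no more batches

    threading.Thread(target=transfer_worker, daemon=True).start()

    processed = []
    while (batch := ready.get()) is not None:
        time.sleep(0.005)                # stands in for the model's compute
        processed.append(batch)          # transfer of batch+1 ran meanwhile
    return processed

print(run_pipeline(4))  # [0, 1, 2, 3]
```

In the real pipeline the "worker" is a non-default CUDA stream and "ready" is the stream synchronisation point, but the steady-state behaviour is the same: compute on batch N overlaps the transfer of batch N+1.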
Lightweight metadata access#
Samplers often need to know sample sizes (how many atoms? how many edges?) before
deciding which samples to group into a batch.
get_metadata() returns
(num_atoms, num_edges) for a given index without constructing the full
AtomicData, keeping the overhead low.
DataLoader: batching and iteration#
DataLoader ties the pipeline
together. It requests indices from a Sampler, fetches AtomicData objects from
the Dataset, and collates them into nvalchemi.data.Batch objects for
consumption by the model.
```python
from nvalchemi.data.datapipes.dataloader import DataLoader

loader = DataLoader(
    dataset=dataset,
    batch_size=32,
    prefetch_factor=2,
    num_streams=2,
)

for batch in loader:
    # batch is a Batch on the dataset's target device
    outputs = model(batch)
```
Key parameters:
| Parameter | Purpose |
|---|---|
| `batch_size` | Number of graphs per batch |
| `prefetch_factor` | How many batches to load ahead of the current one |
| `num_streams` | Number of CUDA streams used for overlapping transfers |
| `sampler` | Controls index ordering (defaults to sequential or random) |
Unlike PyTorch’s torch.utils.data.DataLoader, this implementation returns
nvalchemi.data.Batch objects (disjoint graphs with proper node-index
offsets) rather than generic collated tensors.
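The node-index offsetting is the essential difference from generic collation. A torch-free sketch of merging two edge lists into one disjoint graph (illustrative names, not the toolkit's collate function):

```python
def collate_graphs(graphs):
    """Merge per-graph edge lists into one disjoint graph by shifting each
    graph's node indices past the nodes that came before it."""
    all_edges, batch_index = [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        for src, dst in g["edges"]:
            all_edges.append((src + offset, dst + offset))
        batch_index.extend([graph_id] * g["num_atoms"])  # node -> graph map
        offset += g["num_atoms"]
    return {"edges": all_edges, "batch": batch_index, "num_atoms": offset}

water = {"num_atoms": 3, "edges": [(0, 1), (0, 2)]}
h2 = {"num_atoms": 2, "edges": [(0, 1)]}
batch = collate_graphs([water, h2])
print(batch["edges"])  # [(0, 1), (0, 2), (3, 4)]
print(batch["batch"])  # [0, 0, 0, 1, 1]
```

The second graph's edge (0, 1) becomes (3, 4) because three nodes precede it, so message passing over the combined edge list never mixes atoms from different systems.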
SizeAwareSampler: memory-safe batching#
For datasets where systems vary widely in size — a common situation in atomistic
ML — a fixed batch_size can either waste GPU memory (when graphs are small) or
cause out-of-memory errors (when a few large graphs land in the same batch).
SizeAwareSampler solves this with
capacity-aware bin-packing:
```python
from nvalchemi.dynamics.sampler import SizeAwareSampler

sampler = SizeAwareSampler(
    dataset=dataset,
    max_atoms=4096,
    max_edges=32768,
    max_batch_size=64,
)
```
Instead of grouping a fixed count of graphs, the sampler fills each batch until
adding the next sample would exceed one of the capacity constraints (max_atoms,
max_edges, or max_batch_size). Internally it uses bin-packing by atom count: it
sorts samples into bins of similar size, then draws from bins in a way that
maximises GPU utilisation while respecting the limits.
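The core capacity logic can be sketched with a greedy simplification: fill a batch until the next sample would break a limit, then start a new one. This omits the sampler's sorting into size bins and edge-count constraint; all names and numbers are illustrative.

```python
def size_aware_batches(sizes, max_atoms, max_batch_size):
    """Greedy capacity-constrained batching over per-sample atom counts.
    A simplification of the bin-packing described above: no size bins,
    and a sample larger than max_atoms still gets a batch of its own."""
    batches, current, current_atoms = [], [], 0
    for index, num_atoms in enumerate(sizes):
        too_full = (current_atoms + num_atoms > max_atoms
                    or len(current) >= max_batch_size)
        if current and too_full:
            batches.append(current)
            current, current_atoms = [], 0
        current.append(index)
        current_atoms += num_atoms
    if current:
        batches.append(current)
    return batches

sizes = [100, 50, 900, 30, 30, 500]
print(size_aware_batches(sizes, max_atoms=1000, max_batch_size=4))
# [[0, 1], [2, 3, 4], [5]]
```

The 900-atom sample forces a batch boundary after [0, 1], and the 500-atom sample forces another: batch sizes vary, but every batch stays under 1000 atoms.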
GPU memory heuristic#
If you omit max_atoms, the sampler can estimate a safe limit from the GPU’s
available memory fraction. This is useful for workloads where the optimal batch size
depends on the hardware.
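The exact formula is internal to the sampler, but one plausible shape of such a heuristic is free memory, discounted by a safety margin, divided by a per-atom cost. Every constant below is an assumption for illustration, not the toolkit's value.

```python
def estimate_max_atoms(free_bytes: int,
                       bytes_per_atom: int = 50_000,
                       safety_fraction: float = 0.8) -> int:
    """Derive an atom budget from free GPU memory. bytes_per_atom bundles
    per-atom activation, gradient, and neighbour-list costs for some model;
    both default constants are illustrative, not the toolkit's values."""
    return int(free_bytes * safety_fraction) // bytes_per_atom

# With 8 GiB free and these illustrative constants:
print(estimate_max_atoms(8 * 1024**3))
```

In practice the free-byte figure could come from `torch.cuda.mem_get_info()`, which returns (free, total) bytes for the current device.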
Inflight replacement#
In dynamics pipelines, systems converge and leave the batch at different times.
request_replacement() finds a
new sample whose size fits the memory slot left by a graduated system, keeping the
batch full without reallocation.
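The selection step can be sketched as a first-fit search over waiting samples, where "fit" means the candidate's atom and edge counts are within the capacity the departing system freed. The function name mirrors the API above, but the logic is illustrative, not the toolkit's implementation.

```python
def request_replacement(pool, freed_atoms, freed_edges):
    """Pick the first queued sample that fits the capacity freed by a
    converged system; illustrative first-fit logic over (atoms, edges)."""
    for i, (num_atoms, num_edges) in enumerate(pool):
        if num_atoms <= freed_atoms and num_edges <= freed_edges:
            return pool.pop(i)   # remove from the waiting pool and insert
    return None                  # nothing fits; run the batch one slot short

waiting = [(600, 4000), (80, 500), (120, 900)]
# A 100-atom / 700-edge system just converged and left the batch:
print(request_replacement(waiting, freed_atoms=100, freed_edges=700))  # (80, 500)
print(len(waiting))  # 2
```

The 600-atom system is skipped because it exceeds the freed slot, so the 80-atom system takes its place and the batch continues without a reallocation.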
Putting it all together#
A typical end-to-end setup:
```python
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrReader
from nvalchemi.data.datapipes.dataset import Dataset
from nvalchemi.data.datapipes.dataloader import DataLoader
from nvalchemi.dynamics.sampler import SizeAwareSampler

reader = AtomicDataZarrReader("/path/to/store.zarr")
dataset = Dataset(reader=reader, device="cuda:0", num_workers=4)
sampler = SizeAwareSampler(dataset=dataset, max_atoms=4096)

loader = DataLoader(
    dataset=dataset,
    batch_size=64,  # upper bound; sampler may produce smaller batches
    sampler=sampler,
    prefetch_factor=2,
    num_streams=2,
)

for batch in loader:
    # batch.num_atoms <= 4096 guaranteed
    outputs = model(batch)
```
See also#
- Storage guide: See AtomicDataZarrWriter and AtomicDataZarrReader for writing and reading Zarr stores.
- API: nvalchemi.data for the full datapipe API reference.
- Dynamics: The Dynamics guide shows how the DataLoader and SizeAwareSampler integrate with simulation pipelines.