AtomicData and Batch#
The ALCHEMI Toolkit represents molecular systems as graphs: atoms are nodes, and
optional edges (e.g. bonds or radius-cutoff neighbors) connect them. The
nvalchemi.data.AtomicData class holds a single graph (one molecule or
structure), and nvalchemi.data.Batch batches many such graphs into one
structure for efficient GPU-friendly training and inference.
AtomicData: a single graph#
nvalchemi.data.AtomicData is a Pydantic model that stores:
- Required: positions (shape [n_nodes, 3]) and atomic_numbers (shape [n_nodes]).
- Optional node-level: e.g. atomic_masses, forces, velocities, node_attrs.
- Optional edge-level: edge_index (shape [2, n_edges]) and edge attributes such as shifts for periodicity.
- Optional system-level: energies, cell, pbc, stresses, virials, etc.
All tensor fields use PyTorch tensors, so you can move them to GPU with .to(device) or
use the mixin method nvalchemi.data.data.DataMixin.to() for device/dtype changes.
Example:
import torch
from nvalchemi.data import AtomicData
positions = torch.randn(5, 3)
atomic_numbers = torch.tensor([1, 6, 6, 1, 8], dtype=torch.long)
data = AtomicData(positions=positions, atomic_numbers=atomic_numbers)
# Optional: add system-level labels
data = AtomicData(
    positions=positions,
    atomic_numbers=atomic_numbers,
    energies=torch.tensor([[0.0]]),
)
Properties such as num_nodes, num_edges, and device are available; optional
fields default to None when not provided.
Batch: multiple graphs#
nvalchemi.data.Batch is built from a list of nvalchemi.data.AtomicData
instances. Node tensors are concatenated along the first dimension; edge tensors are
concatenated with node-index offsets so each graph’s edges refer to the correct atoms.
System-level tensors are stacked so that the first dimension is the number of graphs.
- Build a batch: nvalchemi.data.batch.Batch.from_data_list(data_list).
- Access batch size: num_graphs, num_nodes, num_edges, num_nodes_list, num_edges_list.
- Recover a single graph: nvalchemi.data.batch.Batch.get_data(index).
- Recover all graphs: nvalchemi.data.batch.Batch.to_data_list().
Example:
import torch
from nvalchemi.data import AtomicData, Batch
data_list = [
    AtomicData(
        positions=torch.randn(2, 3),
        atomic_numbers=torch.ones(2, dtype=torch.long),
        energies=torch.zeros(1, 1),
    ),
    AtomicData(
        positions=torch.randn(3, 3),
        atomic_numbers=torch.ones(3, dtype=torch.long),
        energies=torch.zeros(1, 1),
    ),
]
batch = Batch.from_data_list(data_list)
print(batch.num_graphs, batch.num_nodes, batch.num_nodes_list) # 2, 5, [2, 3]
first = batch.get_data(0)
again = batch.to_data_list()
Indexing and selection#
Batch supports bracket indexing that mirrors familiar Python and PyTorch
conventions. The type of index determines what you get back:
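| Index type | Returns | Example |
|---|---|---|
| str | The raw tensor attribute by name | batch["positions"] |
| int | A single AtomicData | |
| slice | A new Batch | |
| Tensor or list | A new Batch | batch[torch.tensor([0, 2])] |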
When selecting multiple graphs (slice, tensor, or list), the underlying
index_select() method operates directly on the
concatenated storage — it slices segments and adjusts edge_index offsets without
reconstructing individual AtomicData objects, so it is efficient even for large
batches.
# Select a sub-batch of graphs 0 and 2
sub = batch[torch.tensor([0, 2])]
print(sub.num_graphs) # 2
# String indexing accesses the raw concatenated tensor
all_positions = batch["positions"] # shape (total_nodes, 3)
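As a rough picture of what selecting a sub-batch involves, the sketch below works on plain Python lists rather than tensors. The function select_graphs and its arguments are hypothetical; they only illustrate slicing segments and re-basing edge indices, and the toolkit's index_select() may differ in detail. Edges are written as (src, dst) pairs here for readability, whereas the toolkit stores edge_index with shape [2, n_edges].

```python
def select_graphs(positions, edge_index, node_ptr, edge_ptr, indices):
    """Pick graphs out of flat storage without rebuilding per-graph objects.

    node_ptr / edge_ptr hold the start offset of each graph's nodes / edges,
    with one extra trailing entry for the total count.
    """
    new_positions, new_edges = [], []
    for i in indices:
        old_start = node_ptr[i]
        new_start = len(new_positions)
        new_positions.extend(positions[old_start:node_ptr[i + 1]])
        # Re-base edges: remove the old node offset, apply the new one.
        shift = new_start - old_start
        new_edges.extend(
            (src + shift, dst + shift)
            for src, dst in edge_index[edge_ptr[i]:edge_ptr[i + 1]]
        )
    return new_positions, new_edges


# Three graphs with 2, 3, and 2 nodes; one edge each (flat node indices).
positions = ["a0", "a1", "b0", "b1", "b2", "c0", "c1"]
edge_index = [(0, 1), (2, 4), (5, 6)]
node_ptr = [0, 2, 5, 7]
edge_ptr = [0, 1, 2, 3]

# Keep graphs 0 and 2; the edge of graph 2 is shifted to the new node range.
sub_positions, sub_edges = select_graphs(
    positions, edge_index, node_ptr, edge_ptr, [0, 2]
)
```

The point of the sketch is that only offsets change: node and edge rows are copied as contiguous segments, and no per-graph object is created along the way.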
Adding keys to a batch#
You can add new tensor keys (e.g. model outputs or extra labels) at node, edge, or
system level with nvalchemi.data.batch.Batch.add_key(). The new key is then
available on the underlying storage and when you call nvalchemi.data.batch.Batch.get_data()
or nvalchemi.data.batch.Batch.to_data_list(), so each nvalchemi.data.AtomicData
gets the correct slice.
batch.add_key("node_feat", [torch.randn(2, 4), torch.randn(3, 4)], level="node")
batch.add_key(
    "energies",
    [torch.tensor([[0.1]]), torch.tensor([[0.2]])],
    level="system",
    overwrite=True,
)
list_of_data = batch.to_data_list()
# list_of_data[i] now has "node_feat" and "energies" with the right shapes.
Device and serialization#
- Device: Use nvalchemi.data.batch.Batch.to(device) or the mixin nvalchemi.data.data.DataMixin.to() on nvalchemi.data.AtomicData. The batch implementation delegates to the underlying storage for efficiency.
- Serialization: nvalchemi.data.AtomicData supports Pydantic serialization (e.g. model_dump, model_dump_json). Tensor fields are serialized to lists in JSON mode.
How Batch stores data internally#
When you call nvalchemi.data.batch.Batch.from_data_list(), the resulting
Batch does not simply stack all tensors along a new “batch” axis. Different kinds
of data need different layouts, and the toolkit uses a storage model that reflects
this.
Every tensor attribute belongs to one of three levels:
| Level | Storage class | Shape convention | Examples |
|---|---|---|---|
| system | Uniform | First dim = number of graphs | cell, energies |
| atoms | Segmented | Concatenated across graphs | positions |
| edges | Segmented | Concatenated across graphs | edge_index |
Uniform storage is straightforward: every graph contributes exactly one row, so
the i-th graph’s data is always at index i. System-level properties like the
simulation cell or total energy work this way.
Segmented storage is designed for variable-length data. Positions, for example,
are concatenated into a single tensor of shape (total_nodes, 3). To know where each
graph’s atoms start and end, the storage tracks segment_lengths and a pointer array
batch_ptr. The i-th graph’s nodes live at positions[batch_ptr[i]:batch_ptr[i+1]].
Edge data works the same way, with node-index offsets automatically applied to
edge_index so that each graph’s edges still point to the correct atoms in the
flattened array.
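The segmented layout described above can be sketched with plain Python lists. This is an illustration of the idea only, not the toolkit's implementation: the function name concat_graphs and the dictionary keys are hypothetical, real batches hold PyTorch tensors, and edges are written as (src, dst) pairs rather than a [2, n_edges] tensor.

```python
from itertools import accumulate


def concat_graphs(graphs):
    """Concatenate per-graph node and edge data into flat, segmented lists.

    Each graph is a dict with "positions" (a list of [x, y, z]) and
    "edge_index" (a list of (src, dst) pairs using graph-local node indices).
    """
    segment_lengths = [len(g["positions"]) for g in graphs]
    # batch_ptr[i] is the offset of graph i's first node in the flat list.
    batch_ptr = [0, *accumulate(segment_lengths)]

    positions, edge_index = [], []
    for offset, g in zip(batch_ptr, graphs):
        positions.extend(g["positions"])
        # Shift local node indices so edges point into the flat node list.
        edge_index.extend(
            (src + offset, dst + offset) for src, dst in g["edge_index"]
        )
    return positions, edge_index, segment_lengths, batch_ptr


graphs = [
    {"positions": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], "edge_index": [(0, 1)]},
    {
        "positions": [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        "edge_index": [(0, 1), (1, 2)],
    },
]
positions, edge_index, segment_lengths, batch_ptr = concat_graphs(graphs)

# Nodes of graph 1 are recovered by slicing with the pointer array.
graph1_nodes = positions[batch_ptr[1]:batch_ptr[2]]
```

Here batch_ptr is simply the running sum of segment_lengths with a leading zero, which is why a single slice is enough to recover any graph's nodes.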
The mapping from attribute name to level is determined by a
DEFAULT_ATTRIBUTE_MAP. When you add a new key with
add_key(), you explicitly specify the level
("node", "edge", or "system") so the batch knows how to slice it back out when
you call get_data().
Pre-allocated batches and the buffer API#
For training and data loading, from_data_list creates a batch that fits its data
exactly. In high-throughput dynamics simulations, however, you often need a
fixed-capacity buffer that you can fill and drain without reallocating memory. The
dynamics pipeline uses such buffers to pass data samples point to point between
stages, which bypasses the need for host and/or file I/O.
Creating an empty buffer#
nvalchemi.data.batch.Batch.empty() allocates a batch with room for a
specified number of systems, nodes, and edges, but with zero graphs initially.
It requires a template (AtomicData or
Batch) that defines which keys to allocate and their
schema:
template = AtomicData(
    positions=torch.zeros(1, 3),
    atomic_numbers=torch.zeros(1, dtype=torch.long),
    forces=torch.zeros(1, 3),
    energies=torch.zeros(1, 1),
    cell=torch.zeros(1, 3, 3),
    pbc=torch.zeros(1, 3, dtype=torch.bool),
)
buffer = Batch.empty(
    num_systems=64,
    num_nodes=4096,
    num_edges=32768,
    template=template,
    device="cuda",
)
All tensors are pre-allocated at the given capacity. The batch’s num_graphs starts
at zero.
Filling the buffer with put#
nvalchemi.data.batch.Batch.put() copies selected graphs from a source batch
into the buffer. A boolean mask selects which graphs to copy:
# Copy the first two graphs from incoming_batch into buffer
mask = torch.tensor([True, True, False, False])
buffer.put(incoming_batch, mask)
The method performs capacity checks to make sure the incoming segments fit, and uses optimized kernels for the data movement.
Compacting with defrag#
After graphs have been consumed (e.g. copied out to another stage), you remove them
with nvalchemi.data.batch.Batch.defrag(). This compacts the remaining graphs
to the front of the buffer so that freed capacity is available again:
# Mark which graphs have been copied out
copied_mask = torch.tensor([True, False, True])
buffer.defrag(copied_mask=copied_mask)
Resetting with zero#
nvalchemi.data.batch.Batch.zero() resets the batch to zero graphs while
keeping the allocated memory in place — useful at the start of a new epoch or
pipeline iteration.
These operations (empty / put / defrag / zero) form the backbone of the
dynamics pipeline’s inflight batching, where systems enter and leave a running
simulation at different times.
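To make the empty / put / defrag / zero life cycle concrete, here is a toy fixed-capacity buffer written with plain Python lists. It is a sketch under simplifying assumptions (node data only, no edges, no tensors or GPU kernels), and the class ToyBuffer is hypothetical rather than part of the toolkit; only the method names and the mask semantics follow the text above.

```python
class ToyBuffer:
    """Fixed-capacity store of variable-length graphs (node data only)."""

    def __init__(self, num_systems, num_nodes):
        self.max_systems = num_systems
        self.nodes = [None] * num_nodes  # pre-allocated node slots
        self.lengths = []                # number of nodes per stored graph

    @property
    def num_graphs(self):
        return len(self.lengths)

    @property
    def used_nodes(self):
        return sum(self.lengths)

    def put(self, graphs, mask):
        """Copy the graphs selected by mask into the free tail of the buffer."""
        selected = [g for g, keep in zip(graphs, mask) if keep]
        needed = sum(len(g) for g in selected)
        # Capacity checks: both the graph count and the node count must fit.
        if self.num_graphs + len(selected) > self.max_systems:
            raise ValueError("not enough free system slots")
        if self.used_nodes + needed > len(self.nodes):
            raise ValueError("not enough free node slots")
        for g in selected:
            start = self.used_nodes
            self.nodes[start:start + len(g)] = g
            self.lengths.append(len(g))

    def defrag(self, copied_mask):
        """Drop graphs flagged in copied_mask; compact the rest to the front."""
        kept_nodes, kept_lengths, start = [], [], 0
        for length, copied in zip(self.lengths, copied_mask):
            if not copied:
                kept_nodes.extend(self.nodes[start:start + length])
                kept_lengths.append(length)
            start += length
        self.nodes[:len(kept_nodes)] = kept_nodes
        self.lengths = kept_lengths

    def zero(self):
        """Forget all graphs; the allocated slots stay in place."""
        self.lengths = []


buf = ToyBuffer(num_systems=4, num_nodes=8)
incoming = [["h", "h"], ["c", "o", "o"], ["n"], ["x"] * 9]
buf.put(incoming, [True, True, True, False])  # stores 3 graphs, 6 nodes
buf.defrag([True, False, True])               # only the middle graph remains
```

Note that neither defrag nor zero changes the size of the underlying list; they only update the bookkeeping, which is what lets a buffer of this kind be reused without reallocation.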
ASE Atoms interoperability#
The Atomic Simulation Environment (ASE) is one of the most widely used Python libraries for representing and manipulating atomistic systems. The toolkit provides a conversion path for moving data between ASE and ALCHEMI.
Converting ASE Atoms to AtomicData#
nvalchemi.data.AtomicData.from_atoms() accepts an ase.Atoms object and
returns an nvalchemi.data.AtomicData:
from ase.build import molecule
from nvalchemi.data import AtomicData
atoms = molecule("H2O")
data = AtomicData.from_atoms(atoms, device="cpu")
The conversion maps ASE fields to ALCHEMI fields:
| ASE source | AtomicData field | Notes |
|---|---|---|
| | | Reshaped to |
| | | Reshaped to |
| | | Default key: |
| | | Default key: |
| | | Voigt vector converted to |
| | | Voigt vector converted to |
| | | 0 = GAS, 1 = SURFACE, 2+ = BULK |
| | preserved | Filtered to tensors, arrays, and floats |
Keyword arguments (energy_key, forces_key, etc.) let you adapt to different
naming conventions in your ASE dataset.
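The Voigt conversion mentioned in the table above can be illustrated in a few lines. ASE reports stress as a six-component vector in the order xx, yy, zz, yz, xz, xy; the sketch below expands such a vector into a symmetric 3×3 matrix using nested lists. The helper name voigt_to_matrix is hypothetical, and the toolkit's own conversion (including any leading system dimension on the result) may differ.

```python
def voigt_to_matrix(v):
    """Expand a Voigt 6-vector (xx, yy, zz, yz, xz, xy) to a symmetric 3x3."""
    xx, yy, zz, yz, xz, xy = v
    return [
        [xx, xy, xz],
        [xy, yy, yz],
        [xz, yz, zz],
    ]


stress = voigt_to_matrix([1.0, 2.0, 3.0, 0.4, 0.5, 0.6])
```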
Building a Batch from a list of Atoms#
There is no special bulk constructor — compose the two operations:
from ase.build import molecule
from nvalchemi.data import AtomicData, Batch
atoms_list = [molecule("H2O"), molecule("CH4")]
batch = Batch.from_data_list([AtomicData.from_atoms(a) for a in atoms_list])
Converting back to ASE Atoms#
The core library does not provide a to_atoms method, since the reverse mapping is
application-specific (e.g. which info keys to preserve, how to handle missing
fields). The examples directory includes a utility function that demonstrates the
reconstruction:
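# From examples/04_ase_dynamics_example.py
from ase import Atoms
from nvalchemi.data import AtomicData

def data_to_atoms(data: AtomicData) -> Atoms:
    # Move tensors to the CPU and convert to NumPy before handing them to ASE.
    return Atoms(
        numbers=data.atomic_numbers.cpu().numpy(),
        positions=data.positions.cpu().numpy(),
        # cell and pbc carry a leading system dimension; squeeze(0) drops it.
        cell=data.cell.squeeze(0).cpu().numpy() if data.cell is not None else None,
        pbc=data.pbc.squeeze(0).cpu().numpy() if data.pbc is not None else None,
    )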
Tip
To convert a Batch to ase.Atoms, first split it into AtomicData objects with
Batch.to_data_list, then convert each AtomicData entry individually.
See also#
- Examples: The gallery includes AtomicData and Batch: Graph-structured molecular data (01_data_example.py) for a runnable script.
- API: nvalchemi.data for the full API of AtomicData, Batch, and the zarr-based reader/writer and dataloader.