AtomicData and Batch#
The ALCHEMI Toolkit represents molecular systems as graphs: atoms are nodes, and
optional edges (e.g. bonds or radius-cutoff neighbors) connect them. The
nvalchemi.data.AtomicData class holds a single graph (one molecule or
structure), and nvalchemi.data.Batch batches many such graphs into one
structure for efficient GPU-friendly training and inference.
Tip
AI coding assistant? Load the nvalchemi-data-structures
agent skill for concise instructions on creating,
manipulating, and batching AtomicData objects.
AtomicData: a single graph#
nvalchemi.data.AtomicData is a Pydantic model that stores:
- Required: positions (shape [n_nodes, 3]) and atomic_numbers (shape [n_nodes]).
- Optional node-level: e.g. atomic_masses, forces, velocities, node_attrs.
- Optional edge-level: neighbor_list (shape [n_edges, 2]) and edge attributes such as shifts (Cartesian displacements) and neighbor_list_shifts (integer lattice indices) for periodicity.
- Optional system-level: energy, cell, pbc, stress, virial, etc.
All tensor fields use PyTorch tensors, so you can move them to GPU with .to(device) or
use the mixin method nvalchemi.data.data.DataMixin.to() for device/dtype changes.
Example:
import torch
from nvalchemi.data import AtomicData
positions = torch.randn(5, 3)
atomic_numbers = torch.tensor([1, 6, 6, 1, 8], dtype=torch.long)
data = AtomicData(positions=positions, atomic_numbers=atomic_numbers)
# Optional: add system-level labels
data = AtomicData(
    positions=positions,
    atomic_numbers=atomic_numbers,
    energy=torch.tensor([[0.0]]),
)
Properties such as num_nodes, num_edges, and device are available; optional
fields default to None when not provided.
Batch: multiple graphs#
nvalchemi.data.Batch is built from a list of nvalchemi.data.AtomicData
instances. Node tensors are concatenated along the first dimension; edge tensors are
concatenated with node-index offsets so each graph’s edges refer to the correct atoms.
System-level tensors are stacked so that the first dimension is the number of graphs.
- Build a batch: nvalchemi.data.batch.Batch.from_data_list(data_list).
- Access batch size: num_graphs, num_nodes, num_edges, num_nodes_list, num_edges_list.
- Recover a single graph: nvalchemi.data.batch.Batch.get_data(index).
- Recover all graphs: nvalchemi.data.batch.Batch.to_data_list().
Example:
import torch
from nvalchemi.data import AtomicData, Batch
data_list = [
    AtomicData(
        positions=torch.randn(2, 3),
        atomic_numbers=torch.ones(2, dtype=torch.long),
        energy=torch.zeros(1, 1),
    ),
    AtomicData(
        positions=torch.randn(3, 3),
        atomic_numbers=torch.ones(3, dtype=torch.long),
        energy=torch.zeros(1, 1),
    ),
]
batch = Batch.from_data_list(data_list)
print(batch.num_graphs, batch.num_nodes, batch.num_nodes_list) # 2, 5, [2, 3]
first = batch.get_data(0)
again = batch.to_data_list()
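The concatenation and offset rules described above can be illustrated without the library. The following is a minimal plain-Python sketch, not the toolkit's implementation: nested lists stand in for tensors, and the `collate` helper and the dictionary keys are hypothetical names chosen for illustration.

```python
# Plain-Python sketch of batching; nested lists stand in for tensors.
# `collate` is a hypothetical helper, not part of nvalchemi.

def collate(graphs):
    """Concatenate node and edge data; keep one row per graph for system data."""
    positions, neighbor_list, energy = [], [], []
    num_nodes_list, num_edges_list = [], []
    offset = 0  # number of nodes already placed in the flat node array
    for g in graphs:
        n = len(g["positions"])
        positions.extend(g["positions"])
        # Shift graph-local indices so they point into the flat node array.
        neighbor_list.extend([i + offset, j + offset] for i, j in g["neighbor_list"])
        energy.append([g["energy"]])  # system level: first dim = number of graphs
        num_nodes_list.append(n)
        num_edges_list.append(len(g["neighbor_list"]))
        offset += n
    return {
        "positions": positions,
        "neighbor_list": neighbor_list,
        "energy": energy,
        "num_nodes_list": num_nodes_list,
        "num_edges_list": num_edges_list,
    }

graphs = [
    {"positions": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
     "neighbor_list": [[0, 1], [1, 0]], "energy": -1.0},
    {"positions": [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
     "neighbor_list": [[0, 1], [1, 2]], "energy": -2.0},
]
batch_sketch = collate(graphs)
```

The second graph's edges are stored as `[2, 3]` and `[3, 4]` because its atoms start at row 2 of the flat position array.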
Indexing and selection#
Batch supports bracket indexing that mirrors familiar Python and PyTorch
conventions. The type of index determines what you get back:
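| Index type | Returns | Example |
|---|---|---|
| str | The raw tensor attribute by name | batch["positions"] |
| int | A single AtomicData | batch[0] |
| slice | A new Batch | batch[0:2] |
| Tensor or list | A new Batch | batch[torch.tensor([0, 2])] |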
When selecting multiple graphs (slice, tensor, or list), the underlying
index_select() method operates directly on the
concatenated storage — it slices segments and adjusts neighbor_list offsets without
reconstructing individual AtomicData objects, so it is efficient even for large
batches.
# Select a sub-batch of graphs 0 and 2
sub = batch[torch.tensor([0, 2])]
print(sub.num_graphs) # 2
# String indexing accesses the raw concatenated tensor
all_positions = batch["positions"] # shape (total_nodes, 3)
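To make the idea of selecting graphs directly on concatenated storage concrete, here is a rough plain-Python sketch. It is an assumption about the general technique (slice each segment, then re-base the edge indices), not the library's actual `index_select()` code; `select_graphs` and the dictionary layout are hypothetical.

```python
from itertools import accumulate

def select_graphs(batch, indices):
    """Build a sub-batch from flat storage without rebuilding per-graph objects."""
    node_starts = list(accumulate(batch["num_nodes_list"], initial=0))
    edge_starts = list(accumulate(batch["num_edges_list"], initial=0))
    out = {"positions": [], "neighbor_list": [],
           "num_nodes_list": [], "num_edges_list": []}
    new_start = 0  # where the next selected graph begins in the output
    for g in indices:
        n0, n1 = node_starts[g], node_starts[g + 1]
        e0, e1 = edge_starts[g], edge_starts[g + 1]
        shift = new_start - n0  # re-base edge indices onto the new node offset
        out["positions"].extend(batch["positions"][n0:n1])
        out["neighbor_list"].extend(
            [i + shift, j + shift] for i, j in batch["neighbor_list"][e0:e1]
        )
        out["num_nodes_list"].append(n1 - n0)
        out["num_edges_list"].append(e1 - e0)
        new_start += n1 - n0
    return out

flat = {
    "positions": [[float(k), 0.0, 0.0] for k in range(7)],
    "neighbor_list": [[0, 1], [2, 3], [3, 4], [5, 6]],
    "num_nodes_list": [2, 3, 2],
    "num_edges_list": [1, 2, 1],
}
sub_sketch = select_graphs(flat, [0, 2])
```

Graph 2's edge `[5, 6]` becomes `[2, 3]` in the sub-batch, since its atoms now follow directly after graph 0's two atoms.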
Adding keys to a batch#
You can add new tensor keys (e.g. model outputs or extra labels) at node, edge, or
system level with nvalchemi.data.batch.Batch.add_key(). The new key is then
available on the underlying storage and when you call nvalchemi.data.batch.Batch.get_data()
or nvalchemi.data.batch.Batch.to_data_list(), so each nvalchemi.data.AtomicData
gets the correct slice.
batch.add_key("node_feat", [torch.randn(2, 4), torch.randn(3, 4)], level="node")
batch.add_key(
    "energy",
    [torch.tensor([[0.1]]), torch.tensor([[0.2]])],
    level="system",
    overwrite=True,
)
list_of_data = batch.to_data_list()
# list_of_data[i] now has "node_feat" and "energy" with the right shapes.
Device and serialization#
- Device: Use nvalchemi.data.batch.Batch.to(device) or the mixin nvalchemi.data.data.DataMixin.to() on nvalchemi.data.AtomicData. The batch implementation delegates to the underlying storage for efficiency.
- Serialization: nvalchemi.data.AtomicData supports Pydantic serialization (e.g. model_dump, model_dump_json). Tensor fields are serialized to lists in JSON mode.
How Batch stores data internally#
When you call nvalchemi.data.batch.Batch.from_data_list(), the resulting
Batch does not simply stack all tensors along a new “batch” axis. Different kinds
of data need different layouts, and the toolkit uses a storage model that reflects
this.
Every tensor attribute belongs to one of three levels:
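| Level | Storage class | Shape convention | Examples |
|---|---|---|---|
| system | Uniform | First dim = number of graphs | energy, cell, pbc |
| atoms | Segmented | Concatenated across graphs | positions, atomic_numbers, forces |
| edges | Segmented | Concatenated across graphs | neighbor_list, shifts |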
Uniform storage is straightforward: every graph contributes exactly one row, so
the i-th graph’s data is always at index i. System-level properties like the
simulation cell or total energy work this way.
Segmented storage is designed for variable-length data. Positions, for example,
are concatenated into a single tensor of shape (total_nodes, 3). To know where each
graph’s atoms start and end, the storage tracks segment_lengths and a pointer array
batch_ptr. The i-th graph’s nodes live at positions[batch_ptr[i]:batch_ptr[i+1]].
Edge data works the same way, with node-index offsets automatically applied to
neighbor_list so that each graph’s edges still point to the correct atoms in the
flattened array.
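The pointer arithmetic described above can be sketched in a few lines of plain Python. This is an illustrative model only: lists stand in for tensors, and `edge_segment_lengths`, `edge_batch_ptr`, and `get_graph` are hypothetical names, not library attributes.

```python
from itertools import accumulate

# Segmented (node-level) storage: two graphs with 2 and 3 atoms.
segment_lengths = [2, 3]
batch_ptr = list(accumulate(segment_lengths, initial=0))  # [0, 2, 5]
positions = [[float(k), 0.0, 0.0] for k in range(5)]

# Segmented (edge-level) storage; indices already refer to the flat node array.
edge_segment_lengths = [2, 2]
edge_batch_ptr = list(accumulate(edge_segment_lengths, initial=0))  # [0, 2, 4]
neighbor_list = [[0, 1], [1, 0], [2, 3], [3, 4]]

# Uniform (system-level) storage: row i belongs to graph i.
energy = [[-1.0], [-2.0]]

def get_graph(i):
    """Slice graph i back out and undo the node-index offset on its edges."""
    start, stop = batch_ptr[i], batch_ptr[i + 1]
    e0, e1 = edge_batch_ptr[i], edge_batch_ptr[i + 1]
    local_edges = [[a - start, b - start] for a, b in neighbor_list[e0:e1]]
    return positions[start:stop], local_edges, energy[i]

pos1, edges1, energy1 = get_graph(1)
```

Subtracting `batch_ptr[i]` from the stored edge indices recovers the graph-local numbering that the original single-graph object used.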
The mapping from attribute name to level is determined by a
DEFAULT_ATTRIBUTE_MAP. When you add a new key with
add_key(), you explicitly specify the level
("node", "edge", or "system") so the batch knows how to slice it back out when
you call get_data().
Neighbor list formats#
The framework supports two neighbor list representations, configured via
NeighborConfig and populated by
NeighborListHook at the BEFORE_COMPUTE
stage.
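| | MATRIX format | COO format |
|---|---|---|
| Edge indices | neighbor_matrix, shape [N, K] | neighbor_list, shape [E, 2] |
| Per-atom counts | num_neighbors | (derived via edge_ptr) |
| CSR pointer | (not used) | edge_ptr |
| PBC shifts | | neighbor_list_shifts |
| Padding value | N (total atoms in the batch) | (no padding; sparse) |
| Configured via | format field of NeighborConfig | format field of NeighborConfig |
| Used by | Analytical-force models (LJ, Ewald, PME) | GNN-based models (MACE, etc.) |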
Here N is the total number of atoms in the batch, K is the maximum
number of neighbors per atom (max_neighbors), and E is the total
number of edges.
MATRIX format is a dense representation where each atom has a
fixed-width row of K neighbor indices. Unused slots are filled with the
sentinel value N (total atoms in the batch). Valid neighbors for atom i
are in neighbor_matrix[i, :num_neighbors[i]]. This format avoids
dynamic allocation and is used by analytical-force models that iterate over
pair interactions.
COO format is a sparse representation where each edge is an (i, j)
pair in neighbor_list. This format is used by GNN-based models that
operate on edge features. The per-atom CSR pointer edge_ptr is derived
on demand via the edge_ptr property.
Both formats are populated automatically by
NeighborListHook. The format is controlled
by the format field in the model’s
NeighborConfig.
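The relationship between the two layouts can be shown with a small plain-Python conversion. This is a sketch under stated assumptions, not the hook's implementation: lists stand in for tensors, `matrix_to_coo` is a hypothetical helper, and it assumes that COO edges are ordered by source atom so that a CSR pointer can index them.

```python
def matrix_to_coo(neighbor_matrix, num_neighbors):
    """Convert a padded [N, K] neighbor matrix into (i, j) pairs plus a CSR pointer."""
    neighbor_list = []
    edge_ptr = [0]
    for i, row in enumerate(neighbor_matrix):
        # Only the first num_neighbors[i] slots are valid; the rest is padding.
        for j in row[: num_neighbors[i]]:
            neighbor_list.append([i, j])
        edge_ptr.append(len(neighbor_list))
    return neighbor_list, edge_ptr

N = 3  # total atoms; also used as the padding sentinel
neighbor_matrix = [
    [1, 2],  # atom 0 has neighbors 1 and 2
    [0, N],  # atom 1 has neighbor 0, then padding
    [0, N],  # atom 2 has neighbor 0, then padding
]
num_neighbors = [2, 1, 1]

coo, edge_ptr_sketch = matrix_to_coo(neighbor_matrix, num_neighbors)
# Per-atom counts can be recovered from consecutive pointer differences.
counts = [b - a for a, b in zip(edge_ptr_sketch, edge_ptr_sketch[1:])]
```

The dense layout trades memory (N times K slots) for fixed shapes, while the COO layout stores only the E real edges.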
Pre-allocated batches and the buffer API#
For training and data loading, from_data_list creates a batch that fits its data
exactly. High-throughput dynamics simulations, however, often need a fixed-capacity
buffer that you fill and drain without reallocating memory. The dynamics pipeline
uses such buffers to pass data samples point-to-point between stages, which avoids
round trips through host memory and/or file I/O.
Creating an empty buffer#
nvalchemi.data.batch.Batch.empty() allocates a batch with room for a
specified number of systems, nodes, and edges, but with zero graphs initially.
It requires a template (AtomicData or
Batch) that defines which keys to allocate and their
schema:
template = AtomicData(
    positions=torch.zeros(1, 3),
    atomic_numbers=torch.zeros(1, dtype=torch.long),
    forces=torch.zeros(1, 3),
    energy=torch.zeros(1, 1),
    cell=torch.zeros(1, 3, 3),
    pbc=torch.zeros(1, 3, dtype=torch.bool),
)
buffer = Batch.empty(
    num_systems=64,
    num_nodes=4096,
    num_edges=32768,
    template=template,
    device="cuda",
)
All tensors are pre-allocated at the given capacity. The batch’s num_graphs starts
at zero.
Filling the buffer with put#
nvalchemi.data.batch.Batch.put() copies selected graphs from a source batch
into the buffer. A boolean mask selects which graphs to copy:
# Copy the first two graphs from incoming_batch into buffer
mask = torch.tensor([True, True, False, False])
buffer.put(incoming_batch, mask)
The method performs capacity checks to make sure the incoming segments fit, and uses optimized kernels for the data movement.
Compacting with defrag#
After graphs have been consumed (e.g. copied out to another stage), you remove them
with nvalchemi.data.batch.Batch.defrag(). This compacts the remaining graphs
to the front of the buffer so that freed capacity is available again:
# Mark which graphs have been copied out
copied_mask = torch.tensor([True, False, True])
buffer.defrag(copied_mask=copied_mask)
Resetting with zero#
nvalchemi.data.batch.Batch.zero() resets the batch to zero graphs while
keeping the allocated memory in place — useful at the start of a new epoch or
pipeline iteration.
These operations (empty / put / defrag / zero) form the backbone of the
dynamics pipeline’s inflight batching, where systems enter and leave a running
simulation at different times.
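The life cycle of such a buffer can be modeled with a toy class. The sketch below only mirrors the semantics described in this section (fixed capacity, masked put, compaction, reset); `ToyBuffer` is hypothetical, tracks node data only, and says nothing about how the real Batch moves memory.

```python
class ToyBuffer:
    """Fixed-capacity buffer over a flat, pre-allocated node array (illustration only)."""

    def __init__(self, num_systems, num_nodes):
        self.capacity_systems = num_systems
        self.nodes = [None] * num_nodes  # allocated once, never resized
        self.num_nodes_list = []         # lengths of the graphs currently stored

    @property
    def num_graphs(self):
        return len(self.num_nodes_list)

    @property
    def used_nodes(self):
        return sum(self.num_nodes_list)

    def put(self, graphs, mask):
        """Copy the graphs selected by a boolean mask after a capacity check."""
        selected = [g for g, keep in zip(graphs, mask) if keep]
        needed = sum(len(g) for g in selected)
        if (self.num_graphs + len(selected) > self.capacity_systems
                or self.used_nodes + needed > len(self.nodes)):
            raise ValueError("incoming graphs do not fit in the buffer")
        for g in selected:
            start = self.used_nodes
            self.nodes[start:start + len(g)] = g
            self.num_nodes_list.append(len(g))

    def defrag(self, copied_mask):
        """Drop graphs marked as copied and move the rest to the front."""
        kept_nodes, kept_lengths, start = [], [], 0
        for length, copied in zip(self.num_nodes_list, copied_mask):
            if not copied:
                kept_nodes.extend(self.nodes[start:start + length])
                kept_lengths.append(length)
            start += length
        self.nodes[:len(kept_nodes)] = kept_nodes
        self.num_nodes_list = kept_lengths

    def zero(self):
        """Reset to zero graphs while keeping the allocation."""
        self.num_nodes_list = []


buf = ToyBuffer(num_systems=3, num_nodes=6)
incoming = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
buf.put(incoming, [True, True, False])
after_put = (buf.num_graphs, buf.used_nodes)
buf.defrag([True, False])  # graph "a" was consumed downstream
after_defrag = (buf.num_graphs, buf.nodes[:buf.used_nodes])
buf.put(incoming, [False, False, True])  # freed capacity is reused
after_refill = (buf.num_graphs, buf.used_nodes)
buf.zero()
```

Because the node array is never reallocated, the cost of each step is a copy into or within existing memory, which is the property the fixed-capacity design is after.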
ASE Atoms interoperability#
The Atomic Simulation Environment (ASE) is the most widely used Python library for representing and manipulating atomistic systems. The toolkit provides a conversion path so you can move data between ASE and ALCHEMI.
Converting ASE Atoms to AtomicData#
nvalchemi.data.AtomicData.from_atoms() accepts an ase.Atoms object and
returns an nvalchemi.data.AtomicData:
from ase.build import molecule
from nvalchemi.data import AtomicData
atoms = molecule("H2O")
data = AtomicData.from_atoms(atoms, device="cpu")
The conversion maps ASE fields to ALCHEMI fields:
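| ASE source | Field | Notes |
|---|---|---|
| | positions | Always populated |
| | atomic_numbers | Always populated |
| | cell | Reshaped to [1, 3, 3] |
| | pbc | Reshaped to [1, 3] |
| | energy | |
| | forces | |
| | stress | |
| | virial | |
| | dipole | |
| | charges | |
| | charge | |
| | | Always populated |
| | | Arrays, lists, ints, floats kept; bools/strings dropped |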
Optional label fields (energy, forces, stress, virial, dipole,
charges, charge) are populated only when present in the ASE
object; otherwise they remain None. The input atoms object is not mutated.
Keyword arguments (energy_key, forces_key, etc.) let you adapt to different
naming conventions in your ASE dataset.
Atom categories#
AtomicData has an optional atom_categories field
(shape [n_nodes]) that classifies atoms using the
AtomCategory enum. This is used by dynamics hooks
such as FreezeAtomsHook, which freezes atoms
marked as AtomCategory.SPECIAL.
from_atoms does not set atom_categories automatically — you assign it after
construction based on your specific workflow. For example, in a slab+adsorbate
system you can use ASE tags to identify which atoms to freeze:
import torch
from ase.build import fcc111, molecule
from nvalchemi.data import AtomicData
from nvalchemi._typing import AtomCategory
slab = fcc111("Cu", size=(2, 2, 3), vacuum=10.0)
co = molecule("CO")
co.translate([slab.cell[0, 0] / 2, slab.cell[1, 1] / 3,
              slab.positions[:, 2].max() + 1.8])
system = slab + co
data = AtomicData.from_atoms(system)
tags = torch.tensor(system.get_tags())
# tag 0 = adsorbate (free), tag >= 1 = slab (freeze)
data.atom_categories = torch.where(
    tags > 0, AtomCategory.SPECIAL.value, AtomCategory.GAS.value
)
The full set of available categories is documented in
AtomCategory. For simple binary cases (free vs
frozen), the convention is GAS (0) for free atoms and SPECIAL (-1) for
frozen atoms.
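As a library-free illustration of the binary convention, the sketch below maps ASE-style tags to category values and shows one way a freeze step could use them. The constants reuse the values quoted above; `categories_from_tags` and `mask_frozen` are hypothetical helpers, and zeroing per-atom vectors is only an assumption about what "freezing" might involve, not a description of FreezeAtomsHook.

```python
GAS, SPECIAL = 0, -1  # values quoted in the text for free and frozen atoms

def categories_from_tags(tags):
    """Slab atoms (tag >= 1) are frozen; adsorbate atoms (tag 0) stay free."""
    return [SPECIAL if t > 0 else GAS for t in tags]

def mask_frozen(vectors, categories):
    """Return per-atom vectors (e.g. forces) with frozen atoms zeroed out."""
    return [
        [0.0, 0.0, 0.0] if c == SPECIAL else list(v)
        for v, c in zip(vectors, categories)
    ]

tags = [2, 2, 1, 1, 0, 0]  # two slab layers followed by a two-atom adsorbate
categories = categories_from_tags(tags)
forces = [[0.1, 0.2, 0.3]] * len(tags)
masked = mask_frozen(forces, categories)
```

Only the last two atoms keep their forces in this example, so only the adsorbate would move in a subsequent update.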
Building a Batch from a list of Atoms#
There is no special bulk constructor — compose the two operations:
from ase.build import molecule
from nvalchemi.data import AtomicData, Batch
atoms_list = [molecule("H2O"), molecule("CH4")]
batch = Batch.from_data_list([AtomicData.from_atoms(a) for a in atoms_list])
Converting back to ASE Atoms#
The core library does not provide a to_atoms method, since the reverse mapping is
application-specific (e.g. which info keys to preserve, how to handle missing
fields). The examples directory includes a utility function that demonstrates the
reconstruction:
# From examples/basic/03_ase_integration.py
from ase import Atoms
from nvalchemi.data import AtomicData

def data_to_atoms(data: AtomicData) -> Atoms:
    # Move tensors to CPU and drop the leading system dimension of cell and pbc.
    return Atoms(
        numbers=data.atomic_numbers.cpu().numpy(),
        positions=data.positions.cpu().numpy(),
        cell=data.cell.squeeze(0).cpu().numpy() if data.cell is not None else None,
        pbc=data.pbc.squeeze(0).cpu().numpy() if data.pbc is not None else None,
    )
Tip
To convert a Batch to ase.Atoms, first split it into AtomicData objects
with Batch.to_data_list, then convert each AtomicData entry
individually.
See also#
- Examples: The gallery includes AtomicData and Batch: Graph-structured molecular data (basic/01_data_structures.py) for a runnable script.
- API: nvalchemi.data for the full API of AtomicData, Batch, and the zarr-based reader/writer and dataloader.