AtomicData and Batch#

The ALCHEMI Toolkit represents molecular systems as graphs: atoms are nodes, and optional edges (e.g. bonds or radius-cutoff neighbors) connect them. The nvalchemi.data.AtomicData class holds a single graph (one molecule or structure), and nvalchemi.data.Batch batches many such graphs into one structure for efficient GPU-friendly training and inference.

AtomicData: a single graph#

nvalchemi.data.AtomicData is a Pydantic model that stores:

  • Required: positions (shape [n_nodes, 3]) and atomic_numbers (shape [n_nodes]).

  • Optional node-level: e.g. atomic_masses, forces, velocities, node_attrs.

  • Optional edge-level: edge_index (shape [2, n_edges]) and edge attributes such as shifts for periodicity.

  • Optional system-level: energies, cell, pbc, stresses, virials, etc.

All tensor fields are PyTorch tensors, so you can move them to a GPU with .to(device) or use the mixin method nvalchemi.data.data.DataMixin.to() for device and dtype changes.

Example:

import torch
from nvalchemi.data import AtomicData

positions = torch.randn(5, 3)
atomic_numbers = torch.tensor([1, 6, 6, 1, 8], dtype=torch.long)
data = AtomicData(positions=positions, atomic_numbers=atomic_numbers)

# Optional: add system-level labels
data = AtomicData(
    positions=positions,
    atomic_numbers=atomic_numbers,
    energies=torch.tensor([[0.0]]),
)

Properties such as num_nodes, num_edges, and device are available; optional fields default to None when not provided.

Batch: multiple graphs#

nvalchemi.data.Batch is built from a list of nvalchemi.data.AtomicData instances. Node tensors are concatenated along the first dimension; edge tensors are concatenated with node-index offsets so each graph’s edges refer to the correct atoms. System-level tensors are stacked so that the first dimension is the number of graphs.

  • Build a batch: nvalchemi.data.batch.Batch.from_data_list(data_list).

  • Access batch size: num_graphs, num_nodes, num_edges, num_nodes_list, num_edges_list.

  • Recover a single graph: nvalchemi.data.batch.Batch.get_data(index).

  • Recover all graphs: nvalchemi.data.batch.Batch.to_data_list().

Example:

import torch
from nvalchemi.data import AtomicData, Batch

data_list = [
    AtomicData(
        positions=torch.randn(2, 3),
        atomic_numbers=torch.ones(2, dtype=torch.long),
        energies=torch.zeros(1, 1),
    ),
    AtomicData(
        positions=torch.randn(3, 3),
        atomic_numbers=torch.ones(3, dtype=torch.long),
        energies=torch.zeros(1, 1),
    ),
]
batch = Batch.from_data_list(data_list)

print(batch.num_graphs, batch.num_nodes, batch.num_nodes_list)  # 2 graphs, 5 nodes, [2, 3] nodes per graph
first = batch.get_data(0)
again = batch.to_data_list()
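The concatenation rule described above can be sketched without the library. The snippet below is illustrative only: plain Python lists stand in for tensors, and the helper name collate_graphs is hypothetical rather than part of nvalchemi. It shows node rows being concatenated, edge indices being shifted by the running atom count, and system-level rows being stacked one per graph.

```python
# Illustrative only: plain lists stand in for tensors, and `collate_graphs`
# is a hypothetical helper, not part of nvalchemi.

def collate_graphs(graphs):
    """Concatenate node data, offset edge indices, and stack system data."""
    positions, edge_src, edge_dst, energies, num_nodes_list = [], [], [], [], []
    node_offset = 0
    for g in graphs:
        positions.extend(g["positions"])  # node level: concatenate rows
        src, dst = g["edge_index"]
        # Edge level: shift local atom indices by the atoms seen so far.
        edge_src.extend(i + node_offset for i in src)
        edge_dst.extend(j + node_offset for j in dst)
        energies.append(g["energy"])  # system level: one row per graph
        num_nodes_list.append(len(g["positions"]))
        node_offset += len(g["positions"])
    return {
        "positions": positions,
        "edge_index": [edge_src, edge_dst],
        "energies": energies,
        "num_nodes_list": num_nodes_list,
    }


graph_a = {
    "positions": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "edge_index": [[0, 1], [1, 0]],
    "energy": [-1.0],
}
graph_b = {
    "positions": [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "edge_index": [[0, 1, 2], [1, 2, 0]],
    "energy": [-2.0],
}
collated = collate_graphs([graph_a, graph_b])
print(collated["edge_index"])  # [[0, 1, 2, 3, 4], [1, 0, 3, 4, 2]]
```

The second graph's edges are shifted by 2, the number of atoms in the first graph, so they still point at the right rows of the flattened positions list.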

Indexing and selection#

Batch supports bracket indexing that mirrors familiar Python and PyTorch conventions. The type of index determines what you get back:

| Index type | Returns | Example |
|---|---|---|
| str | The raw tensor attribute by name | batch["positions"] |
| int | A single AtomicData (via get_data) | batch[0] |
| slice | A new Batch with the selected graphs | batch[1:3] |
| Tensor / list[int] | A new Batch with the selected graphs | batch[torch.tensor([0, 2])] |

When you select multiple graphs (slice, tensor, or list), the underlying index_select() method operates directly on the concatenated storage: it slices segments and adjusts edge_index offsets without reconstructing individual AtomicData objects, so it stays efficient even for large batches.

# Select a sub-batch of graphs 0 and 2
# (assumes `batch` holds at least three graphs)
sub = batch[torch.tensor([0, 2])]
print(sub.num_graphs)  # 2

# String indexing accesses the raw concatenated tensor
all_positions = batch["positions"]  # shape (total_nodes, 3)
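The offset adjustment that happens during multi-graph selection can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's index_select() implementation: lists stand in for tensors, and the names select_graphs, node_ptr, and edge_ptr are hypothetical.

```python
from itertools import accumulate

# Illustrative only: lists stand in for tensors; `select_graphs`, `node_ptr`,
# and `edge_ptr` are hypothetical names, not nvalchemi API.

def select_graphs(node_ptr, edge_ptr, edge_index, keep):
    """Return (new_node_ptr, new_edge_index) for the graphs listed in `keep`."""
    new_lengths = [node_ptr[g + 1] - node_ptr[g] for g in keep]
    new_node_ptr = [0, *accumulate(new_lengths)]
    new_src, new_dst = [], []
    for new_g, old_g in enumerate(keep):
        # Move each kept edge segment from its old node offset to its new one.
        shift = new_node_ptr[new_g] - node_ptr[old_g]
        for e in range(edge_ptr[old_g], edge_ptr[old_g + 1]):
            new_src.append(edge_index[0][e] + shift)
            new_dst.append(edge_index[1][e] + shift)
    return new_node_ptr, [new_src, new_dst]


# Three graphs with 2, 3, and 2 atoms, and one edge each.
node_ptr = [0, 2, 5, 7]
edge_ptr = [0, 1, 2, 3]
edge_index = [[0, 2, 5], [1, 4, 6]]

new_node_ptr, new_edge_index = select_graphs(
    node_ptr, edge_ptr, edge_index, keep=[0, 2]
)
print(new_node_ptr)    # [0, 2, 4]
print(new_edge_index)  # [[0, 2], [1, 3]]
```

Graph 2 started at atom 5 in the original batch and starts at atom 2 in the selection, so its edge indices are shifted by -3. No per-graph object is rebuilt along the way.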

Adding keys to a batch#

You can add new tensor keys (e.g. model outputs or extra labels) at node, edge, or system level with nvalchemi.data.batch.Batch.add_key(). The new key is then available on the underlying storage and when you call nvalchemi.data.batch.Batch.get_data() or nvalchemi.data.batch.Batch.to_data_list(), so each nvalchemi.data.AtomicData gets the correct slice.

batch.add_key("node_feat", [torch.randn(2, 4), torch.randn(3, 4)], level="node")
batch.add_key(
    "energies",
    [torch.tensor([[0.1]]), torch.tensor([[0.2]])],
    level="system",
    overwrite=True,
)
list_of_data = batch.to_data_list()
# list_of_data[i] now has "node_feat" and "energies" with the right shapes.

Device and serialization#

  • Device: Use nvalchemi.data.batch.Batch.to(device) or the mixin nvalchemi.data.data.DataMixin.to() on nvalchemi.data.AtomicData. The batch implementation delegates to the underlying storage for efficiency.

  • Serialization: nvalchemi.data.AtomicData supports Pydantic serialization (e.g. model_dump, model_dump_json). Tensor fields are serialized to lists in JSON mode.

How Batch stores data internally#

When you call nvalchemi.data.batch.Batch.from_data_list(), the resulting Batch does not simply stack all tensors along a new “batch” axis. Different kinds of data need different layouts, and the toolkit uses a storage model that reflects this.

Every tensor attribute belongs to one of three levels:

| Level | Storage class | Shape convention | Examples |
|---|---|---|---|
| system | UniformLevelStorage | First dim = number of graphs | cell, pbc, energies, stresses |
| atoms | SegmentedLevelStorage | Concatenated across graphs | positions, atomic_numbers, forces |
| edges | SegmentedLevelStorage | Concatenated across graphs | edge_index, shifts, edge_embeddings |

Uniform storage is straightforward: every graph contributes exactly one row, so the i-th graph’s data is always at index i. System-level properties like the simulation cell or total energy work this way.

Segmented storage is designed for variable-length data. Positions, for example, are concatenated into a single tensor of shape (total_nodes, 3). To know where each graph’s atoms start and end, the storage tracks segment_lengths and a pointer array batch_ptr. The i-th graph’s nodes live at positions[batch_ptr[i]:batch_ptr[i+1]]. Edge data works the same way, with node-index offsets automatically applied to edge_index so that each graph’s edges still point to the correct atoms in the flattened array.
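The pointer arithmetic behind segmented storage is small enough to sketch in plain Python. The snippet below is illustrative only: a list of rows stands in for the (total_nodes, 3) tensor, and graph_nodes is a hypothetical helper. It assumes batch_ptr is the cumulative sum of segment_lengths with a leading 0, which matches the slicing rule stated above.

```python
from itertools import accumulate

# Illustrative only: a list of 5 rows stands in for a (total_nodes, 3) tensor.
segment_lengths = [2, 3]  # atoms per graph
# Cumulative sum with a leading 0 (assumed layout of batch_ptr).
batch_ptr = [0, *accumulate(segment_lengths)]

positions = [[float(i)] * 3 for i in range(sum(segment_lengths))]


def graph_nodes(i):
    """Rows of the flattened node array that belong to graph i."""
    return positions[batch_ptr[i]:batch_ptr[i + 1]]


print(batch_ptr)            # [0, 2, 5]
print(len(graph_nodes(1)))  # 3
```

Subtracting batch_ptr[i] from a global atom index in graph i recovers its local index, which is the inverse of the offset applied to edge_index during batching.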

The mapping from attribute name to level is determined by a DEFAULT_ATTRIBUTE_MAP. When you add a new key with add_key(), you explicitly specify the level ("node", "edge", or "system") so the batch knows how to slice it back out when you call get_data().

Pre-allocated batches and the buffer API#

For training and data loading, from_data_list creates a batch that fits its data exactly. In high-throughput dynamics simulations, however, you often need a fixed-capacity buffer that you fill and drain without reallocating memory. The dynamics pipeline uses such buffers to pass data samples point-to-point between stages, which avoids going through host memory or file I/O.

Creating an empty buffer#

nvalchemi.data.batch.Batch.empty() allocates a batch with room for a specified number of systems, nodes, and edges, but with zero graphs initially. It requires a template (AtomicData or Batch) that defines which keys to allocate and their schema:

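import torch
from nvalchemi.data import AtomicData, Batch

# The template defines which keys to allocate and their schema.
template = AtomicData(
    positions=torch.zeros(1, 3),
    atomic_numbers=torch.zeros(1, dtype=torch.long),
    forces=torch.zeros(1, 3),
    energies=torch.zeros(1, 1),
    cell=torch.zeros(1, 3, 3),
    pbc=torch.zeros(1, 3, dtype=torch.bool),
)
# Capacity: room for 64 systems, 4096 nodes, and 32768 edges.
buffer = Batch.empty(
    num_systems=64,
    num_nodes=4096,
    num_edges=32768,
    template=template,
    device="cuda",  # requires a CUDA-capable GPU
)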

All tensors are pre-allocated at the given capacity. The batch’s num_graphs starts at zero.

Filling the buffer with put#

nvalchemi.data.batch.Batch.put() copies selected graphs from a source batch into the buffer. A boolean mask selects which graphs to copy:

# incoming_batch is assumed to be an existing Batch that holds four graphs.
# Copy its first two graphs into buffer.
mask = torch.tensor([True, True, False, False])
buffer.put(incoming_batch, mask)

The method performs capacity checks to make sure the incoming segments fit, and uses optimized kernels for the data movement.

Compacting with defrag#

After graphs have been consumed (e.g. copied out to another stage), you remove them with nvalchemi.data.batch.Batch.defrag(). This compacts the remaining graphs to the front of the buffer so that freed capacity is available again:

# Mark which graphs have been copied out
copied_mask = torch.tensor([True, False, True])
buffer.defrag(copied_mask=copied_mask)

Resetting with zero#

nvalchemi.data.batch.Batch.zero() resets the batch to zero graphs while keeping the allocated memory in place, which is useful at the start of a new epoch or pipeline iteration.

These operations (empty / put / defrag / zero) form the backbone of the dynamics pipeline’s inflight batching, where systems enter and leave a running simulation at different times.
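The fill-and-drain life cycle can be illustrated with a toy model. The class below is a hypothetical stand-in, not the library's implementation: it tracks whole graphs in a fixed-size Python list, whereas Batch works on pre-allocated tensors with separate node and edge capacities. It only mirrors the semantics described above: put appends masked graphs after a capacity check, defrag drops consumed graphs and compacts the rest, and zero clears the count while keeping the allocation.

```python
# Illustrative only: `ToyBuffer` is a hypothetical stand-in that stores whole
# graphs in a fixed-size list; the real Batch uses pre-allocated tensors.

class ToyBuffer:
    def __init__(self, num_systems):
        self.slots = [None] * num_systems  # allocated once, never resized
        self.num_graphs = 0

    def put(self, source, mask):
        """Append the graphs of `source` for which `mask` is True."""
        incoming = [g for g, keep in zip(source, mask) if keep]
        if self.num_graphs + len(incoming) > len(self.slots):
            raise ValueError("incoming graphs exceed buffer capacity")
        for g in incoming:
            self.slots[self.num_graphs] = g
            self.num_graphs += 1

    def defrag(self, copied_mask):
        """Drop graphs flagged in `copied_mask`; move the rest to the front."""
        live = self.slots[: self.num_graphs]
        kept = [g for g, copied in zip(live, copied_mask) if not copied]
        self.slots[: len(kept)] = kept
        self.num_graphs = len(kept)

    def zero(self):
        """Forget all graphs but keep the allocation."""
        self.num_graphs = 0


toy = ToyBuffer(num_systems=4)
toy.put(["A", "B", "C", "D"], mask=[True, True, True, False])
toy.defrag(copied_mask=[True, False, True])
print(toy.slots[: toy.num_graphs])  # ['B']
toy.put(["E", "F"], mask=[True, True])
print(toy.slots[: toy.num_graphs])  # ['B', 'E', 'F']
toy.zero()
print(toy.num_graphs, len(toy.slots))  # 0 4
```

The capacity never changes after construction, which is the property that lets systems enter and leave a running simulation without reallocation.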

ASE Atoms interoperability#

The Atomic Simulation Environment (ASE) is the most widely used Python library for representing and manipulating atomistic systems. The toolkit provides a built-in conversion from ase.Atoms to AtomicData, and the examples directory shows how to convert back.

Converting ASE Atoms to AtomicData#

nvalchemi.data.AtomicData.from_atoms() accepts an ase.Atoms object and returns an nvalchemi.data.AtomicData:

from ase.build import molecule
from nvalchemi.data import AtomicData

atoms = molecule("H2O")
data = AtomicData.from_atoms(atoms, device="cpu")

The conversion maps ASE fields to ALCHEMI fields:

| ASE source | AtomicData field | Notes |
|---|---|---|
| atoms.numbers | atomic_numbers | |
| atoms.positions | positions | |
| atoms.get_pbc() | pbc | Reshaped to (1, 3) |
| atoms.get_cell() | cell | Reshaped to (1, 3, 3) |
| atoms.info[energy_key] | energies | Default key: "energy" |
| atoms.arrays[forces_key] | forces | Default key: "forces" |
| atoms.info[stress_key] | stresses | Voigt vector converted to (1, 3, 3) matrix |
| atoms.info[virials_key] | virials | Voigt vector converted to (1, 3, 3) matrix |
| atoms.get_tags() | atom_categories | 0 = GAS, 1 = SURFACE, 2+ = BULK |
| atoms.get_masses() | atomic_masses | |
| atoms.info (remaining) | preserved | Filtered to tensors, arrays, and floats |

Keyword arguments (energy_key, forces_key, etc.) let you adapt to different naming conventions in your ASE dataset.
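The Voigt-to-matrix step listed for stresses and virials can be sketched as below. This is an illustration under stated assumptions: nested lists stand in for tensors, voigt_to_matrix is a hypothetical helper, and the component order (xx, yy, zz, yz, xz, xy) is ASE's Voigt convention. Whether from_atoms() performs the conversion exactly this way is not confirmed here.

```python
# Illustrative only: nested lists stand in for tensors. The ordering
# (xx, yy, zz, yz, xz, xy) is ASE's Voigt convention; that from_atoms()
# uses exactly this routine is an assumption.

def voigt_to_matrix(voigt):
    """Expand a 6-component Voigt vector into a (1, 3, 3) symmetric matrix."""
    xx, yy, zz, yz, xz, xy = voigt
    matrix = [
        [xx, xy, xz],
        [xy, yy, yz],
        [xz, yz, zz],
    ]
    return [matrix]  # leading axis of length 1: one system


stress = voigt_to_matrix([1.0, 2.0, 3.0, 0.4, 0.5, 0.6])
print(stress[0][0])  # [1.0, 0.6, 0.5]
```

The leading axis of length 1 matches the system-level shape convention, so the result can be stacked along the first dimension when graphs are batched.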

Building a Batch from a list of Atoms#

There is no dedicated bulk constructor; compose the two operations:

from ase.build import molecule
from nvalchemi.data import AtomicData, Batch

atoms_list = [molecule("H2O"), molecule("CH4")]
batch = Batch.from_data_list([AtomicData.from_atoms(a) for a in atoms_list])

Converting back to ASE Atoms#

The core library does not provide a to_atoms method, since the reverse mapping is application-specific (e.g. which info keys to preserve, how to handle missing fields). The examples directory includes a utility function that demonstrates the reconstruction:

# From examples/04_ase_dynamics_example.py
from ase import Atoms
from nvalchemi.data import AtomicData  # needed for the type annotation

def data_to_atoms(data: AtomicData) -> Atoms:
    # cell and pbc carry a leading system axis of length 1; drop it for ASE
    return Atoms(
        numbers=data.atomic_numbers.cpu().numpy(),
        positions=data.positions.cpu().numpy(),
        cell=data.cell.squeeze(0).cpu().numpy() if data.cell is not None else None,
        pbc=data.pbc.squeeze(0).cpu().numpy() if data.pbc is not None else None,
    )

Tip

To convert a Batch to ase.Atoms, first split it into AtomicData objects with Batch.to_data_list(), then convert each entry individually.

See also#

  • Examples: The gallery includes a runnable script, "AtomicData and Batch: Graph-structured molecular data" (01_data_example.py).

  • API: See nvalchemi.data for the full API of AtomicData, Batch, the Zarr-based reader/writer, and the dataloader.