<!-- markdownlint-disable MD014 -->

(data_guide)=

# AtomicData and Batch

The ALCHEMI Toolkit represents molecular systems as **graphs**: atoms are nodes, and
optional edges (e.g. bonds or radius-cutoff neighbors) connect them. The
{py:class}`nvalchemi.data.AtomicData` class holds a single graph (one molecule or
structure), and {py:class}`nvalchemi.data.Batch` batches many such graphs into one
structure for efficient GPU-friendly training and inference.

## AtomicData: a single graph

{py:class}`nvalchemi.data.AtomicData` is a Pydantic model that stores:

- **Required**: `positions` (shape `[n_nodes, 3]`) and `atomic_numbers` (shape `[n_nodes]`).
- **Optional node-level**: e.g. `atomic_masses`, `forces`, `velocities`, `node_attrs`.
- **Optional edge-level**: `edge_index` (shape `[2, n_edges]`) and edge attributes such
as `shifts` for periodicity.
- **Optional system-level**: `energies`, `cell`, `pbc`, `stresses`, `virials`, etc.

All tensor fields use PyTorch tensors, so you can move them to GPU with `.to(device)` or
use the mixin method {py:meth}`nvalchemi.data.data.DataMixin.to` for device/dtype changes.

Example:

```python
import torch
from nvalchemi.data import AtomicData

positions = torch.randn(5, 3)
atomic_numbers = torch.tensor([1, 6, 6, 1, 8], dtype=torch.long)
data = AtomicData(positions=positions, atomic_numbers=atomic_numbers)

# Optional: add system-level labels
data = AtomicData(
    positions=positions,
    atomic_numbers=atomic_numbers,
    energies=torch.tensor([[0.0]]),
)
```

Properties such as `num_nodes`, `num_edges`, and `device` are available; optional
fields default to `None` when not provided.

## Batch: multiple graphs

{py:class}`nvalchemi.data.Batch` is built from a **list** of {py:class}`nvalchemi.data.AtomicData`
instances. Node tensors are concatenated along the first dimension; edge tensors are
concatenated with node-index offsets so each graph’s edges refer to the correct atoms.
System-level tensors are stacked so that the first dimension is the number of graphs.

- Build a batch: {py:meth}`nvalchemi.data.batch.Batch.from_data_list`\ (data_list).
- Inspect sizes: `num_graphs`, `num_nodes`, `num_edges`, `num_nodes_list`, `num_edges_list`.
- Recover a single graph: {py:meth}`nvalchemi.data.batch.Batch.get_data`\ (index).
- Recover all graphs: {py:meth}`nvalchemi.data.batch.Batch.to_data_list`\ ().

Example:

```python
import torch
from nvalchemi.data import AtomicData, Batch

data_list = [
    AtomicData(
        positions=torch.randn(2, 3),
        atomic_numbers=torch.ones(2, dtype=torch.long),
        energies=torch.zeros(1, 1),
    ),
    AtomicData(
        positions=torch.randn(3, 3),
        atomic_numbers=torch.ones(3, dtype=torch.long),
        energies=torch.zeros(1, 1),
    ),
]
batch = Batch.from_data_list(data_list)

print(batch.num_graphs, batch.num_nodes, batch.num_nodes_list)  # 2 5 [2, 3]
first = batch.get_data(0)
again = batch.to_data_list()
```

### Indexing and selection

`Batch` supports bracket indexing that mirrors familiar Python and PyTorch
conventions. The type of index determines what you get back:

| Index type | Returns | Example |
|------------|---------|---------|
| `str` | The raw tensor attribute by name | `batch["positions"]` |
| `int` | A single {py:class}`~nvalchemi.data.AtomicData` (via `get_data`) | `batch[0]` |
| `slice` | A new {py:class}`~nvalchemi.data.Batch` with the selected graphs | `batch[1:3]` |
| `Tensor` / `list[int]` | A new {py:class}`~nvalchemi.data.Batch` with the selected graphs | `batch[torch.tensor([0, 2])]` |

When selecting multiple graphs (slice, tensor, or list), the underlying
{py:meth}`~nvalchemi.data.batch.Batch.index_select` method operates directly on the
concatenated storage --- it slices segments and adjusts `edge_index` offsets without
reconstructing individual `AtomicData` objects, so it is efficient even for large
batches.

```python
# Select a sub-batch of graphs 0 and 2 (assumes `batch` holds at least three graphs)
sub = batch[torch.tensor([0, 2])]
print(sub.num_graphs)  # 2

# String indexing accesses the raw concatenated tensor
all_positions = batch["positions"]  # shape (total_nodes, 3)
```
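
To make the offset adjustment concrete, here is a plain-Python sketch of selecting
graphs from concatenated storage. It is not the library's `index_select`
implementation: the function name `select_graphs` and the list-based layout are
illustrative assumptions, with lists standing in for tensors.

```python
def select_graphs(num_nodes_list, num_edges_list, edge_index, keep):
    """Pick the graphs in `keep` out of a concatenated edge list (illustrative only).

    `edge_index` is a pair of lists [sources, targets] holding global node ids.
    Returns the node counts and the re-offset edge_index of the sub-batch.
    """
    # Start offsets of every graph's node and edge segment in the flat storage.
    node_start, edge_start = [0], [0]
    for n, e in zip(num_nodes_list, num_edges_list):
        node_start.append(node_start[-1] + n)
        edge_start.append(edge_start[-1] + e)

    new_counts, new_src, new_dst = [], [], []
    new_offset = 0  # where the next kept graph's nodes begin in the sub-batch
    for g in keep:
        shift = new_offset - node_start[g]  # maps old global ids to new global ids
        for k in range(edge_start[g], edge_start[g + 1]):
            new_src.append(edge_index[0][k] + shift)
            new_dst.append(edge_index[1][k] + shift)
        new_counts.append(num_nodes_list[g])
        new_offset += num_nodes_list[g]
    return new_counts, [new_src, new_dst]


# Three graphs with 2, 3, and 2 nodes; one edge each, stored with global ids.
edge_index = [[0, 2, 5], [1, 4, 6]]
counts, sub_edges = select_graphs([2, 3, 2], [1, 1, 1], edge_index, keep=[0, 2])
print(counts)     # [2, 2]
print(sub_edges)  # [[0, 2], [1, 3]]
```

The edge of the third graph moves from nodes `(5, 6)` to `(2, 3)` because the
three nodes of the dropped middle graph no longer precede it.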

## Adding keys to a batch

You can add new tensor keys (e.g. model outputs or extra labels) at node, edge, or
system level with {py:meth}`nvalchemi.data.batch.Batch.add_key`. The new key is then
available on the underlying storage and when you call {py:meth}`nvalchemi.data.batch.Batch.get_data`
or {py:meth}`nvalchemi.data.batch.Batch.to_data_list`, so each {py:class}`nvalchemi.data.AtomicData`
gets the correct slice.

```python
batch.add_key("node_feat", [torch.randn(2, 4), torch.randn(3, 4)], level="node")
batch.add_key(
    "energies",
    [torch.tensor([[0.1]]), torch.tensor([[0.2]])],
    level="system",
    overwrite=True,
)
list_of_data = batch.to_data_list()
# list_of_data[i] now has "node_feat" and "energies" with the right shapes.
```

## Device and serialization

- **Device**: Use {py:meth}`nvalchemi.data.batch.Batch.to`\ (device) or the mixin
  {py:meth}`nvalchemi.data.data.DataMixin.to` on {py:class}`nvalchemi.data.AtomicData`.
  The batch implementation delegates to the underlying storage for efficiency.
- **Serialization**: {py:class}`nvalchemi.data.AtomicData` supports Pydantic
  serialization (e.g. `model_dump`, `model_dump_json`). Tensor fields are serialized
  to lists in JSON mode.

## How Batch stores data internally

When you call {py:meth}`nvalchemi.data.batch.Batch.from_data_list`, the resulting
`Batch` does not simply stack all tensors along a new "batch" axis. Different kinds
of data need different layouts, and the toolkit uses a storage model that reflects
this.

Every tensor attribute belongs to one of three **levels**:

| Level | Storage class | Shape convention | Examples |
|-----------|----------------------------|--------------------------------------|---------------------------------------------|
| **system** | {py:class}`~nvalchemi.data.level_storage.UniformLevelStorage` | First dim = number of graphs | `cell`, `pbc`, `energies`, `stresses` |
| **atoms** | {py:class}`~nvalchemi.data.level_storage.SegmentedLevelStorage` | Concatenated across graphs | `positions`, `atomic_numbers`, `forces` |
| **edges** | {py:class}`~nvalchemi.data.level_storage.SegmentedLevelStorage` | Concatenated across graphs | `edge_index`, `shifts`, `edge_embeddings` |

**Uniform storage** is straightforward: every graph contributes exactly one row, so
the i-th graph's data is always at index `i`. System-level properties like the
simulation cell or total energy work this way.

**Segmented storage** is designed for variable-length data. Positions, for example,
are concatenated into a single tensor of shape `(total_nodes, 3)`. To know where each
graph's atoms start and end, the storage tracks `segment_lengths` and a pointer array
`batch_ptr`. The i-th graph's nodes live at `positions[batch_ptr[i]:batch_ptr[i+1]]`.
Edge data works the same way, with node-index offsets automatically applied to
`edge_index` so that each graph's edges still point to the correct atoms in the
flattened array.
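
The layout described above can be sketched with plain Python lists standing in for
tensors. This is an illustration of the idea, not the code inside
`SegmentedLevelStorage`; the variable names `positions_per_graph` and
`edges_per_graph` are hypothetical.

```python
from itertools import accumulate

# Two graphs with 2 and 3 atoms (lists stand in for tensors).
positions_per_graph = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
# Edges as [sources, targets], using node ids local to each graph.
edges_per_graph = [
    [[0], [1]],
    [[0, 1], [1, 2]],
]

# Segment lengths and the pointer array derived from them.
segment_lengths = [len(p) for p in positions_per_graph]
batch_ptr = [0, *accumulate(segment_lengths)]

# Node data is concatenated along the first dimension.
positions = [row for p in positions_per_graph for row in p]

# Edge indices are shifted by the node offset of the graph they belong to.
edge_index = [[], []]
for offset, (src, dst) in zip(batch_ptr, edges_per_graph):
    edge_index[0] += [s + offset for s in src]
    edge_index[1] += [d + offset for d in dst]

print(batch_ptr)   # [0, 2, 5]
print(edge_index)  # [[0, 2, 3], [1, 3, 4]]

# Recover the second graph's nodes with the pointer array.
second = positions[batch_ptr[1]:batch_ptr[2]]
print(len(second))  # 3
```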

The mapping from attribute name to level is determined by a
{py:obj}`~nvalchemi.data.level_storage.DEFAULT_ATTRIBUTE_MAP`. When you add a new key with
{py:meth}`~nvalchemi.data.batch.Batch.add_key`, you explicitly specify the level
(`"node"`, `"edge"`, or `"system"`) so the batch knows how to slice it back out when
you call {py:meth}`~nvalchemi.data.batch.Batch.get_data`.

## Pre-allocated batches and the buffer API

For training and data loading, `from_data_list` creates a batch that fits its data
exactly. High-throughput dynamics simulations, however, often need a **fixed-capacity
buffer** that you fill and drain without reallocating memory. The dynamics pipeline uses
such buffers to pass data samples point-to-point between stages, which removes the need
for host and/or file I/O.

### Creating an empty buffer

{py:meth}`nvalchemi.data.batch.Batch.empty` allocates a batch with room for a
specified number of systems, nodes, and edges, but with zero graphs initially.
It requires a `template` ({py:class}`~nvalchemi.data.AtomicData` or
{py:class}`~nvalchemi.data.Batch`) that defines which keys to allocate and their
schema:

```python
import torch
from nvalchemi.data import AtomicData, Batch

template = AtomicData(
    positions=torch.zeros(1, 3),
    atomic_numbers=torch.zeros(1, dtype=torch.long),
    forces=torch.zeros(1, 3),
    energies=torch.zeros(1, 1),
    cell=torch.zeros(1, 3, 3),
    pbc=torch.zeros(1, 3, dtype=torch.bool),
)
buffer = Batch.empty(
    num_systems=64,
    num_nodes=4096,
    num_edges=32768,
    template=template,
    device="cuda",
)
```

All tensors are pre-allocated at the given capacity. The batch's `num_graphs` starts
at zero.

### Filling the buffer with `put`

{py:meth}`nvalchemi.data.batch.Batch.put` copies selected graphs from a source batch
into the buffer. A boolean `mask` selects which graphs to copy:

```python
# Copy the first two graphs from incoming_batch (an existing Batch of four graphs) into buffer
mask = torch.tensor([True, True, False, False])
buffer.put(incoming_batch, mask)
```

The method performs capacity checks to make sure the incoming segments fit, and uses
optimized kernels for the data movement.
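
As a rough illustration of what such a capacity check involves, the sketch below
tests whether the masked graphs fit into the remaining system, node, and edge
budgets at the same time. The function `fits` and its arguments are hypothetical
and not part of the `Batch` API; how `put` itself reports a failed check is not
covered here.

```python
def fits(used, capacity, incoming_nodes, incoming_edges, mask):
    """Return True if the masked incoming graphs fit in the remaining capacity.

    `used` and `capacity` are (systems, nodes, edges) tuples; `incoming_nodes`
    and `incoming_edges` give the per-graph sizes of the source batch.
    """
    need_systems = sum(mask)
    need_nodes = sum(n for n, m in zip(incoming_nodes, mask) if m)
    need_edges = sum(e for e, m in zip(incoming_edges, mask) if m)
    needed = (need_systems, need_nodes, need_edges)
    # Every budget must hold; running out of any one of them rejects the copy.
    return all(u + n <= c for u, n, c in zip(used, needed, capacity))


capacity = (64, 4096, 32768)
used = (62, 4000, 30000)
mask = [True, True, False, False]

ok = fits(used, capacity, [40, 50, 60, 70], [400, 500, 600, 700], mask)
too_many_nodes = fits(used, capacity, [40, 60, 60, 70], [400, 500, 600, 700], mask)
print(ok)              # True: 64 systems, 4090 nodes, 30900 edges
print(too_many_nodes)  # False: 4100 nodes exceed the 4096-node capacity
```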

### Compacting with `defrag`

After graphs have been consumed (e.g. copied out to another stage), you remove them
with {py:meth}`nvalchemi.data.batch.Batch.defrag`. This compacts the remaining graphs
to the front of the buffer so that freed capacity is available again:

```python
# Mark which graphs have been copied out
copied_mask = torch.tensor([True, False, True])
buffer.defrag(copied_mask=copied_mask)
```
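
The compaction step can be pictured with a small plain-Python sketch operating on
one flat, fixed-capacity buffer. It assumes that `True` in the mask marks a graph
to drop, matching the example above; the function `defrag_segments` is hypothetical
and does not reflect the optimized implementation in the library.

```python
def defrag_segments(buffer, lengths, copied_mask):
    """Compact a fixed-capacity flat buffer in place (illustrative only).

    `buffer` holds the node rows of all graphs back to back, followed by
    unused slots. Graphs whose mask entry is True are dropped; the rest are
    moved to the front in their original order. Returns the new lengths.
    """
    read = 0   # start of the current graph's segment
    write = 0  # next free slot at the front of the buffer
    kept_lengths = []
    for length, copied in zip(lengths, copied_mask):
        if not copied:
            # write <= read always holds, so a forward copy never clobbers unread rows.
            for k in range(length):
                buffer[write + k] = buffer[read + k]
            write += length
            kept_lengths.append(length)
        read += length
    # Clear the freed tail so stale rows are not mistaken for live data.
    for k in range(write, len(buffer)):
        buffer[k] = None
    return kept_lengths


# Capacity 8; three graphs with 2, 3, and 1 rows; the last two slots are unused.
buffer = ["a0", "a1", "b0", "b1", "b2", "c0", None, None]
lengths = defrag_segments(buffer, [2, 3, 1], copied_mask=[True, False, True])
print(lengths)  # [3]
print(buffer)   # ['b0', 'b1', 'b2', None, None, None, None, None]
```

The buffer length never changes; only the occupied prefix shrinks, which is what
makes the freed capacity reusable without reallocation.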

### Resetting with `zero`

{py:meth}`nvalchemi.data.batch.Batch.zero` resets the batch to zero graphs while
keeping the allocated memory in place --- useful at the start of a new epoch or
pipeline iteration.

These operations (`empty` / `put` / `defrag` / `zero`) form the backbone of the
dynamics pipeline's inflight batching, where systems enter and leave a running
simulation at different times.

## ASE Atoms interoperability

The [Atomic Simulation Environment (ASE)](https://ase-lib.org/about.html) is one of the
most widely used Python libraries for representing and manipulating atomistic systems.
The toolkit converts `ase.Atoms` objects to `AtomicData` directly; the reverse direction
takes a short helper function, shown at the end of this section.

### Converting ASE Atoms to AtomicData

{py:meth}`nvalchemi.data.AtomicData.from_atoms` accepts an `ase.Atoms` object and
returns an {py:class}`nvalchemi.data.AtomicData`:

```python
from ase.build import molecule
from nvalchemi.data import AtomicData

atoms = molecule("H2O")
data = AtomicData.from_atoms(atoms, device="cpu")
```

The conversion maps ASE fields to ALCHEMI fields:

| ASE source | AtomicData field | Notes |
|-------------------------------|----------------------|-----------------------------------------------|
| `atoms.numbers` | `atomic_numbers` | |
| `atoms.positions` | `positions` | |
| `atoms.get_pbc()` | `pbc` | Reshaped to `(1, 3)` |
| `atoms.get_cell()` | `cell` | Reshaped to `(1, 3, 3)` |
| `atoms.info[energy_key]` | `energies` | Default key: `"energy"` |
| `atoms.arrays[forces_key]` | `forces` | Default key: `"forces"` |
| `atoms.info[stress_key]` | `stresses` | Voigt vector converted to `(1, 3, 3)` matrix |
| `atoms.info[virials_key]` | `virials` | Voigt vector converted to `(1, 3, 3)` matrix |
| `atoms.get_tags()` | `atom_categories` | 0 = GAS, 1 = SURFACE, 2+ = BULK |
| `atoms.get_masses()` | `atomic_masses` | |
| `atoms.info` (remaining) | preserved | Filtered to tensors, arrays, and floats |

Keyword arguments (`energy_key`, `forces_key`, etc.) let you adapt to different
naming conventions in your ASE dataset.
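
For readers unfamiliar with the Voigt form used for `stresses` and `virials`, the
sketch below expands a 6-component vector into a symmetric matrix. It assumes ASE's
component order (`xx, yy, zz, yz, xz, xy`) and uses nested lists in place of
tensors; the helper `voigt_to_matrix` is hypothetical and not a function of the
toolkit.

```python
def voigt_to_matrix(voigt):
    """Expand a 6-component Voigt vector into a symmetric 3x3 matrix.

    Assumes ASE's component order: xx, yy, zz, yz, xz, xy.
    """
    xx, yy, zz, yz, xz, xy = voigt
    return [
        [xx, xy, xz],
        [xy, yy, yz],
        [xz, yz, zz],
    ]


stress_voigt = [1.0, 2.0, 3.0, 0.4, 0.5, 0.6]
# Wrapping in a list adds the leading system axis, giving shape (1, 3, 3).
stress = [voigt_to_matrix(stress_voigt)]
print(stress[0][0])  # [1.0, 0.6, 0.5]
print(stress[0][1])  # [0.6, 2.0, 0.4]
print(stress[0][2])  # [0.5, 0.4, 3.0]
```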

### Building a Batch from a list of Atoms

There is no special bulk constructor --- compose the two operations:

```python
from ase.build import molecule
from nvalchemi.data import AtomicData, Batch

atoms_list = [molecule("H2O"), molecule("CH4")]
batch = Batch.from_data_list([AtomicData.from_atoms(a) for a in atoms_list])
```

### Converting back to ASE Atoms

The core library does not provide a `to_atoms` method, since the reverse mapping is
application-specific (e.g. which `info` keys to preserve, how to handle missing
fields). The examples directory includes a utility function that demonstrates the
reconstruction:

```python
# From examples/04_ase_dynamics_example.py
from ase import Atoms
from nvalchemi.data import AtomicData

def data_to_atoms(data: AtomicData) -> Atoms:
    return Atoms(
        numbers=data.atomic_numbers.cpu().numpy(),
        positions=data.positions.cpu().numpy(),
        cell=data.cell.squeeze(0).cpu().numpy() if data.cell is not None else None,
        pbc=data.pbc.squeeze(0).cpu().numpy() if data.pbc is not None else None,
    )
```

```{tip}
To convert a ``Batch`` to ``ase.Atoms``, first split it into ``AtomicData`` entries
with ``Batch.to_data_list``, then convert each entry individually.
```

## See also

- **Examples**: The gallery includes **AtomicData and Batch: Graph-structured molecular data**
  (``01_data_example.py``) for a runnable script.
- **API**: {py:mod}`nvalchemi.data` for the full API of AtomicData, Batch, and the
  zarr-based reader/writer and dataloader.