nvalchemi.data.AtomicDataZarrWriter#

class nvalchemi.data.AtomicDataZarrWriter(store)[source]#

Writer for serializing AtomicData into Zarr stores.

Writes AtomicData objects into a structured Zarr store with CSR-style pointer arrays for variable-size graph data. Supports single writes, batch writes, appending, custom fields, soft-delete, and defragmentation.

The Zarr store layout is:

dataset.zarr/
├── meta/                    # Pointer arrays + masks
│   ├── atoms_ptr            # int64 [N+1] — cumulative node counts
│   ├── edges_ptr            # int64 [N+1] — cumulative edge counts
│   ├── samples_mask         # bool [N] — False = deleted sample
│   ├── atoms_mask           # bool [V_total] — False = deleted atom
│   └── edges_mask           # bool [E_total] — False = deleted edge
│
├── core/                    # AtomicData fields (auto-populated)
│   ├── atomic_numbers       # int64 [V_total]
│   ├── positions            # float32 [V_total, 3]
│   └── …
│
├── custom/                  # User-defined arrays (optional)
│   └── <user_key>           # any dtype, any shape
│
└── .zattrs                  # root metadata
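The `meta/*_ptr` arrays follow the standard CSR convention: sample `i` owns the half-open slice `[ptr[i], ptr[i+1])` of each concatenated per-atom (or per-edge) array. A minimal sketch of that indexing, using plain Python lists in place of the stored int64 arrays (all values hypothetical):

```python
from itertools import accumulate

# Per-sample atom counts for N = 3 samples (hypothetical values).
atom_counts = [2, 5, 3]

# CSR-style pointer array: int64 [N+1], starting at 0.
atoms_ptr = [0] + list(accumulate(atom_counts))

# The flat core/ arrays hold V_total = atoms_ptr[-1] rows; sample i
# owns the half-open slice atoms_ptr[i] : atoms_ptr[i + 1].
atomic_numbers = [1, 8, 6, 6, 1, 1, 1, 8, 1, 1]  # int64 [V_total]

def sample_slice(flat, ptr, i):
    """Recover sample i's rows from a concatenated array."""
    return flat[ptr[i]:ptr[i + 1]]

print(atoms_ptr)                                   # [0, 2, 7, 10]
print(sample_slice(atomic_numbers, atoms_ptr, 1))  # [6, 6, 1, 1, 1]
```

The same convention applies to `edges_ptr` and the per-edge arrays.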

Parameters:

store (StoreLike) – Any zarr-compatible store: a filesystem path (str or Path), a zarr Store instance (LocalStore, MemoryStore, FsspecStore, etc.), a StorePath, or a dict for in-memory buffer storage.

_store#

The zarr store used for I/O.

Type:

StoreLike

add_custom(key, data, level)[source]#

Add a custom array to the custom/ group.

Parameters:
  • key (str) – Name for the custom array.

  • data (torch.Tensor) – Tensor data. First dimension must match:

    - num_samples for “system” level

    - total atoms for “atom” level

    - total edges for “edge” level

  • level (str) – One of “atom”, “edge”, “system”.

Raises:
  • ValueError – If level is invalid or data shape doesn’t match.

  • FileNotFoundError – If store does not exist.

Return type:

None
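The shape contract above can be summarized as a small lookup. This is an illustrative sketch of the validation rule only, not the library's implementation; the pointer values are hypothetical:

```python
def expected_rows(level, num_samples, atoms_ptr, edges_ptr):
    """First-dimension size a custom array must have at each level."""
    sizes = {
        "system": num_samples,   # one row per sample
        "atom": atoms_ptr[-1],   # V_total: total atoms across samples
        "edge": edges_ptr[-1],   # E_total: total edges across samples
    }
    if level not in sizes:
        raise ValueError(f"invalid level: {level!r}")
    return sizes[level]

atoms_ptr = [0, 2, 7, 10]    # hypothetical store with 3 samples
edges_ptr = [0, 4, 14, 20]

print(expected_rows("atom", 3, atoms_ptr, edges_ptr))    # 10
print(expected_rows("system", 3, atoms_ptr, edges_ptr))  # 3
```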

append(data)[source]#

Append a single AtomicData to an existing Zarr store.

While this dispatch is available for convenience, we recommend amortizing I/O by packing multiple samples into a single write rather than appending one at a time. This can be done by passing either a Batch object or a list of AtomicData, which is automatically collated into a batch.

Parameters:
data (AtomicData) – Single atomic data to append.

Raises:

FileNotFoundError – If store does not exist.

append(self, data: list[nvalchemi.data.atomic_data.AtomicData]) → None[source]

Append a list of AtomicData to an existing Zarr store.

append(self, data: nvalchemi.data.batch.Batch) → None[source]

Append a Batch to an existing Zarr store.

This is the efficient bulk-append path. Since a Batch already has all tensors concatenated (node/edge level) or stacked (system level), each field is extended in a single I/O operation with no per-sample iteration.

Raises:

FileNotFoundError – If store does not exist.

Return type:

None
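What the bulk path amortizes can be sketched with plain Python lists in place of zarr arrays (all values hypothetical): appending a batch shifts the incoming pointer offsets by the current totals and extends each flat array once, instead of once per sample.

```python
# Existing store state: 2 samples, 5 atoms total (hypothetical).
atoms_ptr = [0, 2, 5]
atomic_numbers = [1, 8, 6, 1, 1]

# Incoming batch of 2 samples with 3 and 1 atoms; its tensors are
# already concatenated, so each field extends in one operation.
batch_counts = [3, 1]
batch_atomic_numbers = [8, 1, 1, 6]

for n in batch_counts:
    atoms_ptr.append(atoms_ptr[-1] + n)      # extend pointer array
atomic_numbers.extend(batch_atomic_numbers)  # single extend per field

print(atoms_ptr)             # [0, 2, 5, 8, 9]
print(len(atomic_numbers))   # 9
```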

defragment()[source]#

Rewrite store excluding deleted samples.

Rebuilds all arrays, pointer arrays, and resets all masks to True.

Return type:

None
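The effect of compaction can be sketched with plain lists (hypothetical values): keep only samples whose mask is True, rebuild each flat array from the surviving slices, recompute the pointer array from the surviving counts, and reset the masks.

```python
from itertools import accumulate

atoms_ptr = [0, 2, 7, 10]
atomic_numbers = [1, 8, 0, 0, 0, 0, 0, 8, 1, 1]  # sample 1 zeroed out
samples_mask = [True, False, True]               # sample 1 soft-deleted

keep = [i for i, ok in enumerate(samples_mask) if ok]
new_flat = [x for i in keep
            for x in atomic_numbers[atoms_ptr[i]:atoms_ptr[i + 1]]]
new_counts = [atoms_ptr[i + 1] - atoms_ptr[i] for i in keep]
new_ptr = [0] + list(accumulate(new_counts))
new_mask = [True] * len(keep)    # masks reset to True

print(new_ptr)    # [0, 2, 5]
print(new_flat)   # [1, 8, 8, 1, 1]
```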

delete(indices)[source]#

Soft-delete samples by index.

Sets masks to False and zeros out data slices in core/ and custom/. Pointer arrays are NOT modified.

Parameters:

indices (list[int] | torch.Tensor) – Sample indices to delete.

Return type:

None
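Soft deletion leaves the layout intact, which the following sketch illustrates with plain lists (hypothetical values): the mask flips to False and the sample's data slice is zeroed, while the pointer array is untouched.

```python
atoms_ptr = [0, 2, 7, 10]             # NOT modified by delete
samples_mask = [True, True, True]
atomic_numbers = [1, 8, 6, 6, 1, 1, 1, 8, 1, 1]

def soft_delete(i):
    samples_mask[i] = False
    lo, hi = atoms_ptr[i], atoms_ptr[i + 1]
    atomic_numbers[lo:hi] = [0] * (hi - lo)   # zero the data slice

soft_delete(1)
print(samples_mask)    # [True, False, True]
print(atomic_numbers)  # [1, 8, 0, 0, 0, 0, 0, 8, 1, 1]
print(atoms_ptr)       # unchanged: [0, 2, 7, 10]
```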

write(data)[source]#

Write a single AtomicData to a new Zarr store.

Parameters:

data (AtomicData)

Return type:

None

write(self, data: list[nvalchemi.data.atomic_data.AtomicData]) → None[source]

Write a list of AtomicData to a new Zarr store.

Parameters:

data (list[AtomicData])

Return type:

None

write(self, data: nvalchemi.data.batch.Batch) → None[source]

Write a Batch to a new Zarr store.

This is the efficient bulk-write path. Since a Batch already has all tensors concatenated (node/edge level) or stacked (system level), each field is written to zarr in a single I/O operation with no per-sample iteration.

Parameters:

data (Batch)

Raises:
  • FileExistsError – If store already exists.

  • ValueError – If batch is empty.

Return type:

None