(zarr_compression_guide)=
# Zarr Compression Tuning

Zarr stores are the primary persistence format for atomic simulation data in the toolkit. Configuring compression and chunking correctly can reduce disk usage by 2–4× and significantly improve I/O throughput for data pipelines. This guide covers the configuration options, codec trade-offs, and practical recipes for common workloads.

## Quick start

The simplest way to enable compression is to pass a {py:class}`~nvalchemi.data.datapipes.ZarrWriteConfig` when creating a writer or sink:

```python
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(compressors=(ZstdCodec(level=3),)),
)
writer = AtomicDataZarrWriter("/data/example.zarr", config=config)
```

For dynamics trajectories, pass the same config to {py:class}`~nvalchemi.dynamics.sinks.ZarrData`:

```python
from nvalchemi.dynamics.sinks import ZarrData

sink = ZarrData("/tmp/trajectory.zarr", config=config)
```

```{tip}
The configuration classes are Pydantic models, so you do not need to import and construct them explicitly: you can pass a `dict` with the same structure and keys, and it will be validated against the configuration classes under the hood. Constructing the classes explicitly is still helpful with modern IDEs and language servers, which can then surface required arguments, defaults, and so on.
```

## Configuration hierarchy

The toolkit organises Zarr arrays into three logical groups:

| Group | Contents | Default compression |
|-------|----------|---------------------|
| `meta` | Pointer arrays (`atoms_ptr`, `edges_ptr`), validity mask | None |
| `core` | Positions, forces, energy, atomic numbers, cell, pbc | None |
| `custom` | User-added arrays via `AtomicData.custom` | None |

{py:class}`~nvalchemi.data.datapipes.ZarrWriteConfig` lets you set a different {py:class}`~nvalchemi.data.datapipes.ZarrArrayConfig` for each group:

```python
config = ZarrWriteConfig(
    meta=ZarrArrayConfig(...),    # metadata arrays
    core=ZarrArrayConfig(...),    # core physics arrays
    custom=ZarrArrayConfig(...),  # user-added arrays
)
```

### Field overrides

For fine-grained control, `field_overrides` takes precedence over group defaults. Resolution order:

```text
field_overrides["positions"]  → if present, use this
        ↓ (not found)
core (group default)          → if present, use this
        ↓ (not configured)
Zarr defaults (Zstd level 0)
```

```{tip}
Use `field_overrides` when a single array has different access patterns from its group — for example, if positions need fast random access while other core arrays are read sequentially.
```

## Codec comparison

Zarr v3 supports pluggable codecs via the `zarr.abc.codec.Codec` interface. The toolkit has been tested with the following:

| Codec | Class | Strengths | Weaknesses | Typical use |
|-------|-------|-----------|------------|-------------|
| Zstd | `zarr.codecs.ZstdCodec` | Good ratio, fast decompress | Moderate compress speed | General purpose, sequential data |
| Blosc/LZ4 | `zarr.codecs.BloscCodec(cname="lz4")` | Very fast compress+decompress | Lower ratio | Trajectories, real-time I/O |
| Blosc/Zstd | `zarr.codecs.BloscCodec(cname="zstd")` | Blosc multithreading + Zstd ratio | Slightly more complex | Large arrays, parallel writes |
| Gzip | `zarr.codecs.GzipCodec` | Universal compatibility | Slow | Archival, interop |
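
For quick experimentation, each row of the table maps to a codec instance that can be dropped into a `ZarrArrayConfig`. The snippet below is illustrative; the levels shown are reasonable starting points, not toolkit defaults:

```python
from zarr.codecs import BloscCodec, GzipCodec, ZstdCodec

zstd = ZstdCodec(level=3)                        # general purpose: good ratio, fast decompress
blosc_lz4 = BloscCodec(cname="lz4")              # trajectories, real-time I/O: fastest
blosc_zstd = BloscCodec(cname="zstd", clevel=5)  # large arrays: Blosc blocking + Zstd ratio
gzip = GzipCodec(level=5)                        # archival, interop: universal but slow

# Any of these plugs into a group config or a field override:
# ZarrArrayConfig(compressors=(zstd,))
```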
```{note}
Compression level controls the ratio/speed trade-off. Higher levels yield better compression but slower writes. For Zstd, level 3 is a good default; levels 5–9 improve the ratio modestly at the cost of write throughput. For LZ4, the level parameter has little effect: speed is consistently high.
```

### Blosc multithreading

`BloscCodec` can use multiple threads internally, which helps when compressing large chunks. In Zarr v3 the thread count is not a codec parameter; it is configured globally through numcodecs, so set it once before writing if your workload benefits from parallel compression:

```python
import numcodecs.blosc
from zarr.codecs import BloscCodec

numcodecs.blosc.set_nthreads(4)  # global Blosc thread pool
compressor = BloscCodec(cname="zstd", clevel=5)
```

## Chunk size tuning

The `chunk_size` parameter in {py:class}`~nvalchemi.data.datapipes.ZarrArrayConfig` controls the chunk length along **dimension 0** of the stored array. Other dimensions use the full extent.

Because atom-level fields (positions, forces, atomic_numbers) are stored **concatenated** along the atom axis — not per structure — dimension 0 is the total-atoms axis, not the number of structures.

### Target chunk size

The Zarr documentation recommends chunks of **at least 1 MB uncompressed** for good throughput, particularly when using Blosc. Smaller chunks increase per-chunk overhead (metadata, system calls, compression dictionary resets). Larger chunks reduce the number of I/O operations for sequential reads but increase **read amplification** for random access — reading a single 50-atom structure (600 bytes of positions) from a 1 MB chunk wastes 99.9% of the decompressed data.

| Access pattern | Recommended chunk target | Rationale |
|----------------|--------------------------|-----------|
| Sequential DataLoader | 1–4 MB | Amortises overhead across many samples |
| Trajectory capture (append, then sequential read) | 1 MB | Balances write latency and read throughput |
| Random access (visualisation, single-sample lookup) | 64–256 KB | Limits read amplification |

```{note}
Zarr v3 supports **sharding**, which decouples the read unit (chunk) from the storage unit (shard). With sharding you can have small chunks for fine-grained random access grouped into large shards for filesystem efficiency. Set ``shard_size`` on {py:class}`~nvalchemi.data.datapipes.ZarrArrayConfig` to enable it — the shard size must be a multiple of the chunk size.
```

### Back-of-the-envelope formula

For a stored array whose rows have `trailing_dims` trailing dimensions and dtype size `d` bytes:

$$
\text{bytes\_per\_row} = d \times \prod(\text{trailing\_dims})
$$

$$
\text{chunk\_size} = \left\lfloor \frac{\text{target\_bytes}}{\text{bytes\_per\_row}} \right\rfloor
$$

The following table gives concrete values for common arrays:

| Array | Trailing dims | Dtype | Bytes/row | chunk_size (1 MB) | chunk_size (4 MB) |
|-------|---------------|-------|-----------|-------------------|-------------------|
| positions `[V, 3]` | 3 | float32 | 12 | 83,333 | 333,333 |
| forces `[V, 3]` | 3 | float32 | 12 | 83,333 | 333,333 |
| atomic_numbers `[V]` | 1 | int64 | 8 | 125,000 | 500,000 |
| energy `[B]` | 1 | float64 | 8 | 125,000 | 500,000 |
| cell `[B, 3, 3]` | 9 | float32 | 36 | 27,777 | 111,111 |
| neighbor_list `[E, 2]` | 2 | int64 | 16 | 62,500 | 250,000 |
| shifts `[E, 3]` | 3 | float32 | 12 | 83,333 | 333,333 |

**Example: positions (float32, shape [V, 3]), 1 MB target**

$$
\text{bytes\_per\_row} = 3 \times 4 = 12 \text{ bytes}
$$

$$
\text{chunk\_size} = \left\lfloor \frac{1{,}000{,}000}{12} \right\rfloor = 83{,}333
$$

**Example: energy (float64, shape [B]), 1 MB target**

$$
\text{bytes\_per\_row} = 1 \times 8 = 8 \text{ bytes}
$$

$$
\text{chunk\_size} = \left\lfloor \frac{1{,}000{,}000}{8} \right\rfloor = 125{,}000
$$
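
The arithmetic is easy to script for arrays not listed above. A minimal standalone sketch (the helper name `chunk_rows` is ours, not a toolkit API):

```python
import math

def chunk_rows(target_bytes: int, dtype_size: int, trailing_dims: tuple[int, ...] = ()) -> int:
    """Rows per chunk for a given uncompressed chunk-size target."""
    bytes_per_row = dtype_size * math.prod(trailing_dims)  # prod(()) == 1 for 1-D arrays
    return target_bytes // bytes_per_row

print(chunk_rows(1_000_000, 4, (3,)))    # positions [V, 3] float32 -> 83333
print(chunk_rows(1_000_000, 8))          # energy [B] float64 -> 125000
print(chunk_rows(1_000_000, 4, (3, 3)))  # cell [B, 3, 3] float32 -> 27777
print(chunk_rows(4_000_000, 8, (2,)))    # neighbor_list [E, 2] int64 -> 250000
```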
### Read amplification

When reading a single structure by index, the reader fetches the slice `positions[atoms_ptr[i]:atoms_ptr[i+1], :]` — typically ~50 rows (600 bytes). With large chunks, most of the decompressed data is discarded:

| chunk_size | Chunk bytes (positions) | Amplification (50-atom read) |
|------------|------------------------|------------------------------|
| 333,333 | 4 MB | 6,667× |
| 83,333 | 1 MB | 1,667× |
| 10,000 | 120 KB | 200× |

For purely sequential workloads (e.g. a sequential DataLoader) amplification does not matter — every row is consumed. For random-access workloads, prefer smaller chunks or consider field overrides for frequently accessed arrays.

```{warning}
Atom-level fields (positions, forces, atomic_numbers) are stored as **concatenated** arrays of shape `[V_total, ...]` where `V_total` is the sum of atoms across all structures. The `chunk_size` parameter controls the number of **rows** in each chunk, not the number of structures. System-level fields (energy, cell, pbc) have one row per structure, so `chunk_size` directly equals the number of structures per chunk.
```

## Storage estimation

The tables below assume 50 atoms per structure on average with ~200 edges (a typical cutoff-based neighbour list). Edge arrays dominate storage; many workflows recompute edges at load time via neighbour lists and omit them from the store.
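
The per-array breakdown below can be approximated in a few lines. This is a rough standalone sketch under the same assumptions; the helper name `store_size_mb` and the flat per-structure metadata estimate are ours, not toolkit API:

```python
def store_size_mb(n_structures: int, avg_atoms: float = 50, avg_edges: float = 200,
                  with_edges: bool = True) -> float:
    """Rough uncompressed size in MB for the arrays tabulated below."""
    v = n_structures * avg_atoms               # total atom rows
    e = n_structures * avg_edges               # total edge rows
    node = v * (12 + 12 + 8)                   # positions, forces (f32 [., 3]), atomic_numbers (i64)
    system = n_structures * (8 + 36 + 3 + 36 + 36 + 12)  # energy, cell, pbc, stress, virial, dipole
    edges = e * (16 + 12) if with_edges else 0           # neighbor_list (i64 [., 2]), shifts (f32 [., 3])
    meta = n_structures * 270                  # ~27 MB per 100k structures (pointers, masks)
    return (node + system + edges + meta) / 1e6

print(store_size_mb(100_000))                    # ~760 MB, matching the table below
print(store_size_mb(100_000, with_edges=False))  # ~200 MB
```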
### Per-array breakdown (100k structures)

| Array | Shape | Dtype | Uncompressed |
|-------|-------|-------|-------------|
| positions | [5M, 3] | float32 | 60 MB |
| forces | [5M, 3] | float32 | 60 MB |
| atomic_numbers | [5M] | int64 | 40 MB |
| energy | [100k] | float64 | 0.8 MB |
| cell | [100k, 3, 3] | float32 | 3.6 MB |
| pbc | [100k, 3] | bool | 0.3 MB |
| stress | [100k, 3, 3] | float32 | 3.6 MB |
| virial | [100k, 3, 3] | float32 | 3.6 MB |
| dipole | [100k, 3] | float32 | 1.2 MB |
| neighbor_list | [20M, 2] | int64 | 320 MB |
| shifts | [20M, 3] | float32 | 240 MB |
| metadata (ptrs, masks) | — | mixed | 27 MB |
| **Total (with edges)** | | | **760 MB** |
| **Total (without edges)** | | | **200 MB** |

### Scaling by dataset size

| Component | 100k | 1M | 10M |
|-----------|------|-----|------|
| Node + system core | 173 MB | 1.7 GB | 17 GB |
| Edge arrays | 560 MB | 5.6 GB | 56 GB |
| Metadata | 27 MB | 267 MB | 2.7 GB |
| **Total (with edges)** | **760 MB** | **7.6 GB** | **76 GB** |
| **Total (without edges)** | **200 MB** | **2.0 GB** | **20 GB** |

### With compression

| Codec | Typical ratio | 100k | 1M | 10M |
|-------|---------------|------|-----|------|
| Zstd (level 3) | 2–4× | 190–380 MB | 1.9–3.8 GB | 19–38 GB |
| LZ4 | 1.5–2.5× | 300–510 MB | 3.0–5.1 GB | 30–51 GB |

```{note}
Actual ratios depend heavily on data characteristics. Smooth MD trajectories (correlated frames) compress 4–6×; random equilibrium structures compress 2–3×. Integer arrays (atomic numbers, pointers) often compress 5–10× due to repetition. The estimates above include edge arrays; without edges, divide by ~3.8.

The [I/O benchmark tool](io_benchmark_section) uses purely random tensors, so its measured ratios (~1.75× Zstd, ~1.63× LZ4) represent a worst case. Real molecular data will compress significantly better.
```

### File count

Without sharding, each chunk becomes a separate file on local stores. A Zarr store also contains one `zarr.json` metadata file per array and per group, so the **total file count** across the whole store is the sum of chunk files for every array plus metadata files (~20 for a typical store). The table below shows **chunk files per array** for the positions array (`[V_total, 3]` float32), which is representative of other atom-level arrays:

| chunk_size | 100k (V = 5M) | 1M (V = 50M) | 10M (V = 500M) |
|------------|--------------|--------------|----------------|
| 83,333 (1 MB) | 61 | 601 | 6,001 |
| 10,000 (120 KB) | 500 | 5,000 | 50,000 |

A typical store has ~10 chunked arrays, so **multiply by ~10** for total chunk files, then add ~20 metadata files. At 100k systems with `chunk_size=10,000`, the benchmark tool reports **~4,500 total files**; at 100k with `chunk_size=83,333`, it reports **~690 total files**.

**With sharding** (`shard_size=500,000`, `chunk_size=10,000`), the same 100k-system store drops to **~160 total files** — a 28× reduction — because each shard file bundles 50 chunks.

Filesystem metadata overhead becomes significant above ~10,000 files per array. If you need small chunks for random access at scale, enable sharding with ``shard_size`` or use a cloud object store (S3, GCS via `FsspecStore`).
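
The file-count arithmetic above is straightforward to reproduce when planning a layout. A minimal sketch (the helper name `chunk_files` is ours):

```python
import math

def chunk_files(total_rows: int, chunk_size: int, shard_size: int | None = None) -> int:
    """Storage objects for one array: chunk files, or shard files when sharding is on."""
    unit = shard_size if shard_size is not None else chunk_size
    return math.ceil(total_rows / unit)

print(chunk_files(5_000_000, 83_333))           # positions at 100k systems -> 61
print(chunk_files(5_000_000, 10_000))           # -> 500
print(chunk_files(5_000_000, 10_000, 500_000))  # sharded -> 10
```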
## Recipes

### Recipe 1: Sequential dataset (best compression)

Prioritise disk space over write speed. Use Zstd at a moderate level with large chunks (~1 MB per chunk) for sequential reads.

```python
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=5),),
        chunk_size=100_000,  # ~1.2 MB chunks for positions [V,3] f32
    ),
)
writer = AtomicDataZarrWriter("/data/example.zarr", config=config)
```

### Recipe 2: Dynamics trajectory (fast I/O)

Prioritise write throughput for real-time trajectory capture. Use LZ4 with moderate chunks (~120 KB) to balance write latency and random-access readback.

```python
from nvalchemi.dynamics.sinks import ZarrData
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from zarr.codecs import BloscCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(BloscCodec(cname="lz4"),),
        chunk_size=10_000,  # ~120 KB chunks for positions [V,3] f32
    ),
)
sink = ZarrData("/tmp/trajectory.zarr", config=config)
```

### Recipe 3: Per-field override (mixed access patterns)

Use Zstd for most arrays but LZ4 with smaller chunks for positions (frequently accessed for visualisation or neighbour list rebuilds).

```python
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec, BloscCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        chunk_size=100_000,  # ~1.2 MB chunks for sequential core arrays
    ),
    field_overrides={
        "positions": ZarrArrayConfig(
            compressors=(BloscCodec(cname="lz4"),),
            chunk_size=50_000,  # ~600 KB: smaller for random access
        ),
    },
)
writer = AtomicDataZarrWriter("/data/mixed.zarr", config=config)
```

### Recipe 4: Sparse data (skip empty chunks)

For datasets with many optional fields or sparse validity masks, disable writing empty chunks to save space.

```python
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        write_empty_chunks=False,
    ),
    custom=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        write_empty_chunks=False,
    ),
)
writer = AtomicDataZarrWriter("/data/sparse.zarr", config=config)
```

```{tip}
`write_empty_chunks=False` is especially useful for custom arrays that are only populated for a subset of structures. Zarr will skip writing chunks that contain only the fill value, reducing both disk usage and write time.
```

### Recipe 5: Sharded storage (large datasets)

For datasets with millions of structures, use sharding to keep small read-friendly chunks while reducing the number of storage objects. The shard size must be a multiple of the chunk size.

```python
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        chunk_size=10_000,   # 120 KB chunks for random access
        shard_size=500_000,  # 50 chunks per shard, ~6 MB per shard
    ),
)
writer = AtomicDataZarrWriter("/data/large.zarr", config=config)
```

```{tip}
Sharding is particularly valuable on local filesystems with large datasets where file count can become a bottleneck. With 10M structures and ``chunk_size=10,000``, you would get 50,000 files per array without sharding versus only 1,000 shard files with ``shard_size=500,000``.
```
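
As the quick-start tip notes, any of these configs can also be expressed as a plain `dict` validated by the Pydantic models. A sketch of Recipe 5 in that style, assuming the dict keys mirror the model fields and codec instances are accepted as in the class form:

```python
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

# Keys mirror ZarrWriteConfig / ZarrArrayConfig fields; validation happens on construction.
config = {
    "core": {
        "compressors": (ZstdCodec(level=3),),
        "chunk_size": 10_000,
        "shard_size": 500_000,
    },
}
writer = AtomicDataZarrWriter("/data/large.zarr", config=config)
```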
(io_benchmark_section)=
## I/O benchmark tool

The toolkit ships a command-line benchmark for measuring Zarr write throughput and compression ratios on synthetic data. Use it to validate configuration choices before committing to a production workflow.

### Running the benchmark

```bash
# Install (if not already)
$ uv sync --all-extras

# Basic: compare codec overhead across dataset sizes
$ nvalchemi-io-test -n 1000 -n 10000 --codec zstd --level 3 --chunk-size 83333

# Fast codec with smaller chunks for trajectory-style workloads
$ nvalchemi-io-test -n 1000 -n 10000 --codec lz4 --chunk-size 10000

# Larger molecules with edge-specific chunking
$ nvalchemi-io-test -n 1000 -n 10000 --min-atoms 100 --max-atoms 500 \
    --codec zstd --chunk-size 83333 --edge-chunk-size 62500

# With sharding enabled
$ nvalchemi-io-test -n 1000 -n 10000 --codec zstd \
    --chunk-size 1000 --shard-size 10000
```

Key options:

| Option | Default | Description |
|--------|---------|-------------|
| `-n` / `--num-systems` | 1000 10000 100000 | Dataset sizes to benchmark (repeatable) |
| `--min-atoms` | 10 | Minimum atoms per structure |
| `--max-atoms` | 100 | Maximum atoms per structure |
| `--codec` | — | Compression codec: `zstd`, `lz4`, or `blosc-zstd` |
| `--level` | 3 | Compression level |
| `--chunk-size` | — | Chunk size for node/system arrays |
| `--shard-size` | — | Shard size for node/system arrays |
| `--edge-chunk-size` | — | Chunk size for edge arrays (neighbor_list, shifts) |
| `--edge-shard-size` | — | Shard size for edge arrays |

### Example output

**Small molecules (10–100 atoms), Zstd level 3, 1 MB chunks:**

```text
nvalchemi Zarr I/O benchmark
atoms=10-100 config=zstd L3, chunk=83,333, edge_chunk=62,500
Pre-computed: 100,000 systems, 5,504,449 total atoms (avg 55.0), 11,062,584 total edges (avg 110.6)
Estimated uncompressed: 484.9 MB

              Zarr I/O Benchmark — zstd L3, chunk=83,333, edge_chunk=62,500
              Avg     Avg        Raw       Disk                       Write
  Systems   atoms   edges       size       size   Ratio    Files       time   Systems/s
─────────────────────────────────────────────────────────────────────────────
    1,000      56     115     4.8 MB     2.8 MB   1.74x       36      0.14s       7,282
   10,000      55     112    47.1 MB    27.0 MB   1.75x       96      0.48s      20,736
  100,000      55     111   467.5 MB   267.7 MB   1.75x      691      4.66s      21,471
```

**Small molecules, LZ4, 120 KB chunks (trajectory-optimised):**

```text
nvalchemi Zarr I/O benchmark
atoms=10-100 config=lz4 L3, chunk=10,000, edge_chunk=10,000

              Zarr I/O Benchmark — lz4 L3, chunk=10,000, edge_chunk=10,000
              Avg     Avg        Raw       Disk                       Write
  Systems   atoms   edges       size       size   Ratio    Files       time   Systems/s
─────────────────────────────────────────────────────────────────────────────
    1,000      56     115     4.8 MB     3.0 MB   1.61x       76      0.12s       8,207
   10,000      55     112    47.1 MB    28.9 MB   1.63x      480      0.80s      12,446
  100,000      55     111   467.5 MB   287.5 MB   1.63x    4,509      8.10s      12,341
```

**Small molecules, sharded (chunk=10,000 inside shard=500,000):**

```text
nvalchemi Zarr I/O benchmark
atoms=10-100 config=chunk=10,000, shard=500,000, edge_chunk=10,000, edge_shard=500,000

    Zarr I/O Benchmark — chunk=10,000, shard=500,000, edge_chunk=10,000, edge_shard=500,000
              Avg     Avg        Raw       Disk                       Write
  Systems   atoms   edges       size       size   Ratio    Files       time   Systems/s
─────────────────────────────────────────────────────────────────────────────
    1,000      56     115     4.8 MB     2.8 MB   1.73x       34      0.14s       6,998
   10,000      55     112    47.1 MB    27.0 MB   1.74x       46      0.63s      15,930
  100,000      55     111   467.5 MB   268.2 MB   1.74x      158      6.46s      15,471
```

Note the dramatic file count reduction with sharding: **4,509 → 158** at 100k systems with the same chunk size, while disk size remains essentially unchanged. (The sharded run uses Zarr's default Zstd level 0 rather than LZ4, hence the slightly different compression ratio; see the note below.)
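
The exact invocations behind these outputs are not shown, but the comparison can be reproduced along these lines, using flags from the options table above:

```bash
# Unsharded LZ4 run (4,509 files at 100k systems)
$ nvalchemi-io-test -n 100000 --codec lz4 --chunk-size 10000

# Sharded run with the same chunk size (158 files)
$ nvalchemi-io-test -n 100000 --chunk-size 10000 --shard-size 500000
```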
**Larger molecules (100–500 atoms), Zstd with edge-specific chunks:**

```text
nvalchemi Zarr I/O benchmark
atoms=100-500 config=zstd L3, chunk=83,333, edge_chunk=62,500
Pre-computed: 10,000 systems, 3,016,657 total atoms (avg 301.7), 6,073,861 total edges (avg 607.4)
Estimated uncompressed: 263.5 MB

              Zarr I/O Benchmark — zstd L3, chunk=83,333, edge_chunk=62,500
              Avg     Avg        Raw       Disk                       Write
  Systems   atoms   edges       size       size   Ratio    Files       time   Systems/s
─────────────────────────────────────────────────────────────────────────────
    1,000     303     615    25.7 MB    15.4 MB   1.67x       66      0.21s       4,737
   10,000     302     607   254.7 MB   152.9 MB   1.67x      394      1.23s       8,138
```

```{note}
Zarr v3 defaults to ``ZstdCodec(level=0)`` when no compressor is specified. The "Raw size" column reflects the data as written by the toolkit (including Zarr metadata overhead), so even runs without an explicit ``--codec`` flag will show some compression.
```

```{tip}
Run with ``--min-atoms`` and ``--max-atoms`` matching your actual dataset to get realistic estimates. The benchmark uses uniform random atom counts; real-world distributions may be skewed toward smaller or larger structures.
```

## See also

- **Data pipeline**: The [Data Loading Pipeline](datapipes_guide) guide covers readers, datasets, and dataloaders.
- **Dynamics sinks**: The [Data Sinks](dynamics_sinks_guide) guide explains how `ZarrData` integrates with snapshot hooks.
- **API reference**:
  - {py:class}`~nvalchemi.data.datapipes.ZarrWriteConfig`
  - {py:class}`~nvalchemi.data.datapipes.ZarrArrayConfig`
  - {py:class}`~nvalchemi.data.datapipes.backends.zarr.AtomicDataZarrWriter`
  - {py:class}`~nvalchemi.dynamics.sinks.ZarrData`