Zarr Compression Tuning#

Zarr stores are the primary persistence format for atomic simulation data in the toolkit. Configuring compression and chunking correctly can reduce disk usage by 2–4× and significantly improve I/O throughput for data pipelines. This guide covers the configuration options, codec trade-offs, and practical recipes for common workloads.

Quick start#

The simplest way to enable compression is to pass a ZarrWriteConfig when creating a writer or sink:

from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(compressors=(ZstdCodec(level=3),)),
)
writer = AtomicDataZarrWriter("/data/example.zarr", config=config)

For dynamics trajectories, pass the same config to ZarrData:

from nvalchemi.dynamics.sinks import ZarrData

sink = ZarrData("/tmp/trajectory.zarr", config=config)

Tip

The configuration classes are Pydantic models, so you do not need to import and construct them explicitly: a dict with the same structure and keys is accepted and validated against the configuration classes under the hood. Constructing the classes explicitly is still useful with modern IDEs and language servers, which can surface required arguments, defaults, and so on.
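
For example, the quick-start configuration can be passed as a plain dict. A minimal sketch, assuming codec instances are accepted inside the dict just as in the class form:

from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

writer = AtomicDataZarrWriter(
    "/data/example.zarr",
    config={"core": {"compressors": (ZstdCodec(level=3),)}},
)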

Configuration hierarchy#

The toolkit organises Zarr arrays into three logical groups:

| Group | Contents | Default compression |
|--------|----------|---------------------|
| meta | Pointer arrays (atoms_ptr, edges_ptr), validity mask | None |
| core | Positions, forces, energy, atomic numbers, cell, pbc | None |
| custom | User-added arrays via AtomicData.custom | None |

ZarrWriteConfig lets you set different ZarrArrayConfig for each group:

config = ZarrWriteConfig(
    meta=ZarrArrayConfig(...),    # metadata arrays
    core=ZarrArrayConfig(...),    # core physics arrays
    custom=ZarrArrayConfig(...),  # user-added arrays
)

Field overrides#

For fine-grained control, field_overrides takes precedence over group defaults. Resolution order:

field_overrides["positions"]   →   if present, use this
         ↓ (not found)
core (group default)           →   if present, use this
         ↓ (not configured)
no compression (Zarr defaults)

Tip

Use field_overrides when a single array has different access patterns from its group — for example, if positions need fast random access while other core arrays are read sequentially.
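
The precedence is easy to express as a lookup. The sketch below is illustrative pseudologic rather than the toolkit's actual resolver (resolve_array_config is a hypothetical name, and only the core group is shown):

def resolve_array_config(field, config):
    # Hypothetical helper mirroring the documented precedence.
    override = config.field_overrides.get(field)
    if override is not None:
        return override        # 1. per-field override wins
    if config.core is not None:
        return config.core     # 2. group default (core shown here)
    return None                # 3. None -> Zarr's own defaults apply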

Codec comparison#

Zarr v3 supports pluggable codecs via the zarr.abc.codec.Codec interface. The toolkit has been tested with the following:

| Codec | Class | Strengths | Weaknesses | Typical use |
|-------|-------|-----------|------------|-------------|
| Zstd | zarr.codecs.ZstdCodec | Good ratio, fast decompress | Moderate compress speed | General purpose, sequential data |
| Blosc/LZ4 | zarr.codecs.BloscCodec(cname="lz4") | Very fast compress+decompress | Lower ratio | Trajectories, real-time I/O |
| Blosc/Zstd | zarr.codecs.BloscCodec(cname="zstd") | Blosc multithreading + Zstd ratio | Slightly more complex | Large arrays, parallel writes |
| Gzip | zarr.codecs.GzipCodec | Universal compatibility | Slow | Archival, interop |
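
All four are constructed directly from zarr.codecs. The levels below are reasonable starting points rather than toolkit defaults:

from zarr.codecs import BloscCodec, GzipCodec, ZstdCodec

zstd = ZstdCodec(level=3)                         # general purpose
blosc_lz4 = BloscCodec(cname="lz4")               # fastest; trajectories
blosc_zstd = BloscCodec(cname="zstd", clevel=5)   # ratio plus Blosc threading
gzip = GzipCodec(level=5)                         # archival, interop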

Note

Compression level controls the ratio/speed trade-off. Higher levels yield better compression but slower writes. For Zstd, level 3 is a good default; levels 5–9 improve the ratio modestly at the cost of write throughput. For LZ4, the level parameter has little effect, as speed is consistently high.

Blosc multithreading#

BloscCodec can use multiple threads internally, which helps when compressing large chunks. The thread count is not a constructor argument on zarr.codecs.BloscCodec; instead, Blosc's global thread pool is sized through numcodecs, which backs Zarr's Blosc support:

from numcodecs import blosc
from zarr.codecs import BloscCodec

blosc.set_nthreads(4)  # size the global Blosc thread pool

compressor = BloscCodec(cname="zstd", clevel=5)

Chunk size tuning#

The chunk_size parameter in ZarrArrayConfig controls the chunk length along dimension 0 of the stored array. Other dimensions use the full extent. Because atom-level fields (positions, forces, atomic_numbers) are stored concatenated along the atom axis — not per structure — dimension 0 is the total-atoms axis, not the number of structures.

Target chunk size#

The Zarr documentation recommends chunks of at least 1 MB uncompressed for good throughput, particularly when using Blosc. Smaller chunks increase per-chunk overhead (metadata, system calls, compression dictionary resets). Larger chunks reduce the number of I/O operations for sequential reads but increase read amplification for random access — reading a single 50-atom structure (600 bytes of positions) from a 1 MB chunk wastes 99.9 % of the decompressed data.

| Access pattern | Recommended chunk target | Rationale |
|----------------|--------------------------|-----------|
| Sequential DataLoader | 1–4 MB | Amortises overhead across many samples |
| Trajectory capture (append, then sequential read) | 1 MB | Balances write latency and read throughput |
| Random access (visualisation, single-sample lookup) | 64–256 KB | Limits read amplification |

Note

Zarr v3 supports sharding, which decouples the read unit (chunk) from the storage unit (shard). With sharding you can have small chunks for fine-grained random access grouped into large shards for filesystem efficiency. Set shard_size on ZarrArrayConfig to enable it — the shard size must be a multiple of the chunk size.

Back-of-the-envelope formula#

For a stored array whose rows have trailing_dims trailing dimensions and dtype size d bytes:

\[ \text{bytes\_per\_row} = d \times \prod(\text{trailing\_dims}) \]
\[ \text{chunk\_size} = \left\lfloor \frac{\text{target\_bytes}}{\text{bytes\_per\_row}} \right\rfloor \]

The following table gives concrete values for common arrays:

| Array | Trailing dims | Dtype | Bytes/row | chunk_size (1 MB) | chunk_size (4 MB) |
|-------|---------------|-------|-----------|-------------------|-------------------|
| positions [V, 3] | 3 | float32 | 12 | 83,333 | 333,333 |
| forces [V, 3] | 3 | float32 | 12 | 83,333 | 333,333 |
| atomic_numbers [V] | 1 | int64 | 8 | 125,000 | 500,000 |
| energy [B] | 1 | float64 | 8 | 125,000 | 500,000 |
| cell [B, 3, 3] | 9 | float32 | 36 | 27,777 | 111,111 |
| neighbor_list [E, 2] | 2 | int64 | 16 | 62,500 | 250,000 |
| shifts [E, 3] | 3 | float32 | 12 | 83,333 | 333,333 |

Example: positions (float32, shape [V, 3]), 1 MB target

\[ \text{bytes\_per\_row} = 3 \times 4 = 12 \text{ bytes} \]
\[ \text{chunk\_size} = \left\lfloor \frac{1{,}000{,}000}{12} \right\rfloor = 83{,}333 \]

Example: energy (float64, shape [B]), 1 MB target

\[ \text{bytes\_per\_row} = 1 \times 8 = 8 \text{ bytes} \]
\[ \text{chunk\_size} = \left\lfloor \frac{1{,}000{,}000}{8} \right\rfloor = 125{,}000 \]
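
The same arithmetic fits in a few lines for scripting; chunk_rows below is an illustrative helper, not part of the toolkit:

import math

import numpy as np

def chunk_rows(target_bytes, trailing_dims, dtype):
    """Rows per chunk that fit an uncompressed target size."""
    bytes_per_row = np.dtype(dtype).itemsize * math.prod(trailing_dims)
    return target_bytes // bytes_per_row

chunk_rows(1_000_000, (3,), np.float32)  # positions -> 83333
chunk_rows(1_000_000, (), np.float64)    # energy    -> 125000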

Read amplification#

When reading a single structure by index, the reader fetches the slice positions[atoms_ptr[i]:atoms_ptr[i+1], :] — typically ~50 rows (600 bytes). With large chunks, most of the decompressed data is discarded:

| chunk_size | Chunk bytes (positions) | Amplification (50-atom read) |
|------------|-------------------------|------------------------------|
| 333,333 | 4 MB | 6,667× |
| 83,333 | 1 MB | 1,667× |
| 10,000 | 120 KB | 200× |

For purely sequential workloads (sequential DataLoader) amplification does not matter — every row is consumed. For random-access workloads, prefer smaller chunks or consider field overrides for frequently accessed arrays.
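
Concretely, the per-structure read is an ordinary Zarr slice. A minimal sketch, assuming arrays live under the meta and core groups described earlier (exact paths may differ in your store):

import zarr

root = zarr.open_group("/data/example.zarr", mode="r")
atoms_ptr = root["meta/atoms_ptr"][:]   # structure boundaries
i = 42                                  # arbitrary structure index
positions_i = root["core/positions"][atoms_ptr[i]:atoms_ptr[i + 1], :]
# Every chunk the slice touches is decompressed in full; hence the
# amplification figures above.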

Warning

Atom-level fields (positions, forces, atomic_numbers) are stored as concatenated arrays of shape [V_total, ...] where V_total is the sum of atoms across all structures. The chunk_size parameter controls the number of rows in each chunk, not the number of structures. System-level fields (energy, cell, pbc) have one row per structure, so chunk_size directly equals the number of structures per chunk.
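
A practical consequence is that one chunk_size rarely suits both axes; system-level fields can be given their own setting via field_overrides. A sketch using values from the chunk-size table above:

from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig

config = ZarrWriteConfig(
    core=ZarrArrayConfig(chunk_size=83_333),  # rows on the atom axis (~1 MB)
    field_overrides={
        # energy has one row per structure: 125,000 rows of float64 ~ 1 MB
        "energy": ZarrArrayConfig(chunk_size=125_000),
    },
)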

Storage estimation#

The tables below assume 50 atoms per structure on average with ~200 edges (a typical cutoff-based neighbour list). Edge arrays dominate storage; many workflows recompute edges at load time via neighbour lists and omit them from the store.
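
The table entries follow from simple arithmetic; two representative rows worked out:

n_structures = 100_000
n_atoms = 50 * n_structures   # 5M rows on the atom axis
n_edges = 200 * n_structures  # 20M rows on the edge axis

positions_mb = n_atoms * 3 * 4 / 1e6      # float32 [V, 3] -> 60.0 MB
neighbor_list_mb = n_edges * 2 * 8 / 1e6  # int64   [E, 2] -> 320.0 MB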

Per-array breakdown (100k structures)#

| Array | Shape | Dtype | Uncompressed |
|-------|-------|-------|--------------|
| positions | [5M, 3] | float32 | 60 MB |
| forces | [5M, 3] | float32 | 60 MB |
| atomic_numbers | [5M] | int64 | 40 MB |
| energy | [100k] | float64 | 0.8 MB |
| cell | [100k, 3, 3] | float32 | 3.6 MB |
| pbc | [100k, 3] | bool | 0.3 MB |
| stress | [100k, 3, 3] | float32 | 3.6 MB |
| virial | [100k, 3, 3] | float32 | 3.6 MB |
| dipole | [100k, 3] | float32 | 1.2 MB |
| neighbor_list | [20M, 2] | int64 | 320 MB |
| shifts | [20M, 3] | float32 | 240 MB |
| metadata (ptrs, masks) | mixed | — | 27 MB |
| Total (with edges) | | | 760 MB |
| Total (without edges) | | | 200 MB |

Scaling by dataset size#

| Component | 100k | 1M | 10M |
|-----------|------|-----|-----|
| Node + system core | 173 MB | 1.7 GB | 17 GB |
| Edge arrays | 560 MB | 5.6 GB | 56 GB |
| Metadata | 27 MB | 267 MB | 2.7 GB |
| Total (with edges) | 760 MB | 7.6 GB | 76 GB |
| Total (without edges) | 200 MB | 2.0 GB | 20 GB |

With compression#

| Codec | Typical ratio | 100k | 1M | 10M |
|-------|---------------|------|-----|-----|
| Zstd (level 3) | 2–4× | 190–380 MB | 1.9–3.8 GB | 19–38 GB |
| LZ4 | 1.5–2.5× | 300–510 MB | 3.0–5.1 GB | 30–51 GB |

Note

Actual ratios depend heavily on data characteristics. Smooth MD trajectories (correlated frames) compress 4–6×; random equilibrium structures compress 2–3×. Integer arrays (atomic numbers, pointers) often compress 5–10× due to repetition. The estimates above include edge arrays; without edges, divide by ~3.8.

The I/O benchmark tool uses purely random tensors, so its measured ratios (~1.75× Zstd, ~1.63× LZ4) represent a worst case. Real molecular data will compress significantly better.

File count#

Without sharding, each chunk becomes a separate file on local stores. A Zarr store also contains one zarr.json metadata file per array and per group, so the total file count across the whole store is the sum of chunk files for every array plus metadata files (~20 for a typical store).

The table below shows chunk files per array for the positions array ([V_total, 3] float32), which is representative of other atom-level arrays:

| chunk_size | 100k (V = 5M) | 1M (V = 50M) | 10M (V = 500M) |
|------------|---------------|--------------|----------------|
| 83,333 (1 MB) | 61 | 601 | 6,001 |
| 10,000 (120 KB) | 500 | 5,000 | 50,000 |

A typical store has ~10 chunked arrays, so multiply by ~10 for total chunk files, then add ~20 metadata files. At 100k systems with chunk_size=10,000, the benchmark reports ~4,500 total files; at 100k with chunk_size=83,333, it reports ~690 total files.

With sharding (shard_size=500,000, chunk_size=10,000), the same 100k-system store drops to ~160 total files — a 28× reduction — because each shard file bundles 50 chunks.

Filesystem metadata overhead becomes significant above ~10,000 files per array. If you need small chunks for random access at scale, enable sharding with shard_size or use a cloud object store (S3, GCS via FsspecStore).
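
The chunk-file counts above are simply ceil(V / chunk_size); a quick check that reproduces the table:

import math

V = 5_000_000            # total atoms at 100k structures
math.ceil(V / 83_333)    # -> 61 chunk files (1 MB chunks)
math.ceil(V / 10_000)    # -> 500 chunk files (120 KB chunks)
math.ceil(V / 500_000)   # -> 10 shard files with shard_size=500_000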

Recipes#

Recipe 1: Sequential dataset (best compression)#

Prioritise disk space over write speed. Use Zstd at a moderate level with large chunks (~1 MB per chunk) for sequential reads.

from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=5),),
        chunk_size=100_000,   # ~1.2 MB chunks for positions [V,3] f32
    ),
)
writer = AtomicDataZarrWriter("/data/example.zarr", config=config)

Recipe 2: Dynamics trajectory (fast I/O)#

Prioritise write throughput for real-time trajectory capture. Use LZ4 with moderate chunks (~120 KB) to balance write latency and random-access readback.

from nvalchemi.dynamics.sinks import ZarrData
from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from zarr.codecs import BloscCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(BloscCodec(cname="lz4"),),
        chunk_size=10_000,    # ~120 KB chunks for positions [V,3] f32
    ),
)
sink = ZarrData("/tmp/trajectory.zarr", config=config)

Recipe 3: Per-field override (mixed access patterns)#

Use Zstd for most arrays but LZ4 with smaller chunks for positions (frequently accessed for visualisation or neighbour list rebuilds).

from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec, BloscCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        chunk_size=100_000,   # 1 MB chunks for sequential core arrays
    ),
    field_overrides={
        "positions": ZarrArrayConfig(
            compressors=(BloscCodec(cname="lz4"),),
            chunk_size=50_000,  # ~600 KB: smaller for random access
        ),
    },
)
writer = AtomicDataZarrWriter("/data/mixed.zarr", config=config)

Recipe 4: Sparse data (skip empty chunks)#

For datasets with many optional fields or sparse validity masks, disable writing empty chunks to save space.

from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        write_empty_chunks=False,
    ),
    custom=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        write_empty_chunks=False,
    ),
)
writer = AtomicDataZarrWriter("/data/sparse.zarr", config=config)

Tip

write_empty_chunks=False is especially useful for custom arrays that are only populated for a subset of structures. Zarr will skip writing chunks that contain only the fill value, reducing both disk usage and write time.

Recipe 5: Sharded storage (large datasets)#

For datasets with millions of structures, use sharding to keep small read-friendly chunks while reducing the number of storage objects. The shard size must be a multiple of the chunk size.

from nvalchemi.data.datapipes import ZarrWriteConfig, ZarrArrayConfig
from nvalchemi.data.datapipes.backends.zarr import AtomicDataZarrWriter
from zarr.codecs import ZstdCodec

config = ZarrWriteConfig(
    core=ZarrArrayConfig(
        compressors=(ZstdCodec(level=3),),
        chunk_size=10_000,     # 120 KB chunks for random access
        shard_size=500_000,    # 50 chunks per shard, ~6 MB per shard
    ),
)
writer = AtomicDataZarrWriter("/data/large.zarr", config=config)

Tip

Sharding is particularly valuable on local filesystems with large datasets where file count can become a bottleneck. With 10M structures and chunk_size=10,000, you would get 50,000 files per array without sharding versus only 1,000 shard files with shard_size=500,000.

I/O benchmark tool#

The toolkit ships a command-line benchmark for measuring Zarr write throughput and compression ratios on synthetic data. Use it to validate configuration choices before committing to a production workflow.

Running the benchmark#

# Install (if not already)
$ uv sync --all-extras

# Basic: compare codec overhead across dataset sizes
$ nvalchemi-io-test -n 1000 -n 10000 --codec zstd --level 3 --chunk-size 83333

# Fast codec with smaller chunks for trajectory-style workloads
$ nvalchemi-io-test -n 1000 -n 10000 --codec lz4 --chunk-size 10000

# Larger molecules with edge-specific chunking
$ nvalchemi-io-test -n 1000 -n 10000 --min-atoms 100 --max-atoms 500 \
    --codec zstd --chunk-size 83333 --edge-chunk-size 62500

# With sharding enabled
$ nvalchemi-io-test -n 1000 -n 10000 --codec zstd \
    --chunk-size 1000 --shard-size 10000

Key options:

| Option | Default | Description |
|--------|---------|-------------|
| -n / --num-systems | 1000 10000 100000 | Dataset sizes to benchmark (repeatable) |
| --min-atoms | 10 | Minimum atoms per structure |
| --max-atoms | 100 | Maximum atoms per structure |
| --codec | — | Compression codec: zstd, lz4, or blosc-zstd |
| --level | 3 | Compression level |
| --chunk-size | — | Chunk size for node/system arrays |
| --shard-size | — | Shard size for node/system arrays |
| --edge-chunk-size | — | Chunk size for edge arrays (neighbor_list, shifts) |
| --edge-shard-size | — | Shard size for edge arrays |

Example output#

Small molecules (10–100 atoms), Zstd level 3, 1 MB chunks:

nvalchemi Zarr I/O benchmark  atoms=10-100  config=zstd L3, chunk=83,333,
                                             edge_chunk=62,500
Pre-computed: 100,000 systems, 5,504,449 total atoms (avg 55.0),
              11,062,584 total edges (avg 110.6)
Estimated uncompressed: 484.9 MB

      Zarr I/O Benchmark — zstd L3, chunk=83,333, edge_chunk=62,500

              Avg     Avg      Raw     Disk                   Write
  Systems   atoms   edges     size     size  Ratio  Files      time  Systems/s
 ─────────────────────────────────────────────────────────────────────────────
    1,000      56     115   4.8 MB   2.8 MB  1.74x     36     0.14s     7,282
   10,000      55     112  47.1 MB  27.0 MB  1.75x     96     0.48s    20,736
  100,000      55     111 467.5 MB 267.7 MB  1.75x    691     4.66s    21,471

Small molecules, LZ4, 120 KB chunks (trajectory-optimised):

nvalchemi Zarr I/O benchmark  atoms=10-100  config=lz4 L3, chunk=10,000,
                                             edge_chunk=10,000

      Zarr I/O Benchmark — lz4 L3, chunk=10,000, edge_chunk=10,000

              Avg     Avg      Raw     Disk                   Write
  Systems   atoms   edges     size     size  Ratio  Files      time  Systems/s
 ─────────────────────────────────────────────────────────────────────────────
    1,000      56     115   4.8 MB   3.0 MB  1.61x     76     0.12s     8,207
   10,000      55     112  47.1 MB  28.9 MB  1.63x    480     0.80s    12,446
  100,000      55     111 467.5 MB 287.5 MB  1.63x  4,509     8.10s    12,341

Small molecules, sharded (chunk=10,000 inside shard=500,000):

nvalchemi Zarr I/O benchmark  atoms=10-100  config=chunk=10,000,
    shard=500,000, edge_chunk=10,000, edge_shard=500,000

      Zarr I/O Benchmark — chunk=10,000, shard=500,000,
                            edge_chunk=10,000, edge_shard=500,000

              Avg     Avg      Raw     Disk                   Write
  Systems   atoms   edges     size     size  Ratio  Files      time  Systems/s
 ─────────────────────────────────────────────────────────────────────────────
    1,000      56     115   4.8 MB   2.8 MB  1.73x     34     0.14s     6,998
   10,000      55     112  47.1 MB  27.0 MB  1.74x     46     0.63s    15,930
  100,000      55     111 467.5 MB 268.2 MB  1.74x    158     6.46s    15,471

Note the dramatic file count reduction with sharding: 4,509 → 158 at 100k systems with the same chunk size, while compression ratio and disk size remain essentially unchanged.

Larger molecules (100–500 atoms), Zstd with edge-specific chunks:

nvalchemi Zarr I/O benchmark  atoms=100-500  config=zstd L3, chunk=83,333,
                                              edge_chunk=62,500
Pre-computed: 10,000 systems, 3,016,657 total atoms (avg 301.7),
              6,073,861 total edges (avg 607.4)
Estimated uncompressed: 263.5 MB

      Zarr I/O Benchmark — zstd L3, chunk=83,333, edge_chunk=62,500

              Avg     Avg      Raw     Disk                   Write
  Systems   atoms   edges     size     size  Ratio  Files      time  Systems/s
 ─────────────────────────────────────────────────────────────────────────────
    1,000     303     615  25.7 MB  15.4 MB  1.67x     66     0.21s     4,737
   10,000     302     607 254.7 MB 152.9 MB  1.67x    394     1.23s     8,138

Note

Zarr v3 defaults to ZstdCodec(level=0) when no compressor is specified. The “Raw size” column reflects the data as written by the toolkit (including Zarr metadata overhead), so even runs without an explicit --codec flag will show some compression.

Tip

Run with --min-atoms and --max-atoms matching your actual dataset to get realistic estimates. The benchmark uses uniform random atom counts; real-world distributions may be skewed toward smaller or larger structures.

See also#