Benchmarks#

Performance benchmarks for ALCHEMI Toolkit-Ops GPU kernels, covering both the Torch and JAX backends. These pages present scaling behaviour across system sizes, constant-workload sweeps, and batch-scaling modes for each module.

Methodology#

Timing#

Backend	Mechanism	Details
Torch	CUDA Events	`start_event.record()` / `end_event.record()` around the kernel call, then `torch.cuda.synchronize()`. No host-side sync inside the timed loop.
JAX	Wall-clock + `block_until_ready`	Warmups block to exclude tracing/compilation. The standard timed loop dispatches all calls and blocks on the final result; allocation-heavy rows may retry with a block after each call. Rows record the exact `timing_method`.

The shipped benchmark YAMLs use the publication timing protocol by default: three warmup iterations and ten timed iterations on the full benchmark grid. Both values are configurable via parameters.warmup_runs and parameters.timing_runs, or via the shared --warmup-runs / --timing-runs CLI flags. The reported value is the mean steady-state time per call after warmup/compile/load work is excluded.
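
As an illustration of the protocol above, the following is a minimal wall-clock sketch of "warm up, then average over timed calls". It is not the suite's implementation: `mean_time_per_call` and its `sync` callback are hypothetical names introduced here, and the Torch backend is timed with CUDA events rather than a host clock.

```python
import time
from typing import Callable, Optional


def mean_time_per_call(
    fn: Callable[[], object],
    warmup_runs: int = 3,
    timing_runs: int = 10,
    sync: Optional[Callable[[object], None]] = None,
) -> float:
    """Return the mean steady-state time per call, in seconds.

    `sync` is a hypothetical hook that blocks until a result is ready
    (for a JAX array this could call `block_until_ready()` on it).
    """
    if timing_runs < 1:
        raise ValueError("timing_runs must be at least 1")

    # Block on every warmup so one-off costs (tracing, compilation,
    # lazy loading) are finished before the timed loop starts.
    for _ in range(warmup_runs):
        result = fn()
        if sync is not None:
            sync(result)

    # Timed loop: dispatch all calls, then block once on the final result.
    start = time.perf_counter()
    result = None
    for _ in range(timing_runs):
        result = fn()
    if sync is not None:
        sync(result)
    elapsed = time.perf_counter() - start

    return elapsed / timing_runs
```

The defaults mirror the three warmups and ten timed iterations described above. Blocking after every timed call instead of once at the end would correspond to the per-call variant mentioned for allocation-heavy JAX rows.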

Scaling modes#

Each benchmark page presents three scaling modes:

System Size Scaling : Fix batch size = 1, sweep the number of atoms per system.

Constant Workload : Target a total atom count while sweeping batch size (fewer atoms per system, more systems). Discrete system sizes and integer batch counts mean the actual total can be below the target, especially for CsCl supercells. Plots use the actual total_atoms recorded in each CSV row.

Batch Scaling : Fix atoms per system, sweep batch size. Reveals how well the kernel amortises fixed overhead across systems.
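
The three modes can be sketched as small grid generators. This is an illustrative sketch only: the function names and dictionary keys are hypothetical, and it assumes CsCl supercells built from `n × n × n` repeats of a 2-atom unit cell, with the constant-workload batch size taken as the largest integer that keeps the total at or below the target.

```python
CSCL_ATOMS_PER_CELL = 2  # CsCl/B2: one Cs and one Cl per unit cell


def cscl_atoms(n: int) -> int:
    """Atom count of an n x n x n CsCl supercell."""
    return CSCL_ATOMS_PER_CELL * n ** 3


def _case(batch_size: int, atoms_per_system: int) -> dict:
    return {
        "batch_size": batch_size,
        "atoms_per_system": atoms_per_system,
        "total_atoms": batch_size * atoms_per_system,
    }


def system_size_scaling(repeats):
    """Batch size fixed at 1; sweep atoms per system."""
    return [_case(1, cscl_atoms(n)) for n in repeats]


def constant_workload(repeats, target_total):
    """Aim for target_total atoms; smaller systems get larger batches."""
    cases = []
    for n in repeats:
        atoms = cscl_atoms(n)
        batch = target_total // atoms  # integer batch count, rounds down
        if batch >= 1:  # skip systems larger than the target itself
            cases.append(_case(batch, atoms))
    return cases


def batch_scaling(n, batch_sizes):
    """Atoms per system fixed; sweep batch size."""
    return [_case(b, cscl_atoms(n)) for b in batch_sizes]
```

With a target of 10,000 atoms, repeats of 3, 4, and 5 give actual totals of 9,990, 9,984, and 10,000 atoms, which shows why the plotted x-coordinate should come from the recorded total rather than the target.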

Systems#

System	Source	Description
CsCl	Programmatic supercell builder in `benchmarks.suite_systems`	CsCl/B2 primitive-cubic supercells (2 atoms/unit cell, cubic 4.119 Å).
NH₃	PDB files generated by `benchmarks/nh3/generate_pbc_pdbs.sh` (requires packmol)	Periodic ammonia boxes at powers-of-two atom counts.
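
For orientation, a CsCl/B2 supercell can be generated in a few lines. The sketch below is not the builder in `benchmarks.suite_systems`; `build_cscl_supercell` is a hypothetical helper that assumes the conventional B2 basis (Cs at the cell corner, Cl at the body centre) and the 4.119 Å cubic lattice constant from the table above.

```python
from itertools import product

CSCL_LATTICE_ANGSTROM = 4.119  # cubic lattice constant from the table above


def build_cscl_supercell(n: int, a: float = CSCL_LATTICE_ANGSTROM):
    """Return (symbols, positions, box_length) for an n x n x n supercell.

    Positions are Cartesian coordinates in angstroms inside a cubic
    periodic box of edge length n * a.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Fractional coordinates of the two-atom B2 basis.
    basis = (("Cs", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5)))

    symbols, positions = [], []
    for i, j, k in product(range(n), repeat=3):
        for symbol, (fx, fy, fz) in basis:
            symbols.append(symbol)
            positions.append(((i + fx) * a, (j + fy) * a, (k + fz) * a))

    return symbols, positions, n * a
```

Because the atom count is `2 * n**3`, the available CsCl system sizes are sparse (2, 16, 54, 128, 250, ...), unlike the powers-of-two NH₃ boxes.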

Hardware#

GPU	VRAM	Architecture
NVIDIA H100 80GB HBM3	80 GB HBM3	Hopper

Warning

These results are intended to be indicative only: your actual performance may vary depending on the atomic system topology, software and hardware configuration. We encourage users to benchmark on their own systems of interest.

Note

Precision: the neighbor-list (NL) and DFT-D3 benchmarks use float32; the electrostatics benchmarks (PME / Ewald) use float64; charges are always float64.

Missing plotted points are rows where the benchmark wrote success=False with a concrete error_type such as OutOfMemoryError, SkippedByPolicy, SkippedAfterOOM, or UnsupportedConfiguration. An UnsupportedConfiguration row records a public API capability limit or launch-safety guard; the suite does not substitute another method in that case. Hardware-dependent OOM limits are not hardcoded into the suite. Failed and skipped rows remain in the CSVs, are not drawn as data, and break plotted lines at their coordinates.

Successful JAX rows timed with per-call blocking are shown as separate line segments from batch-dispatch timings; the CSV timing_method column records the boundary used for every point.
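
The line-breaking rules in this note can be sketched as a function that groups CSV rows into plottable segments. This is an assumption-laden illustration rather than the suite's plotting code: `plot_segments` and the `mean_time_ms` column are hypothetical, the `success` field is assumed to be stored as the text `True` / `False`, and rows are assumed to be pre-sorted along the x-axis.

```python
def plot_segments(rows, x_key="total_atoms", y_key="mean_time_ms"):
    """Group rows into line segments of (x, y) points.

    A segment ends at any failed or skipped row and whenever the
    timing_method changes between consecutive successful rows.
    """
    segments = []
    current = []
    current_method = None

    for row in rows:
        succeeded = str(row["success"]).strip().lower() == "true"
        if not succeeded:
            # Failed/skipped rows are never drawn and break the line.
            if current:
                segments.append(current)
            current = []
            current_method = None
            continue

        method = row.get("timing_method")
        if current and method != current_method:
            # Different timing boundary: start a separate segment.
            segments.append(current)
            current = []
        current_method = method
        current.append((float(row[x_key]), float(row[y_key])))

    if current:
        segments.append(current)
    return segments
```

Rows read with `csv.DictReader` can be passed in directly; each returned segment would then be drawn as its own line so that no line connects across a failed coordinate or a change of timing method.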

Running the full suite#

The three modules (neighbor list, DFT-D3, and electrostatics) share a single entry point, benchmarks.benchmark_suite, which loads each per-module benchmark_config.yaml and dispatches in-process. Use a fresh scratch directory for each reportable run, then promote a complete, reviewed CSV set into the documentation only after validation.

Before publishing plots, the suite checks case coverage and confirms that the run, source, hardware, inputs, and runtime settings belong together. Backend reruns replace only that backend’s rows inside the same validated run rather than mixing unrelated CSV snapshots.
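
The backend-rerun rule can be pictured as a guarded row replacement. The sketch below is only an illustration of that idea, not the suite's validation logic: `replace_backend_rows` and the `run_id` and `backend` field names are hypothetical.

```python
def replace_backend_rows(existing, rerun, backend):
    """Replace one backend's rows with rerun rows from the same run.

    Raises ValueError if the rows do not all share a single run_id or if
    the rerun contains rows for a different backend.
    """
    run_ids = {row["run_id"] for row in existing}
    run_ids |= {row["run_id"] for row in rerun}
    if len(run_ids) != 1:
        raise ValueError(f"rows span multiple runs or none: {sorted(run_ids)}")

    if any(row["backend"] != backend for row in rerun):
        raise ValueError(f"rerun contains rows not from backend {backend!r}")

    # Keep every other backend untouched; swap in the rerun rows.
    kept = [row for row in existing if row["backend"] != backend]
    return kept + list(rerun)
```

Refusing to merge rows with differing run identifiers is what prevents a plot from silently mixing unrelated CSV snapshots.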

Published snapshot#

The bundled H100 snapshot uses the full YAML grids with three warmups and ten timed iterations. Failed cases remain explicit CSV rows; plots leave those coordinates empty and do not connect lines across them. Collection may be sharded across compatible GPUs; CSV timings always represent steady-state calls and exclude queueing, compilation, and environment-setup time.

See Benchmark Results for the current run ID, row and failure counts, CSV schema, scratch-only cluster setup, sharding rules, and the detailed reproducibility runbook. Module-specific commands remain on the neighbor-list, DFT-D3, and electrostatics pages.