<!-- markdownlint-disable MD014 -->

(dynamics_guide)=

# Dynamics: Optimization and Molecular Dynamics

The dynamics module provides a unified framework for running geometry optimizations
and molecular dynamics simulations on GPU. All simulation types share a common
execution loop --- hooks, model evaluation, convergence checking --- so you learn the
pattern once and apply it to any integrator.

```{tip}
It is important to keep in mind that ``nvalchemi`` follows a batch-first principle:
users should think and reason about dynamics workflows with multiple structures
simultaneously, as opposed to individual structures being processed sequentially.
```

## The execution loop

Every simulation is driven by {py:class}`~nvalchemi.dynamics.base.BaseDynamics`,
which defines a single `step()` that all integrators and optimizers follow. The
loop is broken into discrete stages, enumerated by
{py:class}`~nvalchemi.dynamics.base.HookStageEnum`:

| Stage | When it fires |
|-------|---------------|
| `BEFORE_STEP` | At the very beginning of a step, before any operations |
| `BEFORE_PRE_UPDATE` | Just before the integrator's first half-step |
| `AFTER_PRE_UPDATE` | After the first half-step completes |
| `BEFORE_COMPUTE` | Just before the model forward pass |
| `AFTER_COMPUTE` | After the model forward pass completes |
| `BEFORE_POST_UPDATE` | Just before the integrator's second half-step |
| `AFTER_POST_UPDATE` | After the second half-step completes |
| `AFTER_STEP` | At the very end of a step, after all operations |
| `ON_CONVERGE` | When a convergence criterion is met |

A single call to `step()` proceeds through these stages in order:

1. **BEFORE_STEP** hooks fire.
2. `pre_update(batch)` --- the integrator's first half-step (e.g. update velocities
   by half a timestep), bracketed by BEFORE/AFTER_PRE_UPDATE hooks.
3. `compute(batch)` --- the wrapped ML model evaluates forces (and stresses, if
   needed), bracketed by BEFORE/AFTER_COMPUTE hooks.
4. `post_update(batch)` --- the integrator's second half-step (e.g. complete the
   velocity update with the new forces), bracketed by BEFORE/AFTER_POST_UPDATE hooks.
5. **AFTER_STEP** hooks fire (convergence checks, logging, ...).
6. Convergence is evaluated: converged systems fire **ON_CONVERGE** hooks and (in
   multi-stage pipelines) migrate to the next stage.

`run(batch, n_steps)` calls `step()` in a loop until all systems converge or
`n_steps` is reached. Every hook declares which
{py:class}`~nvalchemi.dynamics.base.HookStageEnum` stage it should fire at and at
what frequency, so you have fine-grained control over when callbacks execute.

## Using dynamics as a context manager

All dynamics objects (optimizers, integrators, fused stages) support Python's
context manager protocol. The `with` block manages a dedicated
`torch.cuda.Stream` for the simulation and ensures hooks are properly opened and
closed:

```python
from nvalchemi.dynamics import FIRE, ConvergenceHook

with FIRE(model=model, dt=0.1, n_steps=500, hooks=[ConvergenceHook(fmax=0.05)]) as opt:
    relaxed = opt.run(batch)
```

When you call `run()` without a `with` block, hook setup and teardown happen
automatically inside `run()`. The context manager form is useful when you need to
call `step()` manually or interleave dynamics with other operations while keeping
hook state (e.g. open log files) alive.

## Multi-stage pipelines with FusedStage

Real workflows often chain multiple simulation phases: relax a structure, then run
MD at increasing temperatures, then relax again. The
{py:class}`~nvalchemi.dynamics.base.FusedStage` abstraction lets you compose stages
with the `+` operator:

```python
from nvalchemi.dynamics import FIRE, NVTLangevin, ConvergenceHook

relax = FIRE(model=model, dt=0.1, n_steps=200, hooks=[ConvergenceHook(fmax=0.05)])
md = NVTLangevin(model=model, dt=1.0, temperature=300.0, n_steps=5000)

pipeline = relax + md
with pipeline:
  pipeline.run(batch)
```

Systems start in the first stage (relaxation). As each system converges, it
automatically migrates to the next stage (MD). Different systems can be in different
stages simultaneously --- the batch is partitioned internally, and a single model
forward pass is shared across all active systems regardless of which stage they
belong to.

### Compiling with `torch.compile`

{py:class}`~nvalchemi.dynamics.base.FusedStage` can compile its entire step function
with `torch.compile` to reduce Python overhead and enable kernel fusion. Call
{py:meth}`~nvalchemi.dynamics.base.FusedStage.compile` after composing stages:

```python
fused = (relax + md).compile(fullgraph=True)
with fused:
    fused.run(batch)
```

`compile()` wraps the internal `_step_impl` method --- which includes hook dispatch,
masked sub-stage updates, and the shared model forward pass --- in a single compiled
graph. It returns the same instance, so you can chain it fluently.

You can also defer compilation by passing `compile_step=True` at construction time.
In that case, `torch.compile` is invoked lazily when the context manager is entered:

```python
fused = relax + md  # compile_step inherited from sub-stages or set explicitly
with fused:         # compilation happens here
    fused.run(batch)
```

Any keyword arguments accepted by `torch.compile` (e.g. `fullgraph`, `mode`,
`backend`) can be passed to `.compile()` or stored via `compile_kwargs` at
construction.

```{note}
Not all hooks are graph-break-free under `fullgraph=True`. Hooks that perform
Python-side control flow (e.g. logging, I/O) will introduce graph breaks. If you
need an unbroken graph, ensure your hooks are written with torch-compatible
operations only.
```

## Distributed pipelines

When a workflow needs more than one GPU --- for example, relaxing structures on one
device and running MD on another --- the
{py:class}`~nvalchemi.dynamics.base.DistributedPipeline` distributes stages across
ranks. Where `+` fuses stages onto a single GPU, the `|` operator (or a `stages`
dictionary) assigns one stage per rank and wires up inter-rank communication
automatically.

### Configuring a pipeline

Each rank owns a {py:class}`~nvalchemi.dynamics.base.BaseDynamics` (or
{py:class}`~nvalchemi.dynamics.base.FusedStage`) instance. Stages are collected in a
dictionary keyed by global rank and handed to
{py:class}`~nvalchemi.dynamics.base.DistributedPipeline`:

```python
from nvalchemi.dynamics import FIRE, NVTLangevin, DistributedPipeline
from nvalchemi.dynamics.base import BufferConfig

buffer_cfg = BufferConfig(num_systems=4, num_nodes=50, num_edges=0)

stages = {
    0: FIRE(model=model, buffer_config=buffer_cfg, ...),        # upstream — relaxation
    1: NVTLangevin(model=model, buffer_config=buffer_cfg, ...),  # downstream — MD
}

pipeline = DistributedPipeline(stages=stages, backend="nccl")
with pipeline:
    pipeline.run()
```

By default, `setup()` (called automatically by the context manager) sorts stages by
rank and wires `prior_rank` / `next_rank` between adjacent stages as a simple linear
chain. For more sophisticated topologies --- such as multiple independent
sub-pipelines running in the same job --- set `prior_rank` and `next_rank` explicitly
on each stage:

```python
stages = {
    # Sub-pipeline A: rank 0 → rank 1
    0: FIRE(model=model, buffer_config=buffer_cfg, prior_rank=None, next_rank=1, ...),
    1: NVTLangevin(model=model, buffer_config=buffer_cfg, prior_rank=0, next_rank=None, ...),
    # Sub-pipeline B: rank 2 → rank 3
    2: FIRE(model=model, buffer_config=buffer_cfg, prior_rank=None, next_rank=3, ...),
    3: NVTLangevin(model=model, buffer_config=buffer_cfg, prior_rank=2, next_rank=None, ...),
}
```

The first stage in each sub-pipeline typically owns a *sampler* that feeds new
structures into the chain; the last stage owns one or more *data sinks* that collect
converged results.

```{note}
Each rank currently communicates with at most one upstream and one downstream
neighbour (one-to-one topology). Fan-out (one-to-many) and fan-in (many-to-one)
patterns are planned for a future release.
```

### Sizing the buffer

NCCL point-to-point transfers require fixed-size tensors, so each communicating stage
pre-allocates a send buffer and a receive buffer whose dimensions are set by
{py:class}`~nvalchemi.dynamics.base.BufferConfig`. The three fields control how much
data a single transfer can carry:

| Field | What it controls |
|-------|------------------|
| `num_systems` | Maximum number of graphs (structures) per transfer. Determines throughput per step --- higher values move more data but consume more GPU memory. |
| `num_nodes` | Total atom capacity across all graphs in the buffer. Must be large enough for the worst-case combination of systems. For example, transferring up to 4 structures of at most 50 atoms each requires `num_nodes=200`. |
| `num_edges` | Total edge capacity. Set to **0** when the downstream model recomputes edges via its neighbor list (the common case). Only set a non-zero value if pre-computed edge attributes must be transferred. |

```python
from nvalchemi.dynamics.base import BufferConfig

# 4 structures, up to 200 atoms total, edges recomputed downstream
buffer_cfg = BufferConfig(num_systems=4, num_nodes=200, num_edges=0)
```

When the upstream stage has more converged samples than `num_systems` allows in a
single transfer, the excess stays in the active batch as a no-op until the next
step --- this is the back-pressure mechanism described below.

```{important}
Every pair of communicating stages **must** share an identical
{py:class}`~nvalchemi.dynamics.base.BufferConfig`.
`DistributedPipeline.setup()` validates this and raises an error on mismatch.
```

### Buffer synchronization

The diagram below shows how two adjacent ranks exchange data through pre-allocated
send and receive buffers during a single step. The upstream rank pushes converged
samples; the downstream rank pulls them into its active batch.

```{graphviz}
:caption: Buffer synchronization between two adjacent ranks in a DistributedPipeline.

digraph buffer_sync {
    rankdir=LR
    compound=true
    fontname="Helvetica"
    node [fontname="Helvetica" fontsize=11]
    edge [fontname="Helvetica" fontsize=10]

    subgraph cluster_upstream {
        label="Rank 0  (upstream)"
        style=rounded
        color="#4a90d9"
        fontcolor="#4a90d9"

        u_batch [label="active_batch" shape=box style=filled fillcolor="#dce6f1"]
        u_send  [label="send_buffer"  shape=box style=filled fillcolor="#f9e2ae"]
        u_sinks [label="sinks\n(overflow)" shape=box style=dashed]

        u_batch -> u_send [label="converged\nsamples" style=bold]
        u_batch -> u_sinks [label="excess\n(back-pressure)" style=dotted]
    }

    subgraph cluster_downstream {
        label="Rank 1  (downstream)"
        style=rounded
        color="#5bb35b"
        fontcolor="#5bb35b"

        d_recv  [label="recv_buffer"  shape=box style=filled fillcolor="#f9e2ae"]
        d_batch [label="active_batch" shape=box style=filled fillcolor="#dce6f1"]
        d_sinks [label="sinks\n(results)" shape=box style=dashed]

        d_recv -> d_batch [label="incoming\nsamples" style=bold]
        d_batch -> d_sinks [label="converged\nresults" style=bold]
        d_sinks -> d_batch [label="drain when\ncapacity available" style=dotted]
    }

    u_send -> d_recv [label="isend / irecv\n(NCCL)" style=bold color="#c0392b" fontcolor="#c0392b" penwidth=2]
}
```

A step proceeds as follows:

1. **Pre-step** --- The downstream rank zeros its receive buffer and posts an
   asynchronous `irecv` from its `prior_rank`. In `async_recv` mode (the default),
   the wait is deferred until later in the step; in `sync` mode it blocks
   immediately.
2. **Complete receive** --- The downstream rank waits on the pending receive,
   then routes incoming samples into its active batch (or overflow sinks if the
   batch is full).
3. **Step** --- Both ranks execute their respective integrator or optimizer on their
   active batches.
4. **Post-step** --- The upstream rank identifies converged samples, copies them into
   its send buffer (up to `BufferConfig` capacity), and issues an `isend`. An empty
   buffer is always sent to prevent deadlocks. The final stage routes converged
   samples to its sinks instead.

```{tip}
**Back-pressure**: when the send buffer is full, excess converged samples remain in
the upstream active batch as no-ops until buffer capacity opens up. This naturally
throttles fast producers without dropping data.
```

### Communication modes

The `comm_mode` parameter controls how aggressively communication overlaps with
computation:

| Mode | Behavior |
|------|----------|
| `sync` | Blocks on `irecv` immediately in the pre-step. Simplest to debug. |
| `async_recv` *(default)* | Posts `irecv` early, waits only when the data is needed. Overlaps receive with computation. |
| `fully_async` | Also defers `isend` completion to the next step's pre-step. Maximum overlap, highest throughput. |

### Launching

Distributed pipelines are launched with `torchrun` (or any `torch.distributed`
launcher):

```bash
torchrun --nproc_per_node=2 my_pipeline.py
```

`DistributedPipeline` calls `init_distributed()` on entry and coordinates
termination across ranks via an `all_reduce` on per-rank done flags.

```{seealso}
The {doc}`/examples/distributed/index` gallery contains end-to-end examples,
including multi-pipeline topologies and monitoring with persistent storage.
```

## What's next

```{toctree}
:maxdepth: 1

dynamics_simulations
dynamics_hooks
dynamics_sinks
```

- [Optimization and Integrators](dynamics_simulations) --- FIRE, NVE, NVT, NPT and
  their configuration.
- [Hooks](dynamics_hooks) --- the hook protocol, built-in hooks, and writing custom
  hooks.
- [Data Sinks](dynamics_sinks) --- recording trajectories and simulation results.

## See also

- **Examples**: ``02_dynamics_example.py`` demonstrates a complete relaxation and MD
  workflow.
- **API**: See the {py:mod}`nvalchemi.dynamics` module for the full reference,
  including the hook protocol and distributed pipeline documentation.
- **Data guide**: The [AtomicData and Batch](data_guide) guide covers the input data
  structures consumed by dynamics.