Checkpoint

class earth2studio.utils.checkpoint.Checkpoint(
name,
path=None,
mode='overwrite',
flush_interval=1,
history_size=None,
level=2,
rank=None,
world_size=None,
device=device(type='cpu'),
)

Catalog of restart checkpoints for a named inference run.

A checkpoint owns the on-disk catalog and commit directories for one logical inference run. Each committed catalog row stores generic workflow metadata and component state bound through bind_checkpoint_state().

Parameters:
  • name (str) – Name of the checkpoint store. Used in the default checkpoint path and in commit manifests.

  • path (str | Path | None, optional) – Directory where checkpoint files are stored. If None, checkpoints are written under the default Earth2Studio cache location, by default None.

  • mode (Literal["overwrite", "append"], optional) – Catalog update mode. "overwrite" keeps only the latest row; "append" preserves row history for repeated writes, by default "overwrite".

  • flush_interval (int | None, optional) – Number of workflow write calls between durable commits. Use None to keep updates pending until flush is called explicitly, by default 1.

  • history_size (int | None, optional) – Maximum number of rows to keep when mode="append". In overwrite mode this value is ignored and only the latest row is kept, by default None.

  • level (CheckpointLevel, optional) – Requested component logging level. 0 records workflow progress and explicit metadata only, 1 allows component state needed to restart workflow items such as ensemble members, and 2 allows component state needed to resume inside a forecast rollout, by default 2.

  • rank (int | None, optional) – Distributed rank for this process. If None, Earth2Studio attempts to detect the rank from PhysicsNeMo or common distributed environment variables, by default None.

  • world_size (int | None, optional) – Distributed world size. If None, Earth2Studio attempts to detect the world size from PhysicsNeMo or WORLD_SIZE, by default None.

  • device (str | torch.device, optional) – Device where components should stage live tensor state before it is serialized into the checkpoint, by default torch.device("cpu").
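The interaction between mode, history_size, and flush_interval can be illustrated with a small toy model. This is a sketch of the documented retention and commit semantics only, not Earth2Studio's implementation: the ToyCatalog class, its write and flush methods, and the choice that a commit stores only the most recent pending update are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import deque
from typing import Any, Literal


class ToyCatalog:
    """Illustrative stand-in for the documented semantics; not Earth2Studio code."""

    def __init__(
        self,
        mode: Literal["overwrite", "append"] = "overwrite",
        flush_interval: int | None = 1,
        history_size: int | None = None,
    ) -> None:
        # Overwrite mode ignores history_size and keeps a single latest row.
        # Append mode keeps up to history_size rows (unbounded when None).
        maxlen = 1 if mode == "overwrite" else history_size
        self.rows: deque[dict[str, Any]] = deque(maxlen=maxlen)
        self.flush_interval = flush_interval
        self._pending: dict[str, Any] | None = None
        self._writes_since_flush = 0

    def write(self, row: dict[str, Any]) -> None:
        # Assumption: only the most recent pending update survives to the commit.
        self._pending = row
        self._writes_since_flush += 1
        if (
            self.flush_interval is not None
            and self._writes_since_flush >= self.flush_interval
        ):
            self.flush()

    def flush(self) -> None:
        # Commit the pending update, if any, and reset the write counter.
        if self._pending is not None:
            self.rows.append(self._pending)
        self._pending = None
        self._writes_since_flush = 0
```

In this model, flush_interval=None leaves every write pending until flush() is called, which mirrors the explicit-flush behavior described for the parameter.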

property active: CheckpointSession | None

Active checkpoint selected from this catalog, if one is in scope.

property catalog: tuple[CheckpointEntry, ...]

Committed checkpoint entries for the current rank.

property rank_path: Path

Directory for this process's checkpoint writes.
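One way to picture how a per-process directory could follow from the rank is sketched below. It is a hedged illustration, not the library's logic: the helper names, the fallback to rank 0 and world size 1, and the rank_NNNN directory layout are all hypothetical. Only the RANK and WORLD_SIZE environment variables are real conventions (launchers such as torchrun set them), and the PhysicsNeMo detection path mentioned above is not modeled.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping


def detect_rank_and_world_size(
    env: Mapping[str, str] = os.environ,
) -> tuple[int, int]:
    """Read RANK and WORLD_SIZE, assuming a single process when they are unset."""
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return rank, world_size


def toy_rank_path(root: Path, name: str, rank: int) -> Path:
    """Build a per-rank directory so processes never write to the same files."""
    # Hypothetical layout; the real on-disk structure is not documented here.
    return root / name / f"rank_{rank:04d}"
```

Giving each rank its own directory is a common design for distributed checkpointing because it avoids cross-process file locking.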

refresh()

Refresh the checkpoint catalog from disk.

Return type:

None

select(row)

Select an existing checkpoint row by position.

The selected session can be used as a context manager. Bound component state and metadata are restored from the selected catalog row when the session becomes active.

Parameters:

row (int) – Positional row index. Negative indexing is supported.

Returns:

Session representing the selected catalog row.

Return type:

CheckpointSession

Raises:

IndexError – If no catalog row matches the selection.
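A typical restart pattern combines refresh(), catalog, and select(). The sketch below relies only on those documented names and is duck-typed so it runs without Earth2Studio installed. The select_latest helper, FakeCheckpoint, and FakeSession are hypothetical stand-ins, and treating row -1 as the most recent commit is an assumption about catalog ordering.

```python
from __future__ import annotations

from typing import Any


def select_latest(checkpoint: Any) -> Any:
    """Refresh the catalog and select its last row, or return None if empty."""
    checkpoint.refresh()
    try:
        return checkpoint.select(-1)
    except IndexError:
        # Documented behavior: select raises IndexError when no row matches.
        return None


class FakeSession:
    """Stand-in for a session that can be used as a context manager."""

    def __init__(self, entry: Any) -> None:
        self.entry = entry
        self.active = False

    def __enter__(self) -> "FakeSession":
        self.active = True
        return self

    def __exit__(self, *exc_info: Any) -> bool:
        self.active = False
        return False


class FakeCheckpoint:
    """Stand-in exposing only refresh(), catalog, and select()."""

    def __init__(self, rows: list[Any]) -> None:
        self._rows = list(rows)
        self.catalog: tuple[Any, ...] = ()

    def refresh(self) -> None:
        self.catalog = tuple(self._rows)

    def select(self, row: int) -> FakeSession:
        # Tuple indexing supports negative indices and raises IndexError.
        return FakeSession(self.catalog[row])


checkpoint = FakeCheckpoint(["row-0", "row-1"])
session = select_latest(checkpoint)
if session is not None:
    with session:
        pass  # run or resume the workflow here
```

Returning None for an empty catalog lets the caller fall back to a fresh run instead of handling IndexError at every call site.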