Checkpoint
- class earth2studio.utils.checkpoint.Checkpoint(name, path=None, mode='overwrite', flush_interval=1, history_size=None, level=2, rank=None, world_size=None, device=device(type='cpu'))
Catalog of restart checkpoints for a named inference run.
A checkpoint owns the on-disk catalog and commit directories for one logical inference run. Each committed catalog row stores generic workflow metadata and component state bound through bind_checkpoint_state().
- Parameters:
name (str) – Name of the checkpoint store. Used in the default checkpoint path and in commit manifests.
path (str | Path | None, optional) – Directory where checkpoint files are stored. If None, checkpoints are written under the default Earth2Studio cache location, by default None.
mode (Literal["overwrite", "append"], optional) – Catalog update mode. "overwrite" keeps only the latest row; "append" preserves row history for repeated writes, by default "overwrite".
flush_interval (int | None, optional) – Number of workflow write calls between durable commits. Use None to keep updates pending until flush is called explicitly, by default 1.
history_size (int | None, optional) – Maximum number of rows to keep when mode="append". In overwrite mode this is ignored and treated as one latest row, by default None.
level (CheckpointLevel, optional) – Requested component logging level. 0 records workflow progress and explicit metadata only, 1 allows component state needed to restart workflow items such as ensemble members, and 2 allows component state needed to resume inside a forecast rollout, by default 2.
rank (int | None, optional) – Distributed rank for this process. If None, Earth2Studio attempts to detect the rank from PhysicsNeMo or common distributed environment variables, by default None.
world_size (int | None, optional) – Distributed world size. If None, Earth2Studio attempts to detect the world size from PhysicsNeMo or WORLD_SIZE, by default None.
device (str | torch.device, optional) – Device where components should stage live tensor state before it is serialized into the checkpoint, by default torch.device("cpu").
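The interaction between mode, flush_interval, and history_size can be easier to follow with a toy model. The sketch below is not Earth2Studio code: ToyCatalog and its attributes are hypothetical names invented for illustration, rows are plain dictionaries held in memory, and it assumes that the oldest rows are the ones dropped once history_size is exceeded (the parameter description does not say which rows are discarded).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class ToyCatalog:
    """Hypothetical in-memory stand-in for a checkpoint catalog."""

    mode: Literal["overwrite", "append"] = "overwrite"
    flush_interval: int | None = 1
    history_size: int | None = None
    committed: list[dict[str, Any]] = field(default_factory=list)
    pending: list[dict[str, Any]] = field(default_factory=list)
    writes_since_flush: int = field(default=0, init=False)

    def write(self, row: dict[str, Any]) -> None:
        """Queue a row; commit once flush_interval writes have accumulated."""
        self.pending.append(row)
        self.writes_since_flush += 1
        if (
            self.flush_interval is not None
            and self.writes_since_flush >= self.flush_interval
        ):
            self.flush()

    def flush(self) -> None:
        """Move pending rows into the committed list, then apply retention."""
        self.committed.extend(self.pending)
        self.pending.clear()
        self.writes_since_flush = 0
        if self.mode == "overwrite":
            # Overwrite mode keeps only the latest row; history_size is ignored.
            self.committed = self.committed[-1:]
        elif self.history_size is not None:
            # Assumption: the oldest rows are dropped first.
            excess = len(self.committed) - self.history_size
            if excess > 0:
                del self.committed[:excess]


# Overwrite mode: three writes leave a single, latest row.
overwrite = ToyCatalog(mode="overwrite")
for step in range(3):
    overwrite.write({"step": step})

# Append mode with a bounded history: four writes leave the last two rows.
history = ToyCatalog(mode="append", history_size=2)
for step in range(4):
    history.write({"step": step})

# flush_interval=None: rows stay pending until flush() is called explicitly.
manual = ToyCatalog(mode="append", flush_interval=None)
manual.write({"step": 0})
pending_before_flush = len(manual.pending)
manual.flush()
```

In terms of the documented signature, the second case would correspond to constructing the checkpoint with mode="append" and history_size=2, and the third to flush_interval=None.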
- property active: CheckpointSession | None
Active checkpoint selected from this catalog, if one is in scope.
- property catalog: tuple[CheckpointEntry, ...]
Committed checkpoint entries for the current rank.
- property rank_path: Path
Directory for this process's checkpoint writes.
- select(row)
Select an existing checkpoint row by position.
The selected session can be used as a context manager. Bound component state and metadata are restored from the selected catalog row when the session becomes active.
- Parameters:
row (int) – Positional row index. Negative indexing is supported.
- Returns:
Session representing the selected catalog row.
- Return type:
- Raises:
IndexError – If no catalog row matches the selection.
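The positional lookup described for select can be sketched in a few lines. This is only an illustration of the documented behaviour (negative indices count from the end, an out-of-range index raises IndexError), not the library's implementation: select_row is a hypothetical helper, and the catalog here is a plain tuple of strings rather than CheckpointEntry objects.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_row(catalog: Sequence[T], row: int) -> T:
    """Return the entry at a positional index, allowing negative indices."""
    size = len(catalog)
    # Normalise negative indices the way Python sequences do: -1 is the last row.
    index = row + size if row < 0 else row
    if not 0 <= index < size:
        raise IndexError(
            f"row {row} is out of range for a catalog with {size} row(s)"
        )
    return catalog[index]


# Hypothetical catalog with three committed rows, oldest first.
catalog = ("step-0", "step-6", "step-12")
latest = select_row(catalog, -1)  # most recently committed row
first = select_row(catalog, 0)    # oldest retained row
```

Selecting row -1 is the natural way to pick up the most recent committed state, and an empty catalog raises IndexError for any index.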