cache#

Cache directory management and introspection for pipeline SQLite databases.

Provides utilities to locate, list, inspect, and clean up .db files produced by pipeline runs. The default cache location follows the XDG Base Directory Specification and can be overridden with the PSNC_CACHE_DIR environment variable.

Usage#

>>> from physicsnemo_curator.core.cache import default_cache_dir, list_databases
>>> cache = default_cache_dir()
>>> for info in list_databases(cache):
...     print(info.hash_prefix, info.source_name, info.completed)

Attributes#

Classes#

DBInfo

Metadata about a single pipeline database file.

Functions#

cache_size(→ int)

Return the total size in bytes of all .db files in the cache.

clear_cache(→ int)

Remove all .db files from the cache directory.

default_cache_dir(→ pathlib.Path)

Return the default cache directory for pipeline databases.

default_data_cache_dir(→ pathlib.Path)

Return the persistent cache directory for downloaded source data.

list_databases(→ list[DBInfo])

List all pipeline databases in the cache directory.

remove_databases(→ int)

Remove pipeline databases matching the given identifiers.

remove_older_than(→ int)

Remove pipeline databases older than max_age (by file mtime).

Module Contents#

class physicsnemo_curator.core.cache.DBInfo[source]#

Metadata about a single pipeline database file.

Parameters:
  • hash_prefix (str) – Filename stem (the config hash prefix used as the DB name).

  • path (pathlib.Path) – Absolute path to the .db file.

  • size_bytes (int) – File size in bytes.

  • created (datetime.datetime) – Pipeline run start timestamp (from pipeline_runs.started_at).

  • source_name (str) – Registered source name extracted from the stored config JSON.

  • sink_name (str) – Registered sink name extracted from the stored config JSON.

  • filter_names (list[str]) – Registered filter names extracted from the stored config JSON.

  • total (int) – Total number of index_results rows (completed + failed).

  • completed (int) – Number of completed index results.

  • failed (int) – Number of failed index results.

completed: int = 0#
created: datetime.datetime#
failed: int = 0#
filter_names: list[str] = []#
hash_prefix: str#
path: pathlib.Path#
sink_name: str#
size_bytes: int#
source_name: str#
total: int = 0#
physicsnemo_curator.core.cache.cache_size(*, cache_dir: pathlib.Path | None = None) int[source]#

Return the total size in bytes of all .db files in the cache.

Parameters:

cache_dir (pathlib.Path | None, optional) – Directory to measure. Defaults to default_cache_dir().

Returns:

Total bytes occupied by .db files, or 0 if the directory is empty or does not exist.

Return type:

int
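
As a rough illustration of what is being measured, the total can be approximated with pathlib alone. This is a sketch, not the library's implementation: the helper name sum_db_sizes is hypothetical, and counting only top-level *.db files (no recursion) is an assumption.

```python
from pathlib import Path


def sum_db_sizes(cache_dir: Path) -> int:
    """Hypothetical stand-in for cache_size(): total bytes of *.db files."""
    # A missing directory counts as empty, matching the documented return of 0.
    if not cache_dir.is_dir():
        return 0
    # Assumption: only top-level *.db files are counted (no recursion).
    return sum(p.stat().st_size for p in cache_dir.glob("*.db") if p.is_file())
```

Files with other extensions in the same directory do not contribute to the total in this sketch.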

physicsnemo_curator.core.cache.clear_cache(*, cache_dir: pathlib.Path | None = None) int[source]#

Remove all .db files from the cache directory.

Parameters:

cache_dir (pathlib.Path | None, optional) – Directory to clear. Defaults to default_cache_dir().

Returns:

Number of database files removed.

Return type:

int

physicsnemo_curator.core.cache.default_cache_dir() pathlib.Path[source]#

Return the default cache directory for pipeline databases.

Resolution order (highest priority first):

  1. PSNC_CACHE_DIR environment variable

  2. $XDG_CACHE_HOME/psnc/

  3. ~/.cache/psnc/

Returns:

Absolute path to the cache directory (may not exist yet).

Return type:

pathlib.Path

Examples

>>> import os
>>> from physicsnemo_curator.core.cache import default_cache_dir
>>> os.environ["PSNC_CACHE_DIR"] = "/tmp/my_cache"
>>> default_cache_dir()
PosixPath('/tmp/my_cache')
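
The resolution order above can be sketched as a small standalone function. This is an illustration under stated assumptions rather than the library's code: resolve_cache_dir is a hypothetical name, the environment is passed in as a mapping so the logic is easy to exercise, and an empty variable is treated the same as an unset one.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Hypothetical re-implementation of the documented resolution order."""
    # 1. An explicit override wins.
    override = env.get("PSNC_CACHE_DIR")
    if override:
        return Path(override)
    # 2. Otherwise use the XDG cache home, if set.
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg) / "psnc"
    # 3. Fall back to the conventional ~/.cache location.
    return Path.home() / ".cache" / "psnc"
```

Like the documented function, this sketch only computes a path and does not create the directory.
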
physicsnemo_curator.core.cache.default_data_cache_dir(source_name: str) pathlib.Path[source]#

Return the persistent cache directory for downloaded source data.

Provides a standard location for remote sources to store downloaded files so they persist across pipeline runs. The directory is created if it does not yet exist.

Resolution order follows default_cache_dir(), with data/<source_name> appended:

  1. $PSNC_CACHE_DIR/data/<source_name>/

  2. $XDG_CACHE_HOME/psnc/data/<source_name>/

  3. ~/.cache/psnc/data/<source_name>/

Parameters:

source_name (str) – Short identifier for the source (e.g. "drivaerml", "ahmedml"). Used as the subdirectory name.

Returns:

Absolute path to the data cache directory (created if needed).

Return type:

pathlib.Path

Examples

>>> from physicsnemo_curator.core.cache import default_data_cache_dir
>>> default_data_cache_dir("drivaerml")
PosixPath('/home/user/.cache/psnc/data/drivaerml')
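
The "created if needed" behaviour can be pictured with a few lines of pathlib. The helper data_dir_under is hypothetical and takes the base cache directory as an argument; how the library itself creates the directory is not stated here, so the mkdir call is an assumption.

```python
from pathlib import Path


def data_dir_under(base: Path, source_name: str) -> Path:
    """Hypothetical helper: <base>/data/<source_name>, created on demand."""
    target = base / "data" / source_name
    # parents=True builds intermediate directories;
    # exist_ok=True makes repeated calls safe.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling the helper twice with the same arguments returns the same path and leaves existing files in place.
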
physicsnemo_curator.core.cache.list_databases(cache_dir: pathlib.Path | None = None) list[DBInfo][source]#

List all pipeline databases in the cache directory.

Opens each .db file, reads the pipeline_runs and index_results tables, and returns metadata sorted newest first (by started_at timestamp). Corrupt or unreadable databases are silently skipped.

Parameters:

cache_dir (pathlib.Path | None, optional) – Directory to scan. Defaults to default_cache_dir().

Returns:

Metadata for each valid database, sorted newest first.

Return type:

list[DBInfo]
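
The "silently skipped" behaviour can be pictured with the standard sqlite3 module. This is a sketch, not the library's code: read_started_at and readable_databases are hypothetical names, only the pipeline_runs.started_at column mentioned above is used, and the query shape and the string-sortable timestamp format are assumptions.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def read_started_at(db_path: Path) -> Optional[str]:
    """Hypothetical helper: earliest started_at value, or None if unreadable."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute(
                "SELECT started_at FROM pipeline_runs ORDER BY started_at LIMIT 1"
            ).fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        # Covers non-SQLite files and databases missing the expected table.
        return None
    return row[0] if row else None


def readable_databases(cache_dir: Path) -> list[tuple[Path, str]]:
    """Return (path, started_at) pairs, newest first, skipping unreadable files."""
    found = []
    for path in cache_dir.glob("*.db"):
        started = read_started_at(path)
        if started is not None:
            found.append((path, started))
    # Assumption: timestamps are ISO-8601 strings, which sort chronologically.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Catching sqlite3.Error rather than a bare Exception keeps unrelated failures, such as permission errors raised elsewhere, visible to the caller.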

physicsnemo_curator.core.cache.remove_databases(
identifiers: list[str],
*,
cache_dir: pathlib.Path | None = None,
) int[source]#

Remove pipeline databases matching the given identifiers.

Each identifier is first tested as an exact stem match. If no exact match is found it is treated as a prefix and matched against .db file stems. A prefix that matches more than one file raises ValueError to prevent accidental deletion.

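Parameters:
  • identifiers (list[str]) – Database file stems, or unambiguous prefixes of stems, identifying the databases to remove.

  • cache_dir (pathlib.Path | None, optional) – Directory to scan. Defaults to default_cache_dir().
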
Returns:

Number of database files removed.

Return type:

int

Raises:

ValueError – If a prefix is ambiguous (matches more than one .db file).
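
The exact-then-prefix matching rule can be sketched as follows. The helper resolve_identifier is hypothetical and only selects files without deleting anything; returning an empty list when nothing matches is an assumption, since the behaviour for unmatched identifiers is not described here.

```python
from pathlib import Path


def resolve_identifier(identifier: str, cache_dir: Path) -> list[Path]:
    """Hypothetical helper: map one identifier to the .db files it selects."""
    candidates = sorted(cache_dir.glob("*.db"))
    # An exact stem match takes priority over prefix matching.
    exact = [p for p in candidates if p.stem == identifier]
    if exact:
        return exact
    # Otherwise treat the identifier as a prefix of the file stem.
    matches = [p for p in candidates if p.stem.startswith(identifier)]
    if len(matches) > 1:
        raise ValueError(
            f"ambiguous prefix {identifier!r}: matches {len(matches)} files"
        )
    return matches
```

Checking for an exact match first means a stem such as "abc" can still be removed when "abc123.db" also exists, while a shorter prefix such as "ab" is rejected as ambiguous.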

physicsnemo_curator.core.cache.remove_older_than(
max_age: datetime.timedelta,
*,
cache_dir: pathlib.Path | None = None,
) int[source]#

Remove pipeline databases older than max_age (by file mtime).

Parameters:
  • max_age (datetime.timedelta) – Maximum age. Files whose mtime is earlier than now - max_age are removed.

  • cache_dir (pathlib.Path | None, optional) – Directory to scan. Defaults to default_cache_dir().

Returns:

Number of database files removed.

Return type:

int
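
The mtime-based cutoff can be sketched with the standard library. The helper remove_stale is hypothetical, and scanning only top-level *.db files is an assumption; the library's implementation may differ.

```python
import time
from datetime import timedelta
from pathlib import Path


def remove_stale(cache_dir: Path, max_age: timedelta) -> int:
    """Hypothetical helper: delete *.db files whose mtime predates now - max_age."""
    cutoff = time.time() - max_age.total_seconds()
    removed = 0
    # Materialise the listing first so deletion does not disturb iteration.
    for path in list(cache_dir.glob("*.db")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

Because the comparison uses file modification time rather than pipeline_runs.started_at, a database that was touched recently is kept even if its pipeline run started long ago.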

physicsnemo_curator.core.cache.logger#