API Reference

Storage Client

class StorageClient(config: StorageClientConfig)[source]

A client for interacting with different storage providers.

Initializes the StorageClient with the given configuration.

Parameters:

config (StorageClientConfig) – The configuration object for the storage client.

commit_metadata(prefix: str | None = None) None[source]

Commits any pending updates to the metadata provider. No-op if not using a metadata provider.

Parameters:

prefix (str | None) – If provided, scans the prefix to find files to commit.

Return type:

None

copy(src_path: str, dest_path: str) None[source]

Copies an object from source to destination path.

Parameters:
  • src_path (str) – The logical path of the source object to copy.

  • dest_path (str) – The logical path of the destination.

Return type:

None

delete(path: str, recursive: bool = False) None[source]

Deletes an object at the specified path.

Parameters:
  • path (str) – The logical path of the object to delete.

  • recursive (bool) – Whether to delete objects in the path recursively.

Return type:

None

glob(pattern: str, include_url_prefix: bool = False, attribute_filter_expression: str | None = None) list[str][source]

Matches and retrieves a list of objects in the storage provider that match the specified pattern.

Parameters:
  • pattern (str) – The pattern to match object paths against, supporting wildcards (e.g., *.txt).

  • include_url_prefix (bool) – Whether to include the URL prefix msc://profile in the result.

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

A list of object paths that match the pattern.

Return type:

list[str]
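The wildcard matching described above can be sketched against a flat key listing with the standard library's fnmatch (a standalone illustration with a hypothetical match_keys helper, not the library's implementation):

```python
from fnmatch import fnmatch

def match_keys(keys: list[str], pattern: str) -> list[str]:
    """Return the keys matching a glob-style pattern such as '*.txt'.

    Note: fnmatch's '*' also crosses '/' separators, which a real storage
    glob implementation may handle differently.
    """
    return [k for k in keys if fnmatch(k, pattern)]

keys = ["data/a.txt", "data/b.bin", "notes.txt"]
```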

info(path: str, strict: bool = True) ObjectMetadata[source]

Retrieves metadata or information about an object stored at the specified path.

Parameters:
  • path (str) – The logical path to the object for which metadata or information is being retrieved.

  • strict (bool) – If True, performs additional validation to determine whether the path refers to a directory.

Returns:

An ObjectMetadata object containing metadata about the object.

Return type:

ObjectMetadata

is_default_profile() bool[source]

Return True if the storage client is using the default profile.

Return type:

bool

is_empty(path: str) bool[source]

Checks whether the specified path is empty. A path is considered empty if there are no objects whose keys start with the given path as a prefix.

Parameters:

path (str) – The logical path to check. This is typically a prefix representing a directory or folder.

Returns:

True if no objects exist under the specified path prefix, False otherwise.

Return type:

bool
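The emptiness check above reduces to "no object key starts with the given prefix", which can be sketched as follows (prefix_is_empty and the sample keys are hypothetical):

```python
def prefix_is_empty(keys: list[str], prefix: str) -> bool:
    """True if no object key starts with the given path prefix."""
    return not any(k.startswith(prefix) for k in keys)

keys = ["datasets/train/0.tar", "datasets/val/0.tar"]
```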

is_file(path: str) bool[source]

Checks whether the specified path points to a file (rather than a directory or folder).

Parameters:

path (str) – The logical path to check.

Returns:

True if the path points to a file, False otherwise.

Return type:

bool

list(prefix: str = '', start_after: str | None = None, end_at: str | None = None, include_directories: bool = False, include_url_prefix: bool = False, attribute_filter_expression: str | None = None) Iterator[ObjectMetadata][source]

Lists objects in the storage provider under the specified prefix.

Parameters:
  • prefix (str) – The prefix to list objects under.

  • start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.

  • end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.

  • include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.

  • include_url_prefix (bool) – Whether to include the URL prefix msc://profile in the result.

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

An iterator over objects.

Return type:

Iterator[ObjectMetadata]
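The start_after (exclusive) and end_at (inclusive) bounds can be illustrated over a sorted key list (a standalone sketch of the documented range semantics, not the library's implementation):

```python
def list_range(sorted_keys, start_after=None, end_at=None):
    """Filter lexicographically sorted keys.

    start_after is exclusive, end_at is inclusive, and neither bound
    has to correspond to an existing key.
    """
    out = []
    for k in sorted_keys:
        if start_after is not None and k <= start_after:
            continue
        if end_at is not None and k > end_at:
            break
        out.append(k)
    return out

keys = ["a/1", "a/2", "b/1", "c/1"]
```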

open(path: str, mode: str = 'rb', buffering: int = -1, encoding: str | None = None, disable_read_cache: bool = False, memory_load_limit: int = 536870912, atomic: bool = True, check_source_version: SourceVersionCheckMode = SourceVersionCheckMode.INHERIT, attributes: dict[str, str] | None = None) PosixFile | ObjectFile[source]

Returns a file-like object from the specified path.

Parameters:
  • path (str) – The logical path of the object to read.

  • mode (str) – The file mode, only “w”, “r”, “a”, “wb”, “rb” and “ab” are supported.

  • buffering (int) – The buffering mode. Only applies to PosixFile.

  • encoding (str | None) – The encoding to use for text files.

  • disable_read_cache (bool) – When set to True, disables caching for the file content. This parameter is only applicable to ObjectFile when the mode is “r” or “rb”.

  • memory_load_limit (int) – Size limit in bytes for loading files into memory. Defaults to 512MB. This parameter is only applicable to ObjectFile when the mode is “r” or “rb”.

  • atomic (bool) – When set to True, the file will be written atomically (rename upon close). This parameter is only applicable to PosixFile in write mode.

  • check_source_version (SourceVersionCheckMode) – Whether to check the source version of cached objects.

  • attributes (dict[str, str] | None) – The attributes to add to the file. This parameter is only applicable when the mode is “w” or “wb” or “a” or “ab”.

Returns:

A file-like object (PosixFile or ObjectFile) for the specified path.

Return type:

PosixFile | ObjectFile

sync_from(source_client: StorageClient, source_path: str = '', target_path: str = '', delete_unmatched_files: bool = False, description: str = 'Syncing', num_worker_processes: int | None = None) None[source]

Syncs files from the source storage client to the target path.

Parameters:
  • source_client (StorageClient) – The source storage client.

  • source_path (str) – The logical path to sync from.

  • target_path (str) – The logical path to sync to.

  • delete_unmatched_files (bool) – Whether to delete files at the target that are not present at the source.

  • description (str) – Description of sync process for logging purposes.

  • num_worker_processes (int | None) – The number of worker processes to use.

Return type:

None
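The effect of delete_unmatched_files can be sketched by planning a sync between two key-to-etag maps (plan_sync is a hypothetical helper illustrating the documented semantics, not the library's algorithm):

```python
def plan_sync(source, target, delete_unmatched_files=False):
    """Given source/target as {key: etag} dicts, return (to_copy, to_delete)."""
    # Copy keys that are missing at the target or whose etag differs.
    to_copy = sorted(k for k, v in source.items() if target.get(k) != v)
    # Optionally delete target keys that are absent at the source.
    to_delete = sorted(k for k in target if k not in source) if delete_unmatched_files else []
    return to_copy, to_delete

src = {"a.txt": "v1", "b.txt": "v2"}
dst = {"a.txt": "v1", "stale.txt": "v0"}
```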

class StorageClientConfig(profile: str, storage_provider: StorageProvider, credentials_provider: CredentialsProvider | None = None, metadata_provider: MetadataProvider | None = None, cache_config: CacheConfig | None = None, cache_manager: CacheBackend | None = None, retry_config: RetryConfig | None = None, metric_gauges: dict[GaugeName, Gauge] | None = None, metric_counters: dict[CounterName, Counter] | None = None, metric_attributes_providers: Sequence[AttributesProvider] | None = None)[source]

Configuration class for the multistorageclient.StorageClient.

static read_msc_config() dict[str, Any] | None[source]

Get the MSC configuration dictionary.

Returns:

The MSC configuration dictionary, or an empty dict if no config was found.

Return type:

dict[str, Any] | None

static read_path_mapping() PathMapping[source]

Get the path mapping defined in the MSC configuration.

Path mappings create a nested structure of protocol -> bucket -> [(prefix, profile)] where entries are sorted by prefix length (longest first) for optimal matching. Longer paths take precedence when matching.

Returns:

A PathMapping instance with translation mappings

Return type:

PathMapping
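The longest-prefix precedence described above can be sketched over a flat list of (prefix, profile) pairs (resolve_profile and the sample mappings are hypothetical):

```python
def resolve_profile(mappings, path):
    """Return the profile whose prefix is the longest match for path, else None."""
    best = None
    for prefix, profile in mappings:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, profile)
    return best[1] if best else None

mappings = [
    ("s3://bucket/", "generic"),
    ("s3://bucket/datasets/", "datasets"),  # longer prefix takes precedence
]
```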

Pathlib

class MultiStoragePath(path: str | PathLike)[source]

A path object similar to pathlib.Path that supports both local and remote file systems.

MultiStoragePath provides a unified interface for working with paths across different storage systems, including local files, S3, GCS, Azure Blob Storage, and more. It uses the “msc://” protocol prefix to identify remote storage paths.

This implementation is based on Python 3.9’s pathlib.Path interface, providing compatible behavior for local filesystem operations while extending support to remote storage systems.

Examples:
>>> import multistorageclient as msc
>>> msc.Path("/local/path/file.txt")
>>> msc.Path("msc://my-profile/data/file.txt")
>>> msc.Path(pathlib.Path("relative/path"))

Initialize path object supporting multiple storage backends.

Parameters:

path (str | PathLike) – String, Path, or MultiStoragePath. Relative paths are automatically converted to absolute.

absolute()[source]

Return the path itself since it is always absolute.

property anchor: str

The concatenation of the drive and root, or ‘’.

as_posix() str[source]

Return the string representation of the path with forward (/) slashes.

If the path is a remote path, the file content is downloaded to local storage (either cached or temporary file) and the local filesystem path is returned. This enables access to remote file content through standard filesystem operations.

Return type:

str

chmod(mode)[source]

Change the permissions of the path, like os.chmod().

Not supported for remote storage paths.

classmethod cwd()[source]

Return a new path pointing to the current working directory.

exists() bool[source]

Return True if the path exists.

Return type:

bool

expanduser()[source]

Return a new path with expanded ~ and ~user constructs (as returned by os.path.expanduser).

Not supported for remote storage paths.

glob(pattern)[source]

Iterate over this subtree and yield all existing files (of any kind, including directories) matching the given relative pattern.

group()[source]

Return the group name of the file gid.

Not supported for remote storage paths.

classmethod home()[source]

Return a new path pointing to the user’s home directory.

is_absolute() bool[source]

Paths are always absolute.

Return type:

bool

is_block_device()[source]

Return True if the path exists and is a block device.

Not supported for remote storage paths.

is_char_device()[source]

Return True if the path exists and is a character device.

Not supported for remote storage paths.

is_dir(strict: bool = True) bool[source]

Return True if the path exists and is a directory.

Parameters:

strict (bool)

Return type:

bool

is_fifo()[source]

Return True if the path exists and is a FIFO.

Not supported for remote storage paths.

is_file(strict: bool = True) bool[source]

Return True if the path exists and is a regular file.

Parameters:

strict (bool)

Return type:

bool

is_mount()[source]

Return True if the path exists and is a mount point.

Not supported for remote storage paths.

is_relative_to(other: MultiStoragePath) bool[source]

Return True if the path is relative to another path or False.

Parameters:

other (MultiStoragePath)

Return type:

bool

is_reserved() bool[source]

Return True if the path contains one of the special names reserved by the system, if any.

Return type:

bool

is_socket()[source]

Return True if the path exists and is a socket.

Not supported for remote storage paths.

is_symlink() bool[source]

Return True if the path exists and is a symbolic link.

Not supported for remote storage paths.

iterdir()[source]

Yield path objects of the directory contents.

joinpath(*pathsegments)[source]

Combine this path with one or several other path segments.

lchmod(mode)[source]

Like chmod(), except if the path points to a symlink, the symlink’s permissions are changed, rather than its target’s.

Not supported for remote storage paths.

lstat()[source]

Like stat(), except if the path points to a symlink, the symlink’s status information is returned, rather than its target’s.

If the path is a remote path, the result is a multistorageclient.pathlib.StatResult object.

match(pattern) bool[source]

Return True if this path matches the given pattern.

Return type:

bool

mkdir(mode=511, parents=False, exist_ok=False) None[source]

Create a new directory at this given path.

For remote storage paths, this operation is a no-op.

Return type:

None

property name: str

The final path component, if any.

open(mode='r', buffering=-1, encoding=None, errors=None, newline=None, check_source_version=SourceVersionCheckMode.INHERIT)[source]

Open the file and return a file object.

owner()[source]

Return the login name of the file owner.

Not supported for remote storage paths.

property parent: MultiStoragePath

The logical parent of the path.

property parents: list[MultiStoragePath]

A sequence of this path’s logical parents.

property parts

An object providing sequence-like access to the components in the filesystem path (does not include the msc:// and the profile name).

read_bytes() bytes[source]

Open the file in bytes mode, read it, and close the file.

Return type:

bytes

read_text(encoding: str = 'utf-8', errors: str = 'strict') str[source]

Open the file in text mode, read it, and close the file.

Parameters:
  • encoding (str)

  • errors (str)

Return type:

str

readlink()[source]

Return the path to which the symbolic link points.

Not supported for remote storage paths.

relative_to(other: MultiStoragePath) MultiStoragePath[source]

Not implemented.

Parameters:

other (MultiStoragePath)

Return type:

MultiStoragePath

rename(target) MultiStoragePath[source]

Rename this path to the target path.

Return type:

MultiStoragePath

replace(target)[source]

Rename this path to the target path, overwriting if that path exists.

Not supported for remote storage paths.

resolve(strict=False)[source]

Return the absolute path.

rglob(pattern)[source]

Recursively yield all existing files (of any kind, including directories) matching the given relative pattern, anywhere in this subtree.

rmdir() None[source]

Remove this directory. The directory must be empty.

Not supported for remote storage paths.

Return type:

None

samefile(other_path)[source]

Return True if both paths point to the same file or directory.

Not supported for remote storage paths.

stat()[source]

Return the result of the stat() system call on this path, like os.stat() does.

If the path is a remote path, the result is a multistorageclient.pathlib.StatResult object.

property stem: str

The final path component, minus its last suffix.

property suffix: str

The file extension of the final path component, if any.

property suffixes: list[str]

A list of the final component’s suffixes, if any.

These include the leading periods. For example: [‘.tar’, ‘.gz’]

symlink_to(target)[source]

Make this path a symlink pointing to the target path.

Not supported for remote storage paths.

touch(mode=438, exist_ok=False)[source]

Create this file with the given access mode, if it doesn’t exist.

unlink(missing_ok: bool = False) None[source]

Remove this file or link. If the path is a directory, use rmdir() instead.

Parameters:

missing_ok (bool)

Return type:

None

walk(top_down=True, on_error=None, follow_symlinks=False)[source]

Walk the directory tree from this directory, similar to os.walk().

Not supported for remote storage paths.

with_name(name: str) MultiStoragePath[source]

Return a new path with the file name changed.

Parameters:

name (str)

Return type:

MultiStoragePath

with_segments(*pathsegments) MultiStoragePath[source]

Construct a new path object from any number of path-like objects.

Return type:

MultiStoragePath

with_stem(stem: str) MultiStoragePath[source]

Return a new path with the stem changed.

Parameters:

stem (str)

Return type:

MultiStoragePath

with_suffix(suffix: str) MultiStoragePath[source]

Return a new path with the file suffix changed. If the path has no suffix, add given suffix. If the given suffix is an empty string, remove the suffix from the path.

Parameters:

suffix (str)

Return type:

MultiStoragePath
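The add/replace/remove rules for with_suffix mirror pathlib, so they can be demonstrated with a plain PurePosixPath (a standalone illustration of the same semantics):

```python
from pathlib import PurePosixPath

p = PurePosixPath("/data/archive.tar")
replaced = p.with_suffix(".gz")                            # replaces the existing suffix
added = PurePosixPath("/data/archive").with_suffix(".tar") # adds a suffix when there is none
removed = p.with_suffix("")                                # empty string removes the suffix
```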

write_bytes(data: bytes) None[source]

Open the file in bytes mode, write to it, and close the file.

Parameters:

data (bytes)

Return type:

None

write_text(data: str, encoding: str = 'utf-8', errors: str = 'strict') None[source]

Open the file in text mode, write to it, and close the file.

Parameters:
  • data (str)

  • encoding (str)

  • errors (str)

Return type:

None

class StatResult(metadata: ObjectMetadata)[source]

A stat-like result object that mimics os.stat_result for remote storage paths.

This class provides the same interface as os.stat_result but is populated from ObjectMetadata obtained from storage providers.

Initialize StatResult from ObjectMetadata.

Parameters:

metadata (ObjectMetadata)
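The mapping from object metadata onto stat-style fields can be sketched with stand-in types (ObjectMetadataSketch and stat_fields are hypothetical; the real classes live in the library):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ObjectMetadataSketch:  # hypothetical stand-in for ObjectMetadata
    key: str
    content_length: int
    last_modified: datetime

def stat_fields(meta):
    """Map object metadata onto the os.stat_result fields callers usually read."""
    return {
        "st_size": meta.content_length,
        "st_mtime": meta.last_modified.timestamp(),
    }

meta = ObjectMetadataSketch("data/file.txt", 1024,
                            datetime(2024, 1, 1, tzinfo=timezone.utc))
```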

Shortcuts

commit_metadata(url: str) None[source]

Commits the metadata updates for the specified storage client profile.

Parameters:

url (str) – The URL of the path to commit metadata for.

Return type:

None

delete(url: str, recursive: bool = False) None[source]

Deletes the specified object(s) from the storage provider.

This function retrieves the corresponding multistorageclient.StorageClient for the given URL and deletes the object(s) at the specified path.

Parameters:
  • url (str) – The URL of the object to delete. (example: msc://profile/prefix/file.txt)

  • recursive (bool) – Whether to delete objects in the path recursively.

Return type:

None

download_file(url: str, local_path: str) None[source]

Download a file from the given URL to a local path.

The function utilizes the multistorageclient.StorageClient to download a file (object) at the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.

Parameters:
  • url (str) – The URL of the file to download. (example: msc://profile/prefix/dataset.tar)

  • local_path (str) – The local path where the file should be downloaded.

Raises:

ValueError – If the URL’s protocol does not match the expected protocol msc.

Return type:

None

get_telemetry() Telemetry | None[source]

Get the Telemetry instance to use for storage clients created by shortcuts.

Returns:

A telemetry instance.

Return type:

Telemetry | None

glob(pattern: str, attribute_filter_expression: str | None = None) list[str][source]

Return a list of files matching a pattern.

This function supports glob-style patterns for matching multiple files within a storage system. The pattern is parsed, and the associated multistorageclient.StorageClient is used to retrieve the list of matching files.

Parameters:
  • pattern (str) – The glob-style pattern to match files. (example: msc://profile/prefix/**/*.tar)

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

A list of file paths matching the pattern.

Raises:

ValueError – If the URL’s protocol does not match the expected protocol msc.

Return type:

list[str]

info(url: str) ObjectMetadata[source]

Retrieves metadata or information about an object stored at the specified path.

Parameters:

url (str) – The URL of the object to retrieve information about. (example: msc://profile/prefix/file.txt)

Returns:

An ObjectMetadata object representing the object’s metadata.

Return type:

ObjectMetadata

is_empty(url: str) bool[source]

Checks whether the specified URL contains any objects.

Parameters:

url (str) – The URL to check, typically pointing to a storage location.

Returns:

True if there are no objects/files under this URL, False otherwise.

Raises:

ValueError – If the URL’s protocol does not match the expected protocol msc.

Return type:

bool

is_file(url: str) bool[source]

Checks whether the specified url points to a file (rather than a directory or folder).

The function utilizes the multistorageclient.StorageClient to check if a file (object) exists at the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.

Parameters:

url (str) – The URL to check the existence of a file. (example: msc://profile/prefix/dataset.tar)

Return type:

bool

list(url: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False, attribute_filter_expression: str | None = None) Iterator[ObjectMetadata][source]

Lists the contents of the specified URL prefix.

This function retrieves the corresponding multistorageclient.StorageClient for the given URL and returns an iterator of objects (files or directories) stored under the provided prefix.

Parameters:
  • url (str) – The prefix to list objects under.

  • start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.

  • end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.

  • include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

An iterator of ObjectMetadata objects representing the files (and optionally directories) accessible under the specified URL prefix. The returned keys will always be prefixed with msc://.

Return type:

Iterator[ObjectMetadata]

open(url: str, mode: str = 'rb', **kwargs: Any) PosixFile | ObjectFile[source]

Open a file at the given URL using the specified mode.

The function utilizes the multistorageclient.StorageClient to open a file at the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.

Parameters:
  • url (str) – The URL of the file to open. (example: msc://profile/prefix/dataset.tar)

  • mode (str) – The file mode to open the file in.

  • kwargs (Any)

Returns:

A file-like object that allows interaction with the file.

Raises:

ValueError – If the URL’s protocol does not match the expected protocol msc.

Return type:

PosixFile | ObjectFile

resolve_storage_client(url: str) tuple[StorageClient, str][source]

Build and return a multistorageclient.StorageClient instance based on the provided URL or path.

This function parses the given URL or path and determines the appropriate storage profile and path. It supports URLs with the protocol msc://, as well as POSIX paths or file:// URLs for local file system access. If the profile has already been instantiated, it returns the cached client. Otherwise, it creates a new StorageClient and caches it.

The function also supports implicit profiles for non-MSC URLs. When a non-MSC URL is provided (like s3://, gs://, ais://, file://), MSC will infer the storage provider based on the URL protocol and create an implicit profile with the naming convention “_protocol-bucket” (e.g., “_s3-bucket1”, “_gs-bucket1”).

Path mapping defined in the MSC configuration are also applied before creating implicit profiles. This allows for explicit mappings between source paths and destination MSC profiles.

Parameters:

url (str) – The storage location, which can be:
  • A URL in the format msc://profile/path for object storage.

  • A local file system path (absolute POSIX path) or a file:// URL.

  • A non-MSC URL with a supported protocol (s3://, gs://, ais://).

Returns:

A tuple containing the multistorageclient.StorageClient instance and the parsed path.

Raises:

ValueError – If the URL’s protocol is neither msc nor a valid local file system path or a supported non-MSC protocol.

Return type:

tuple[StorageClient, str]
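The implicit-profile naming convention "_protocol-bucket" described above can be sketched with urllib (implicit_profile is a hypothetical helper illustrating only the naming rule, not the full resolution logic):

```python
from urllib.parse import urlparse

def implicit_profile(url: str) -> str:
    """Derive the implicit profile name "_protocol-bucket" for a non-MSC URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("s3", "gs", "ais"):
        raise ValueError(f"unsupported protocol: {parsed.scheme}")
    return f"_{parsed.scheme}-{parsed.netloc}"
```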

set_telemetry(telemetry: Telemetry | None) None[source]

Set the Telemetry instance to use for storage clients created by shortcuts.

Parameters:

telemetry (Telemetry | None) – A telemetry instance.

Return type:

None

sync(source_url: str, target_url: str, delete_unmatched_files: bool = False) None[source]

Syncs files from the source storage to the target storage.

Parameters:
  • source_url (str) – The URL for the source storage.

  • target_url (str) – The URL for the target storage.

  • delete_unmatched_files (bool) – Whether to delete files at the target that are not present at the source.

Return type:

None

upload_file(url: str, local_path: str, attributes: dict[str, str] | None = None) None[source]

Upload a file to the given URL from a local path.

The function utilizes the multistorageclient.StorageClient to upload a file (object) to the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.

Parameters:
  • url (str) – The URL of the file. (example: msc://profile/prefix/dataset.tar)

  • local_path (str) – The local path of the file.

  • attributes (dict[str, str] | None)

Raises:

ValueError – If the URL’s protocol does not match the expected protocol msc.

Return type:

None

write(url: str, body: bytes, attributes: dict[str, str] | None = None) None[source]

Writes an object to the storage provider at the specified path.

Parameters:
  • url (str) – The path where the object should be written.

  • body (bytes) – The content to write to the object.

  • attributes (dict[str, str] | None)

Return type:

None

Types

class Credentials(access_key: str, secret_key: str, token: str | None, expiration: str | None, custom_fields: dict[str, Any] = <factory>)[source]

A data class representing the credentials needed to access a storage provider.

access_key: str

The access key for authentication.

custom_fields: dict[str, Any]

A dictionary for storing custom key-value pairs.

expiration: str | None

The expiration time of the credentials in ISO 8601 format.

get_custom_field(key: str, default: Any | None = None) Any[source]

Retrieves a value from custom fields by its key.

Parameters:
  • key (str) – The key to look up in custom fields.

  • default (Any | None) – The default value to return if the key is not found.

Returns:

The value associated with the key, or the default value if not found.

Return type:

Any

is_expired() bool[source]

Checks if the credentials are expired based on the expiration time.

Returns:

True if the credentials are expired, False otherwise.

Return type:

bool

secret_key: str

The secret key for authentication.

token: str | None

An optional security token for temporary credentials.

class CredentialsProvider[source]

Abstract base class for providing credentials to access a storage provider.

abstract get_credentials() Credentials[source]

Retrieves the current credentials.

Returns:

The current credentials used for authentication.

Return type:

Credentials

abstract refresh_credentials() None[source]

Refreshes the credentials if they are expired or about to expire.

Return type:

None
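A concrete provider implements both abstract methods. The sketch below uses a minimal stand-in ABC mirroring the documented interface (the real base class lives in the library; StaticCredentialsProvider is hypothetical):

```python
from abc import ABC, abstractmethod

class CredentialsProviderSketch(ABC):  # stand-in for the documented interface
    @abstractmethod
    def get_credentials(self): ...
    @abstractmethod
    def refresh_credentials(self): ...

class StaticCredentialsProvider(CredentialsProviderSketch):
    """Serves fixed, non-expiring credentials; refresh is a no-op."""
    def __init__(self, access_key, secret_key):
        self._creds = {"access_key": access_key, "secret_key": secret_key, "token": None}

    def get_credentials(self):
        return self._creds

    def refresh_credentials(self):
        pass  # nothing to refresh for static credentials

provider = StaticCredentialsProvider("example-key", "example-secret")
```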

class MetadataProvider[source]

Abstract base class for accessing file metadata.

abstract add_file(path: str, metadata: ObjectMetadata) None[source]

Add a file to be tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).

Parameters:
  • path (str) – User-supplied virtual path

  • metadata (ObjectMetadata) – physical file metadata from StorageProvider

Return type:

None

abstract commit_updates() None[source]

Commit any newly added files; used in conjunction with MetadataProvider.add_file(). The MetadataProvider will persistently record any metadata changes.

Return type:

None

abstract get_object_metadata(path: str, include_pending: bool = False) ObjectMetadata[source]

Retrieves metadata or information about an object stored in the provider.

Parameters:
  • path (str) – The path of the object.

  • include_pending (bool) – Whether to include metadata that is not yet committed.

Returns:

A metadata object containing the information about the object.

Return type:

ObjectMetadata

abstract glob(pattern: str) list[str][source]

Matches and retrieves a list of object keys in the storage provider that match the specified pattern.

Parameters:

pattern (str) – The pattern to match object keys against, supporting wildcards (e.g., *.txt).

Returns:

A list of object keys that match the specified pattern.

Return type:

list[str]

abstract is_writable() bool[source]

Returns True if the MetadataProvider supports writes else False.

Return type:

bool

abstract list_objects(prefix: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False) Iterator[ObjectMetadata][source]

Lists objects in the storage provider under the specified prefix.

Parameters:
  • prefix (str) – The prefix or path to list objects under.

  • start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.

  • end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.

  • include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.

Returns:

An iterator over object metadata under the specified prefix.

Return type:

Iterator[ObjectMetadata]

abstract realpath(path: str) tuple[str, bool][source]

Returns the canonical, full real physical path for use by a StorageProvider. This provides translation from user-visible paths to the canonical paths needed by a StorageProvider.

Parameters:

path (str) – user-supplied virtual path

Returns:

A tuple of the canonical physical path and whether the object at that path is valid.

Return type:

tuple[str, bool]

abstract remove_file(path: str) None[source]

Remove a file tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).

Parameters:

path (str) – User-supplied virtual path

Return type:

None
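The add_file/remove_file/commit_updates contract above (pending writes become visible only on commit; duplicate calls are idempotent) can be sketched with an in-memory provider (InMemoryMetadata is hypothetical and ignores the abstract base for brevity):

```python
class InMemoryMetadata:
    """Sketch of the pending/commit contract of MetadataProvider."""
    def __init__(self):
        self._committed = {}
        self._pending = {}
        self._removed = set()

    def add_file(self, path, metadata):
        self._pending[path] = metadata  # repeated calls simply overwrite
        self._removed.discard(path)

    def remove_file(self, path):
        self._removed.add(path)         # repeated calls are harmless
        self._pending.pop(path, None)

    def commit_updates(self):
        for path in self._removed:
            self._committed.pop(path, None)
        self._committed.update(self._pending)
        self._pending.clear()
        self._removed.clear()

    def list_objects(self):
        return sorted(self._committed)

mp = InMemoryMetadata()
mp.add_file("a.txt", {"size": 1})
mp.add_file("a.txt", {"size": 1})  # idempotent
before = mp.list_objects()         # pending adds are not listed yet
mp.commit_updates()
after = mp.list_objects()
```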

exception NotModifiedError[source]

Raised when a conditional operation fails because the resource has not been modified.

This typically occurs when using if-none-match with a specific generation/etag and the resource’s current generation/etag matches the specified one.

class ObjectMetadata(key: str, content_length: int, last_modified: datetime, type: str = 'file', content_type: str | None = None, etag: str | None = None, storage_class: str | None = None, metadata: dict[str, Any] | None = None)[source]

A data class that represents the metadata associated with an object stored in a cloud storage service. This metadata includes both required and optional information about the object.

content_length: int

The size of the object in bytes.

content_type: str | None = None

The MIME type of the object.

etag: str | None = None

The entity tag (ETag) of the object.

static from_dict(data: dict) ObjectMetadata[source]

Creates an ObjectMetadata instance from a dictionary (parsed from JSON).

Parameters:

data (dict)

Return type:

ObjectMetadata

key: str

Relative path of the object.

last_modified: datetime

The timestamp indicating when the object was last modified.

metadata: dict[str, Any] | None = None

Additional metadata associated with the object.

storage_class: str | None = None

The storage class of the object.

to_dict() dict[source]

Converts the ObjectMetadata instance to a dictionary.

Return type:

dict

type: str = 'file'

The type of the object (e.g., "file" or "directory").

exception PreconditionFailedError[source]

Exception raised when a precondition fails. e.g. if-match, if-none-match, etc.

class ProviderBundle[source]

Abstract base class that serves as a container for various providers (storage, credentials, and metadata) that interact with a storage service. The ProviderBundle abstracts access to these providers, allowing for flexible implementations of cloud storage solutions.

abstract property credentials_provider: CredentialsProvider | None
Returns:

The credentials provider responsible for managing authentication credentials required to access the storage service.

abstract property metadata_provider: MetadataProvider | None
Returns:

The metadata provider responsible for retrieving metadata about objects in the storage service.

abstract property storage_provider_config: StorageProviderConfig
Returns:

The configuration for the storage provider, which includes the provider name/type and additional options.

class Range(offset: int, size: int)[source]

Byte-range read.

offset: int

The byte offset at which the range starts.

size: int

The number of bytes to read.

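The Range(offset, size) contract maps directly onto a byte slice, as this standalone sketch shows (read_range is hypothetical):

```python
def read_range(data: bytes, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset`, as a Range(offset, size) request does."""
    return data[offset:offset + size]

blob = b"0123456789"
```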
class RetryConfig(attempts: int = 3, delay: float = 1.0)[source]

A data class that represents the configuration for retry strategy.

attempts: int = 3

The number of attempts before giving up. Must be at least 1.

delay: float = 1.0

The delay (in seconds) between retry attempts. Must be a non-negative value.
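A retry loop with the same attempts/delay semantics can be sketched as follows (with_retries is a hypothetical standalone helper; the library itself retries only on errors such as RetryableError, while this sketch retries on any exception):

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn, making up to `attempts` tries with `delay` seconds between them."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: give up
            time.sleep(delay)

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky, attempts=3, delay=0.0)
```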

exception RetryableError[source]

Exception raised for errors that should trigger a retry.

class SourceVersionCheckMode(value)[source]

Enum for controlling source version checking behavior.

DISABLE = 'disable'
ENABLE = 'enable'
INHERIT = 'inherit'
class StorageProvider[source]

Abstract base class for interacting with a storage provider.

abstract copy_object(src_path: str, dest_path: str) None[source]

Copies an object from source to destination in the storage provider.

Parameters:
  • src_path (str) – The path of the source object to copy.

  • dest_path (str) – The path of the destination.

Return type:

None

abstract delete_object(path: str, if_match: str | None = None) None[source]

Deletes an object from the storage provider.

Parameters:
  • path (str) – The path of the object to delete.

  • if_match (str | None) – Optional if-match value to use for conditional deletion.

Return type:

None

abstract download_file(remote_path: str, f: str | IO, metadata: ObjectMetadata | None = None) None[source]

Downloads a file from the storage provider to the local file system.

Parameters:
  • remote_path (str) – The path of the file to download.

  • f (str | IO) – The destination for the downloaded file. This can either be a string representing the local file path where the file will be saved, or a file-like object to write the downloaded content into.

  • metadata (ObjectMetadata | None) – Metadata about the object to download.

Return type:

None

abstract get_object(path: str, byte_range: Range | None = None) bytes[source]

Retrieves an object from the storage provider.

Parameters:
  • path (str) – The path where the object is stored.

  • byte_range (Range | None) – The byte range to read, if any.

Returns:

The content of the retrieved object.

Return type:

bytes

abstract get_object_metadata(path: str, strict: bool = True) ObjectMetadata[source]

Retrieves metadata or information about an object stored in the provider.

Parameters:
  • path (str) – The path of the object.

  • strict (bool) – If True, performs additional validation to determine whether the path refers to a directory.

Returns:

A metadata object containing the information about the object.

Return type:

ObjectMetadata

abstract glob(pattern: str, attribute_filter_expression: str | None = None) list[str][source]

Matches and retrieves a list of object keys in the storage provider that match the specified pattern.

Parameters:
  • pattern (str) – The pattern to match object keys against, supporting wildcards (e.g., *.txt).

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

A list of object keys that match the specified pattern.

Return type:

list[str]
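
The wildcard matching resembles shell-style globbing; a sketch of the pattern semantics using fnmatch (the object keys are hypothetical, and real providers may treat path separators differently):

```python
from fnmatch import fnmatch


def glob_keys(keys, pattern):
    """Return keys matching a shell-style wildcard pattern."""
    return [k for k in keys if fnmatch(k, pattern)]


keys = ["data/a.txt", "data/b.txt", "data/c.bin", "logs/a.txt"]
assert glob_keys(keys, "data/*.txt") == ["data/a.txt", "data/b.txt"]
```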

abstract is_file(path: str) bool[source]

Checks whether the specified key in the storage provider points to a file (as opposed to a folder or directory).

Parameters:

path (str) – The path to check.

Returns:

True if the key points to a file, False if it points to a directory or folder.

Return type:

bool

abstract list_objects(prefix: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False, attribute_filter_expression: str | None = None) Iterator[ObjectMetadata][source]

Lists objects in the storage provider under the specified prefix.

Parameters:
  • prefix (str) – The prefix or path to list objects under.

  • start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.

  • end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.

  • include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

An iterator over object metadata under the specified prefix.

Return type:

Iterator[ObjectMetadata]
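
The start_after (exclusive) and end_at (inclusive) bounds behave like a lexicographic key-range filter; a sketch over an in-memory key list (the keys are hypothetical):

```python
def list_keys(keys, prefix, start_after=None, end_at=None):
    """Yield sorted keys under prefix, after start_after (exclusive),
    up to end_at (inclusive)."""
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        if start_after is not None and key <= start_after:
            continue
        if end_at is not None and key > end_at:
            break
        yield key


keys = ["d/2", "d/1", "d/3", "e/1"]
assert list(list_keys(keys, "d/", start_after="d/1", end_at="d/3")) == ["d/2", "d/3"]
```

Neither bound has to name an existing object; they only delimit the key range.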

abstract put_object(path: str, body: bytes, if_match: str | None = None, if_none_match: str | None = None, attributes: dict[str, str] | None = None) None[source]

Uploads an object to the storage provider.

Parameters:
  • path (str) – The path where the object will be stored.

  • body (bytes) – The content of the object to store.

  • attributes (dict[str, str] | None) – The attributes to add to the file.

  • if_match (str | None) – Optional if-match value for conditional overwrite.

  • if_none_match (str | None) – Optional if-none-match value for conditional creation.

Return type:

None
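
The if_match and if_none_match preconditions follow ETag-style conditional-write semantics; a sketch with an in-memory store (the etag scheme is illustrative, real providers return opaque ETags, and PreconditionFailedError here is a local stand-in):

```python
class PreconditionFailedError(Exception):
    """Local stand-in for the library's PreconditionFailedError."""


def put_object(store, path, body, if_match=None, if_none_match=None):
    """Store body under path, enforcing ETag-style preconditions.
    `store` maps path -> (etag, body); etags here are simple counters."""
    etag, _ = store.get(path, (None, None))
    if if_match is not None and etag != if_match:
        raise PreconditionFailedError(f"if-match failed for {path}")
    if if_none_match == "*" and etag is not None:
        raise PreconditionFailedError(f"object already exists at {path}")
    new_etag = str(int(etag or "0") + 1)
    store[path] = (new_etag, body)
    return new_etag


store = {}
etag = put_object(store, "a", b"v1", if_none_match="*")  # create only if absent
put_object(store, "a", b"v2", if_match=etag)  # update only if unchanged
try:
    put_object(store, "a", b"v3", if_none_match="*")  # object exists -> fails
    raised = False
except PreconditionFailedError:
    raised = True
assert raised and store["a"][1] == b"v2"
```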

abstract upload_file(remote_path: str, f: str | IO, attributes: dict[str, str] | None = None) None[source]

Uploads a file from the local file system to the storage provider.

Parameters:
  • remote_path (str) – The path where the object will be stored.

  • f (str | IO) – The source file to upload. This can either be a string representing the local file path, or a file-like object (e.g., an open file handle).

  • attributes (dict[str, str] | None) – The attributes to add to the file if a new file is created.

Return type:

None

class StorageProviderConfig(type: str, options: dict[str, Any] | None = None)[source]

A data class that represents the configuration needed to initialize a storage provider.

Parameters:
options: dict[str, Any] | None = None

Additional options required to configure the storage provider (e.g., endpoint URLs, region, etc.).

type: str

The name or type of the storage provider (e.g., s3, gcs, oci, azure).

Providers

class PosixFileStorageProvider(base_path: str, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with POSIX file systems.

Parameters:
  • base_path (str) – The root prefix path within the POSIX file system where all operations will be scoped.

  • metric_counters (dict[CounterName, Counter]) – Metric counters.

  • metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.

  • metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.

  • kwargs (Any)

glob(pattern: str, attribute_filter_expression: str | None = None) list[str][source]

Matches and retrieves a list of object keys in the storage provider that match the specified pattern.

Parameters:
  • pattern (str) – The pattern to match object keys against, supporting wildcards (e.g., *.txt).

  • attribute_filter_expression (str | None) – The attribute filter expression to apply to the result.

Returns:

A list of object keys that match the specified pattern.

Return type:

list[str]

is_file(path: str) bool[source]

Checks whether the specified key in the storage provider points to a file (as opposed to a folder or directory).

Parameters:

path (str) – The path to check.

Returns:

True if the key points to a file, False if it points to a directory or folder.

Return type:

bool

rmtree(path: str) None[source]
Parameters:

path (str)

Return type:

None

atomic_write(source: str | IO, destination: str, attributes: dict[str, str] | None = None)[source]

Writes the contents of a file to the specified destination path.

This function ensures that the file write operation is atomic, meaning the output file is either fully written or not modified at all. This is achieved by writing to a temporary file first and then renaming it to the destination path.

Parameters:
  • source (str | IO) – The input file to read from. It can be a string representing the path to a file, or an open file-like object (IO).

  • destination (str) – The path to the destination file where the contents should be written.

  • attributes (dict[str, str] | None) – The attributes to set on the file.
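
The temp-file-then-rename pattern can be sketched with the stdlib; atomic_write_sketch below is illustrative, not the library function:

```python
import os
import tempfile


def atomic_write_sketch(data: bytes, destination: str) -> None:
    """Write to a temp file in the destination directory, then rename it
    into place. os.replace is atomic on POSIX when both paths are on the
    same filesystem, so readers never observe a partially written file."""
    dir_name = os.path.dirname(destination) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, destination)  # atomic rename over the destination
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise


with tempfile.TemporaryDirectory() as d:
    dest = os.path.join(d, "out.bin")
    atomic_write_sketch(b"payload", dest)
    assert open(dest, "rb").read() == b"payload"
```

Placing the temp file in the destination directory (rather than the system temp dir) is what keeps the final rename on one filesystem.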

class Manifest(version: str, parts: list[ManifestPartReference])[source]

A data class representing a dataset manifest.

Parameters:
static from_dict(data: dict) Manifest[source]

Creates a Manifest instance from a dictionary (parsed from JSON).

Parameters:

data (dict)

Return type:

Manifest

parts: list[ManifestPartReference]

References to manifest parts.

to_json() str[source]
Return type:

str

version: str

Defines the version of the manifest schema.
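
A hypothetical manifest document with the documented version and parts fields, round-tripped through json (the real schema may carry additional fields):

```python
import json

# Hypothetical manifest document matching the documented fields.
manifest_json = json.dumps(
    {
        "version": "1",
        "parts": [{"path": "parts/msc_manifest_part000001.jsonl"}],
    }
)

doc = json.loads(manifest_json)
assert doc["version"] == "1"
assert doc["parts"][0]["path"].endswith(".jsonl")
```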

class ManifestMetadataProvider(storage_provider: StorageProvider, manifest_path: str, writable: bool = False)[source]

Creates a ManifestMetadataProvider.

Parameters:
  • storage_provider (StorageProvider) – Storage provider.

  • manifest_path (str) – Main manifest file path.

  • writable (bool) – If true, allows modifications and new manifests to be written.

add_file(path: str, metadata: ObjectMetadata) None[source]

Add a file to be tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).

Parameters:
  • path (str) – User-supplied virtual path

  • metadata (ObjectMetadata) – physical file metadata from StorageProvider

Return type:

None

commit_updates() None[source]

Commit any newly added files; used in conjunction with MetadataProvider.add_file(). The MetadataProvider will persistently record any metadata changes.

Return type:

None
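
The add_file/commit_updates pattern stages changes and persists them in one step; a minimal in-memory sketch (PendingMetadata is illustrative, not a library class):

```python
class PendingMetadata:
    """Sketch of the add_file/commit_updates pattern:
    stage changes, then persist them all at once."""

    def __init__(self):
        self.committed = {}
        self.pending = {}

    def add_file(self, path, metadata):
        # Idempotent: a duplicate call simply overwrites the pending entry.
        self.pending[path] = metadata

    def commit_updates(self):
        self.committed.update(self.pending)
        self.pending.clear()


p = PendingMetadata()
p.add_file("a.txt", {"size": 3})
p.add_file("a.txt", {"size": 3})  # duplicate call is tolerated
assert "a.txt" not in p.committed  # not visible until commit
p.commit_updates()
assert p.committed["a.txt"] == {"size": 3} and not p.pending
```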

get_object_metadata(path: str, include_pending: bool = False) ObjectMetadata[source]

Retrieves metadata or information about an object stored in the provider.

Parameters:
  • path (str) – The path of the object.

  • include_pending (bool) – Whether to include metadata that is not yet committed.

Returns:

A metadata object containing the information about the object.

Return type:

ObjectMetadata

glob(pattern: str) list[str][source]

Matches and retrieves a list of object keys in the storage provider that match the specified pattern.

Parameters:

pattern (str) – The pattern to match object keys against, supporting wildcards (e.g., *.txt).

Returns:

A list of object keys that match the specified pattern.

Return type:

list[str]

is_writable() bool[source]

Returns True if the MetadataProvider supports writes else False.

Return type:

bool

list_objects(prefix: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False) Iterator[ObjectMetadata][source]

Lists objects in the storage provider under the specified prefix.

Parameters:
  • prefix (str) – The prefix or path to list objects under.

  • start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.

  • end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.

  • include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.

Returns:

An iterator over object metadata under the specified prefix.

Return type:

Iterator[ObjectMetadata]

realpath(path: str) tuple[str, bool][source]

Returns the canonical, fully resolved physical path for use by a StorageProvider, translating user-visible virtual paths into the physical paths a StorageProvider requires.

Parameters:

path (str) – user-supplied virtual path

Returns:

A tuple of the canonical physical path and whether the object at that path is valid.

Return type:

tuple[str, bool]
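
A sketch of the virtual-to-physical translation and the validity flag, using a plain dict as a stand-in for the manifest's path table (the paths are hypothetical):

```python
def realpath_sketch(virtual_path, mapping):
    """Translate a user-visible virtual path to a physical path.
    `mapping` stands in for the manifest's path table; the second
    element of the result reports whether the object is known."""
    if virtual_path in mapping:
        return mapping[virtual_path], True
    return virtual_path, False


mapping = {"data/a.txt": "bucket/prefix/a-0001.txt"}
assert realpath_sketch("data/a.txt", mapping) == ("bucket/prefix/a-0001.txt", True)
assert realpath_sketch("data/missing", mapping) == ("data/missing", False)
```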

remove_file(path: str) None[source]

Remove a file tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).

Parameters:

path (str) – User-supplied virtual path

Return type:

None

class ManifestPartReference(path: str)[source]

A data class representing a reference to a dataset manifest part.

Parameters:

path (str)

static from_dict(data: dict[str, Any]) ManifestPartReference[source]

Creates a ManifestPartReference instance from a dictionary.

Parameters:

data (dict[str, Any])

Return type:

ManifestPartReference

path: str

The path of the manifest part relative to the main manifest.

to_dict() dict[source]

Converts ManifestPartReference instance to a dictionary.

Return type:

dict

class AIStoreStorageProvider(endpoint: str = '', provider: str = 'ais', skip_verify: bool = True, ca_cert: str | None = None, timeout: float | tuple[float, float] | None = None, retry: dict[str, Any] | None = None, base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with NVIDIA AIStore.

AIStore client for managing buckets, objects, and ETL jobs.

Parameters:
  • endpoint (str) – The AIStore endpoint.

  • skip_verify (bool) – Whether to skip SSL certificate verification.

  • ca_cert (str | None) – Path to a CA certificate file for SSL verification.

  • timeout (float | tuple[float, float] | None) – Request timeout in seconds; a single float for both connect/read timeouts (e.g., 5.0), a tuple for separate connect/read timeouts (e.g., (3.0, 10.0)), or None to disable timeout.

  • retry (dict[str, Any] | None) – urllib3.util.Retry parameters.

  • token – Authorization token. If not provided, the AIS_AUTHN_TOKEN environment variable will be used.

  • base_path (str) – The root prefix path within the bucket where all operations will be scoped.

  • credentials_provider (CredentialsProvider | None) – The provider to retrieve AIStore credentials.

  • metric_counters (dict[CounterName, Counter]) – Metric counters.

  • metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.

  • metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.

  • provider (str)

  • kwargs (Any)

class StaticAISCredentialProvider(username: str | None = None, password: str | None = None, authn_endpoint: str | None = None, token: str | None = None, skip_verify: bool = True, ca_cert: str | None = None)[source]

A concrete implementation of the multistorageclient.types.CredentialsProvider that provides static AIStore credentials.

Initializes the StaticAISCredentialProvider with the given credentials.

Parameters:
  • username (str | None) – The username for the AIStore authentication.

  • password (str | None) – The password for the AIStore authentication.

  • authn_endpoint (str | None) – The AIStore authentication endpoint.

  • token (str | None) – The AIStore authentication token. This is used for authentication if username, password and authn_endpoint are not provided.

  • skip_verify (bool) – If true, skip SSL certificate verification.

  • ca_cert (str | None) – Path to a CA certificate file for SSL verification.

get_credentials() Credentials[source]

Retrieves the current credentials.

Returns:

The current credentials used for authentication.

Return type:

Credentials

refresh_credentials() None[source]

Refreshes the credentials if they are expired or about to expire.

Return type:

None

class AzureBlobStorageProvider(endpoint_url: str, base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: dict[str, Any])[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with Azure Blob Storage.

Initializes the AzureBlobStorageProvider with the endpoint URL and optional credentials provider.

Parameters:
  • endpoint_url (str) – The Azure storage account URL.

  • base_path (str) – The root prefix path within the container where all operations will be scoped.

  • credentials_provider (CredentialsProvider | None) – The provider to retrieve Azure credentials.

  • metric_counters (dict[CounterName, Counter]) – Metric counters.

  • metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.

  • metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.

  • kwargs (dict[str, Any])

class StaticAzureCredentialsProvider(connection: str)[source]

A concrete implementation of the multistorageclient.types.CredentialsProvider that provides static Azure credentials.

Initializes the StaticAzureCredentialsProvider with the provided connection string.

Parameters:

connection (str) – The connection string for Azure Blob Storage authentication.

get_credentials() Credentials[source]

Retrieves the current credentials.

Returns:

The current credentials used for authentication.

Return type:

Credentials

refresh_credentials() None[source]

Refreshes the credentials if they are expired or about to expire.

Return type:

None

class GoogleIdentityPoolCredentialsProvider(audience: str, token_supplier: str)[source]

A concrete implementation of the multistorageclient.types.CredentialsProvider that provides Google’s identity pool credentials.

Initializes the GoogleIdentityPoolCredentials with the audience and token supplier.

Parameters:
  • audience (str) – The audience for the Google Identity Pool.

  • token_supplier (str) – The token supplier for the Google Identity Pool.

get_credentials() Credentials[source]

Retrieves the current credentials.

Returns:

The current credentials used for authentication.

Return type:

Credentials

refresh_credentials() None[source]

Refreshes the credentials if they are expired or about to expire.

Return type:

None

class GoogleStorageProvider(project_id: str = '', endpoint_url: str = '', base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with Google Cloud Storage.

Initializes the GoogleStorageProvider with the project ID and optional credentials provider.

Parameters:
  • project_id (str) – The Google Cloud project ID.

  • endpoint_url (str) – The custom endpoint URL for the GCS service.

  • base_path (str) – The root prefix path within the bucket where all operations will be scoped.

  • credentials_provider (CredentialsProvider | None) – The provider to retrieve GCS credentials.

  • metric_counters (dict[CounterName, Counter]) – Metric counters.

  • metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.

  • metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.

  • kwargs (Any)

class StringTokenSupplier(token: str)[source]

Supply a string token to the Google Identity Pool.

Parameters:

token (str)

get_subject_token(context, request)[source]

Returns the requested subject token. The subject token must be valid.

Args:

context (google.auth.externalaccount.SupplierContext): The context object containing information about the requested audience and subject token type.

request (google.auth.transport.Request): The object used to make HTTP requests.

Raises:

google.auth.exceptions.RefreshError: If an error is encountered during subject token retrieval logic.

Returns:

str: The requested subject token string.

class GoogleS3StorageProvider(*args, **kwargs)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with GCS via its S3 interface.

Initializes the GoogleS3StorageProvider with the region, endpoint URL, and optional credentials provider.

Parameters:
  • region_name – The AWS region where the S3 bucket is located.

  • endpoint_url – The custom endpoint URL for the S3 service.

  • base_path – The root prefix path within the S3 bucket where all operations will be scoped.

  • credentials_provider – The provider to retrieve S3 credentials.

  • metric_counters – Metric counters.

  • metric_gauges – Metric gauges.

  • metric_attributes_providers – Metric attributes providers.

class OracleStorageProvider(namespace: str, base_path: str = '', credentials_provider: CredentialsProvider | None = None, retry_strategy: dict[str, Any] | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with Oracle Cloud Infrastructure (OCI) Object Storage.

Initializes an instance of OracleStorageProvider.

Parameters:
  • namespace (str) – The OCI Object Storage namespace. This is a unique identifier assigned to each tenancy.

  • base_path (str) – The root prefix path within the bucket where all operations will be scoped.

  • credentials_provider (CredentialsProvider | None) – The provider to retrieve OCI credentials.

  • retry_strategy (dict[str, Any] | None) – oci.retry.RetryStrategyBuilder parameters.

  • metric_counters (dict[CounterName, Counter]) – Metric counters.

  • metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.

  • metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.

  • kwargs (Any)

class S3StorageProvider(region_name: str = '', endpoint_url: str = '', base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with Amazon S3 or S3-compatible object stores.

Initializes the S3StorageProvider with the region, endpoint URL, and optional credentials provider.

Parameters:
  • region_name (str) – The AWS region where the S3 bucket is located.

  • endpoint_url (str) – The custom endpoint URL for the S3 service.

  • base_path (str) – The root prefix path within the S3 bucket where all operations will be scoped.

  • credentials_provider (CredentialsProvider | None) – The provider to retrieve S3 credentials.

  • metric_counters (dict[CounterName, Counter]) – Metric counters.

  • metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.

  • metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.

  • kwargs (Any)

class StaticS3CredentialsProvider(access_key: str, secret_key: str, session_token: str | None = None)[source]

A concrete implementation of the multistorageclient.types.CredentialsProvider that provides static S3 credentials.

Initializes the StaticS3CredentialsProvider with the provided access key, secret key, and optional session token.

Parameters:
  • access_key (str) – The access key for S3 authentication.

  • secret_key (str) – The secret key for S3 authentication.

  • session_token (str | None) – An optional session token for temporary credentials.

get_credentials() Credentials[source]

Retrieves the current credentials.

Returns:

The current credentials used for authentication.

Return type:

Credentials

refresh_credentials() None[source]

Refreshes the credentials if they are expired or about to expire.

Return type:

None

class S8KStorageProvider(*args, **kwargs)[source]

A concrete implementation of the multistorageclient.types.StorageProvider for interacting with SwiftStack.

Initializes the S8KStorageProvider with the region, endpoint URL, and optional credentials provider.

Parameters:
  • region_name – The AWS region where the S3 bucket is located.

  • endpoint_url – The custom endpoint URL for the S3 service.

  • base_path – The root prefix path within the S3 bucket where all operations will be scoped.

  • credentials_provider – The provider to retrieve S3 credentials.

  • metric_counters – Metric counters.

  • metric_gauges – Metric gauges.

  • metric_attributes_providers – Metric attributes providers.

Telemetry

class Telemetry[source]

Provides telemetry resources.

Instances shouldn’t be copied between processes. Not fork-safe or pickleable.

Instances can be shared between processes by registering with a multiprocessing.managers.BaseManager and using proxy objects.

class TelemetryManager(address=None, authkey=None, serializer='pickle', ctx=None)[source]

A multiprocessing.managers.BaseManager for telemetry resources.

The OpenTelemetry Python SDK isn’t fork-safe since telemetry sample buffers can be duplicated.

In addition, Python ≤3.12 doesn’t call exit handlers for forked processes. This causes the OpenTelemetry Python SDK to not flush telemetry before exiting.

Forking is multiprocessing’s default start method for non-macOS POSIX systems until Python 3.14.

To fully support multiprocessing, resampling + publishing is handled by a single process that’s (ideally) a child of (i.e. directly under) the main process. This:

  • Relieves other processes of this work.

    • Avoids issues with duplicate samples when forking and unpublished samples when exiting forks.

  • Allows cross-process resampling.

  • Reuses a single connection pool to telemetry backends.

The downside is that it essentially re-introduces a global interpreter lock (GIL)-like bottleneck with additional IPC overhead. Telemetry operations, however, should be lightweight, so this isn’t expected to be a problem; remote data store latency should still be the primary throughput limiter for storage clients.

multiprocessing.managers.BaseManager is used for this since it creates a separate server process for shared objects.

Telemetry resources are provided as proxy objects for location transparency.

The upstream documentation isn’t particularly detailed, but others have written comprehensively on this topic.

By specification, metric and tracer providers must call shutdown on any underlying metric readers + span processors + exporters.

In the OpenTelemetry Python SDK, provider shutdown is called automatically by exit handlers (when they work at least). Consequently, clients should:

  • Only receive proxy objects.

    • Enables metric reader + span processor + exporter re-use across processes.

  • Never call shutdown on the proxy objects.

    • The shutdown exit handler is registered on the manager’s server process.

    • ⚠️ We expect a finite number of providers (i.e. no dynamic configs) so we don’t leak them.

class TelemetryMode(value)[source]

How to create a Telemetry object.

CLIENT = 'client'

Connect to a telemetry IPC server.

LOCAL = 'local'

Keep everything local to the process (not fork-safe).

SERVER = 'server'

Start + connect to a telemetry IPC server.

init(mode: TelemetryMode = TelemetryMode.SERVER, address: str | tuple[str, int] | None = None) Telemetry[source]

Create or return an existing Telemetry instance or Telemetry proxy object.

Parameters:
  • mode (TelemetryMode) – How to create the Telemetry object.

  • address (str | tuple[str, int] | None) – The address of the telemetry IPC server, if applicable.

Returns:

A telemetry instance.

Return type:

Telemetry

Attributes

class AttributesProvider[source]

Provides opentelemetry.util.types.Attributes.

abstract attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None

class EnvironmentVariablesAttributesProvider(attributes: Mapping[str, str])[source]

Provides opentelemetry.util.types.Attributes from environment variables.

Parameters:

attributes (Mapping[str, str]) – Map of attribute key to environment variable key.

attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None
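
The mapping resolves each attribute key from the named environment variable; a sketch of the lookup (skipping unset variables is an assumption about the behavior):

```python
import os


def env_attributes(mapping):
    """Resolve attribute values from environment variables.
    `mapping` is attribute key -> environment variable name; unset
    variables are skipped in this sketch."""
    return {attr: os.environ[var] for attr, var in mapping.items() if var in os.environ}


os.environ["MY_CLUSTER"] = "cluster-a"  # hypothetical variable for illustration
attrs = env_attributes({"cluster": "MY_CLUSTER", "rack": "MY_RACK"})
assert attrs == {"cluster": "cluster-a"}
```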

class HostAttributesProvider(attributes: Mapping[str, str])[source]

Provides opentelemetry.util.types.Attributes from host information.

Parameters:

attributes (Mapping[str, str]) – Map of attribute key to host attribute.

class HostAttribute(value)[source]

Host attribute.

Use the enum value in the attributes dictionary values.

NAME = 'name'

Hostname.

attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None

class MSCConfigAttributesProvider(attributes: Mapping[str, AttributeValueOptions], config_dict: Mapping[str, Any])[source]

Provides opentelemetry.util.types.Attributes from a multi-storage client configuration.

Parameters:
class AttributeValueOptions[source]

MSC configuration attribute value options.

expression: str

JMESPath expression.

Additional JMESPath functions:

  • hash(algorithm: str, value: str)
    • Calculate the hash digest of a value using a specific hash algorithm (e.g. sha3-256).

    • See hashlib.new() for algorithms.

attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None

class ProcessAttributesProvider(attributes: Mapping[str, str])[source]

Provides opentelemetry.util.types.Attributes from current process information.

Parameters:

attributes (Mapping[str, str]) – Map of attribute key to process attribute.

class ProcessAttribute(value)[source]

Process attribute.

Use the enum value in the attributes dictionary values.

PID = 'pid'

Process ID.

attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None

class StaticAttributesProvider(attributes: Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None)[source]

Provides opentelemetry.util.types.Attributes from static attributes.

Parameters:

attributes (Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None) – Map of attribute key to static attribute value.

attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None

class ThreadAttributesProvider(attributes: Mapping[str, str])[source]

Provides opentelemetry.util.types.Attributes from current thread information.

Parameters:

attributes (Mapping[str, str]) – Map of attribute key to thread attribute.

class ThreadAttribute(value)[source]

Thread attribute.

Use the enum value in the attributes dictionary values.

IDENT = 'ident'

Python thread ID.

NATIVE_ID = 'native_id'

OS thread ID.

attributes() Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None[source]

Collect attributes.

Return type:

Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None

Metrics

Readers

class DiperiodicExportingMetricReader(exporter: MetricExporter, collect_interval_millis: float | None = None, collect_timeout_millis: float | None = None, export_interval_millis: float | None = None, export_timeout_millis: float | None = None)[source]

opentelemetry.sdk.metrics.export.MetricReader that collects + exports metrics on separate user-configurable time intervals. This is in contrast with opentelemetry.sdk.metrics.export.PeriodicExportingMetricReader which couples them with a 1 minute default.

The metrics collection interval limits the temporal resolution. Most metric backends have 1 millisecond or finer temporal resolution.

Parameters:
  • exporter (MetricExporter) – Metrics exporter.

  • collect_interval_millis (float | None) – Collect interval in milliseconds.

  • collect_timeout_millis (float | None) – Collect timeout in milliseconds.

  • export_interval_millis (float | None) – Export interval in milliseconds.

  • export_timeout_millis (float | None) – Export timeout in milliseconds.

force_flush(timeout_millis: float = 40000) bool[source]
Parameters:

timeout_millis (float)

Return type:

bool

shutdown(timeout_millis: float = 40000, **kwargs) None[source]

Shuts down the MetricReader. This method provides a way for the MetricReader to do any cleanup required. A metric reader can only be shut down once; any subsequent calls are ignored and return a failure status.

When a MetricReader is registered on a MeterProvider, shutdown() will invoke this automatically.

Parameters:

timeout_millis (float)

Return type:

None
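A sketch of how the two intervals might be chosen. The reader construction is shown as a comment because the concrete MetricExporter depends on your OpenTelemetry setup; the interval values are illustrative:

```python
# Collect frequently to keep temporal resolution high; export less often to
# amortize exporter overhead. Values are illustrative.
collect_interval_millis = 100.0     # collect every 100 ms
export_interval_millis = 60_000.0   # export once per minute

# Each export cycle batches roughly this many collections:
collections_per_export = export_interval_millis / collect_interval_millis
print(collections_per_export)  # prints 600.0

# Hypothetical wiring, assuming an exporter instance is available:
# reader = DiperiodicExportingMetricReader(
#     exporter,
#     collect_interval_millis=collect_interval_millis,
#     export_interval_millis=export_interval_millis,
# )
```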

Generators

class ManifestMetadataGenerator[source]

Generates a file metadata manifest for use with a multistorageclient.providers.ManifestMetadataProvider.

static generate_and_write_manifest(data_storage_client: StorageClient, manifest_storage_client: StorageClient, partition_keys: list[str] | None = None) None[source]

Generates a file metadata manifest.

The data storage client’s base path should be set to the root path for data objects (e.g. my-bucket/my-data-prefix).

The manifest storage client’s base path should be set to the root path for manifest objects (e.g. my-bucket/my-manifest-prefix).

The following manifest objects will be written with the manifest storage client (with the total number of manifest parts being variable):

.msc_manifests/
├── msc_manifest_index.json
└── parts/
    ├── msc_manifest_part000001.jsonl
    ├── ...
    └── msc_manifest_part999999.jsonl
Parameters:
  • data_storage_client (StorageClient) – Storage client for reading data objects.

  • manifest_storage_client (StorageClient) – Storage client for writing manifest objects.

  • partition_keys (list[str] | None) – Optional list of keys to partition the listing operation. If provided, objects will be listed concurrently using these keys as boundaries.

Return type:

None

Higher-Level Libraries

fsspec

class MultiStorageAsyncFileSystem(*args, **kwargs)[source]

Custom fsspec.asyn.AsyncFileSystem implementation for MSC protocol (msc://). Uses multistorageclient.StorageClient for backend operations.

Initializes the MultiStorageAsyncFileSystem.

Parameters:

kwargs – Additional arguments for the fsspec.asyn.AsyncFileSystem.

static asynchronize_sync(func: Callable[[...], Any], *args: Any, **kwargs: Any) Any[source]

Runs a synchronous function asynchronously using asyncio.

Parameters:
  • func (Callable[[...], Any]) – The synchronous function to be executed asynchronously.

  • args (Any) – Positional arguments to pass to the function.

  • kwargs (Any) – Keyword arguments to pass to the function.

Returns:

The result of the asynchronous execution of the function.

Return type:

Any

cat_file(path: str, **kwargs: Any) bytes[source]

Reads the contents of a file at the given path.

Parameters:
  • path (str) – The file path to read from.

  • kwargs (Any) – Additional arguments for file reading functionality.

Returns:

The contents of the file as bytes.

Return type:

bytes

cp_file(path1: str, path2: str, **kwargs: Any)[source]

Copies a file from the source path to the destination path.

Parameters:
  • path1 (str) – The source file path.

  • path2 (str) – The destination file path.

  • kwargs (Any) – Additional arguments for copy functionality.

Raises:

AttributeError – If the source and destination paths are associated with different profiles.

get_file(rpath: str, lpath: str, **kwargs: Any) None[source]

Downloads a file from the remote path to the local path.

Parameters:
  • rpath (str) – The remote path of the file to download.

  • lpath (str) – The local path to store the file.

  • kwargs (Any) – Additional arguments for file retrieval functionality.

Return type:

None

info(path: str, **kwargs: Any) dict[str, Any][source]

Retrieves metadata information for a file.

Parameters:
  • path (str) – The file path to retrieve information for.

  • kwargs (Any) – Additional arguments for info functionality.

Returns:

A dictionary containing file metadata such as ETag, last modified, and size.

Return type:

dict[str, Any]

ls(path: str, detail: bool = True, **kwargs: Any) list[dict[str, Any]] | list[str][source]

Lists the contents of a directory.

Parameters:
  • path (str) – The directory path to list.

  • detail (bool) – Whether to return detailed information for each file.

  • kwargs (Any) – Additional arguments for list functionality.

Returns:

A list of file names or detailed information depending on the ‘detail’ argument.

Return type:

list[dict[str, Any]] | list[str]

open(path: str, mode: str = 'rb', **kwargs: Any) PosixFile | ObjectFile[source]

Opens a file at the given path.

Parameters:
  • path (str) – The file path to open.

  • mode (str) – The mode in which to open the file.

  • kwargs (Any) – Additional arguments for file opening.

Returns:

A PosixFile or ObjectFile object representing the opened file.

Return type:

PosixFile | ObjectFile

pipe_file(path: str, value: bytes, **kwargs: Any) None[source]

Writes a value (bytes) directly to a file at the given path.

Parameters:
  • path (str) – The file path to write the value to.

  • value (bytes) – The bytes to write to the file.

  • kwargs (Any) – Additional arguments for writing functionality.

Return type:

None

protocol: ClassVar[str | tuple[str, ...]] = 'msc'

put_file(lpath: str, rpath: str, **kwargs: Any) None[source]

Uploads a local file to the remote path.

Parameters:
  • lpath (str) – The local path of the file to upload.

  • rpath (str) – The remote path to store the file.

  • kwargs (Any) – Additional arguments for file upload functionality.

Return type:

None

resolve_path_and_storage_client(path: str | PathLike) tuple[StorageClient, str][source]

Resolves the path and retrieves the associated multistorageclient.StorageClient.

Parameters:

path (str | PathLike) – The file path to resolve.

Returns:

A tuple containing the multistorageclient.StorageClient and the resolved path.

Return type:

tuple[StorageClient, str]

rm_file(path: str, **kwargs: Any)[source]

Removes a file.

Parameters:
  • path (str) – The file or directory path to remove.

  • kwargs (Any) – Additional arguments for remove functionality.

NumPy

load(*args: Any, **kwargs: Any) ndarray | dict[str, ndarray] | NpzFile[source]

Adapt numpy.load.

Parameters:
Return type:

ndarray | dict[str, ndarray] | NpzFile

memmap(*args: Any, **kwargs: Any) memmap[source]

Adapt numpy.memmap.

Parameters:
Return type:

memmap

save(*args: Any, **kwargs: Any) None[source]

Adapt numpy.save.

Parameters:
Return type:

None

PyTorch

class MultiStorageFileSystem[source]

A filesystem implementation that uses the MultiStoragePath class to handle paths.

concat_path(path: str | PathLike, suffix: str) str | PathLike[source]
Parameters:
Return type:

str | PathLike

create_stream(path: str | PathLike, mode: str) Generator[IOBase, None, None][source]
Parameters:
Return type:

Generator[IOBase, None, None]

exists(path: str | PathLike) bool[source]
Parameters:

path (str | PathLike)

Return type:

bool

init_path(path: str | PathLike) str | PathLike[source]
Parameters:

path (str | PathLike)

Return type:

str | PathLike

ls(path: str | PathLike) list[str][source]
Parameters:

path (str | PathLike)

Return type:

list[str]

mkdir(path: str | PathLike) None[source]
Parameters:

path (str | PathLike)

Return type:

None

rename(path: str | PathLike, new_path: str | PathLike) None[source]
Parameters:
Return type:

None

rm_file(path: str | PathLike) None[source]
Parameters:

path (str | PathLike)

Return type:

None

classmethod validate_checkpoint_id(checkpoint_id: str | PathLike) bool[source]
Parameters:

checkpoint_id (str | PathLike)

Return type:

bool

class MultiStorageFileSystemReader(path: str | PathLike, thread_count: int = 1)[source]

A reader implementation that uses the MultiStorageFileSystem class to handle file system operations.

Initialize the MultiStorageFileSystemReader with the MultiStorageFileSystem.

Parameters:
  • path (str | PathLike) – The path to the checkpoint.

  • thread_count (int) – The number of threads to use for prefetching.

read_data(plan: LoadPlan, planner: LoadPlanner) Future[None][source]

Override the method to prefetch objects from object storage.

Parameters:
  • plan (LoadPlan)

  • planner (LoadPlanner)

Return type:

Future[None]

classmethod validate_checkpoint_id(checkpoint_id: str | PathLike) bool[source]

Check if the given checkpoint_id is supported by the storage. This allows us to enable automatic storage selection.

Parameters:

checkpoint_id (str | PathLike)

Return type:

bool

class MultiStorageFileSystemWriter(path: str | PathLike, single_file_per_rank: bool = True, sync_files: bool = True, thread_count: int = 1, per_thread_copy_ahead: int = 10000000, cache_staged_state_dict: bool = False, overwrite: bool = True)[source]

A writer implementation that uses the MultiStorageFileSystem class to handle file system operations.

Initialize the MultiStorageFileSystemWriter with the MultiStorageFileSystem.

Parameters:
  • path (str | PathLike)

  • single_file_per_rank (bool)

  • sync_files (bool)

  • thread_count (int)

  • per_thread_copy_ahead (int)

  • cache_staged_state_dict (bool)

  • overwrite (bool)

classmethod validate_checkpoint_id(checkpoint_id: str | PathLike) bool[source]

Check if the given checkpoint_id is supported by the storage. This allows us to enable automatic storage selection.

Parameters:

checkpoint_id (str | PathLike)

Return type:

bool

load(f: str | PathLike[str] | IO[bytes], *args: Any, **kwargs: Any) Any[source]

Adapt torch.load.

Parameters:
Return type:

Any

save(obj: object, f: str | PathLike[str] | IO[bytes], *args: Any, **kwargs: Any) Any[source]

Adapt torch.save.

Parameters:
Return type:

Any

Xarray

open_zarr(*args: Any, **kwargs: Any) Dataset[source]

Adapt xarray.open_zarr to use multistorageclient.contrib.zarr.LazyZarrStore when path matches the msc protocol.

If the path starts with the MSC protocol, it uses multistorageclient.contrib.zarr.LazyZarrStore with a resolved storage client and prefix, passing msc_max_workers if provided. Otherwise, it directly calls xarray.open_zarr.

Parameters:
Return type:

Dataset

Zarr

class LazyZarrStore(storage_client: StorageClient, prefix: str = '', msc_max_workers: int | None = None)[source]
Parameters:
  • storage_client (StorageClient)

  • prefix (str)

  • msc_max_workers (int | None)

getitems(keys: Sequence[str], *, contexts: Any) Mapping[str, Any][source]

Retrieve data from multiple keys.

Parameters:
  • keys (Sequence[str]) – The keys to retrieve.

  • contexts (Any) – A mapping of keys to their context. Each context is a mapping of store-specific information, e.g. a dict telling the store the preferred output array type: {"meta_array": cupy.empty(())}.

Returns:

A collection mapping the input keys to their results.

Return type:

Mapping[str, Any]

Notes

The default implementation uses __getitem__() to read each key sequentially and ignores contexts. Override this method to implement concurrent reads of multiple keys and/or to utilize the contexts.

keys() Iterator[str][source]

Returns an iterator over the keys in the store.
Return type:

Iterator[str]

open_consolidated(*args: Any, **kwargs: Any) Group[source]

Adapt zarr.open_consolidated to use LazyZarrStore when path matches the msc protocol.

If the path starts with the MSC protocol, it uses LazyZarrStore with a resolved storage client and prefix, passing msc_max_workers if provided. Otherwise, it directly calls zarr.open_consolidated.

Parameters:
Return type:

Group