API Reference¶
Core¶
- class multistorageclient.CacheConfig(size: str, use_etag: bool = True, eviction_policy: ~multistorageclient.caching.cache_config.EvictionPolicyConfig = <factory>, backend: ~multistorageclient.caching.cache_config.CacheBackendConfig = <factory>)[source]¶
Configuration for the CacheManager.
This class defines the complete configuration for the cache system, including size limits, etag usage, eviction policy, and backend settings.
- Parameters:
size (str)
use_etag (bool)
eviction_policy (EvictionPolicyConfig)
backend (CacheBackendConfig)
- backend: CacheBackendConfig¶
Cache backend configuration. Default is filesystem.
- eviction_policy: EvictionPolicyConfig¶
Cache eviction policy configuration. Default is LRU with 300s refresh.
- get_eviction_policy() str [source]¶
Get the eviction policy.
- Returns:
The current eviction policy type.
- Return type:
str
- get_storage_provider_profile() str | None [source]¶
Get the storage provider profile.
- Returns:
The storage provider profile name if set, None otherwise.
- Return type:
str | None
- multistorageclient.Path¶
alias of
MultiStoragePath
- class multistorageclient.StorageClient(config: StorageClientConfig)[source]¶
A client for interacting with different storage providers.
Initializes the StorageClient with the given configuration.
- Parameters:
config (StorageClientConfig) – The configuration object for the storage client.
- commit_metadata(prefix: str | None = None) None [source]¶
Commits any pending updates to the metadata provider. No-op if not using a metadata provider.
- Parameters:
prefix (str | None) – If provided, scans the prefix to find files to commit.
- Return type:
None
- copy(src_path: str, dest_path: str) None [source]¶
Copies an object from source to destination in the storage provider.
- delete(path: str, recursive: bool = False) None [source]¶
Deletes an object from the storage provider at the specified path.
- glob(pattern: str, include_url_prefix: bool = False) list[str] [source]¶
Matches and retrieves a list of objects in the storage provider that match the specified pattern.
- info(path: str, strict: bool = True) ObjectMetadata [source]¶
Retrieves metadata or information about an object stored at the specified path.
- Parameters:
path (str)
strict (bool)
- Returns:
An ObjectMetadata instance containing metadata about the object.
- Return type:
ObjectMetadata
- is_default_profile() bool [source]¶
Return True if the storage client is using the default profile.
- Return type:
bool
- is_empty(path: str) bool [source]¶
Checks whether the specified path is empty. A path is considered empty if there are no objects whose keys start with the given path as a prefix.
- is_file(path: str) bool [source]¶
Checks whether the specified path points to a file (rather than a directory or folder).
- list(prefix: str = '', start_after: str | None = None, end_at: str | None = None, include_directories: bool = False, include_url_prefix: bool = False) Iterator[ObjectMetadata] [source]¶
Lists objects in the storage provider under the specified prefix.
- Parameters:
prefix (str) – The prefix to list objects under.
start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.
end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.
include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.
include_url_prefix (bool) – Whether to include the URL prefix msc://profile in the result.
- Returns:
An iterator over objects.
- Return type:
Iterator[ObjectMetadata]
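The start_after and end_at bounds described above can be illustrated with a minimal standalone sketch. It does not call the library: list_keys is a hypothetical helper, and it assumes keys are compared lexicographically, which the reference does not state explicitly.

```python
from typing import Iterable, Iterator, Optional


def list_keys(
    keys: Iterable[str],
    prefix: str = "",
    start_after: Optional[str] = None,
    end_at: Optional[str] = None,
) -> Iterator[str]:
    """Yield keys under prefix that fall in the range (start_after, end_at]."""
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        if start_after is not None and key <= start_after:
            continue  # exclusive lower bound
        if end_at is not None and key > end_at:
            break  # inclusive upper bound; sorted input lets us stop early
        yield key


keys = ["data/a.bin", "data/b.bin", "data/c.bin", "logs/x.txt"]
# Neither bound has to be an existing key: "data/c" is not in the listing.
selected = list(list_keys(keys, prefix="data/", start_after="data/a.bin", end_at="data/c"))
print(selected)
```

Here "data/a.bin" is skipped because the lower bound is exclusive, and "data/c.bin" is skipped because it sorts after the upper bound "data/c".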
- open(path: str, mode: str = 'rb', buffering: int = -1, encoding: str | None = None, disable_read_cache: bool = False, memory_load_limit: int = 536870912, atomic: bool = True, check_source_version: SourceVersionCheckMode = SourceVersionCheckMode.INHERIT) PosixFile | ObjectFile [source]¶
Returns a file-like object from the storage provider at the specified path.
- Parameters:
path (str) – The path of the object to read.
mode (str) – The file mode. Only “w”, “r”, “a”, “wb”, “rb” and “ab” are supported.
buffering (int) – The buffering mode. Only applies to PosixFile.
encoding (str | None) – The encoding to use for text files.
disable_read_cache (bool) – When set to True, disables caching for the file content. This parameter is only applicable to ObjectFile when the mode is “r” or “rb”.
memory_load_limit (int) – Size limit in bytes for loading files into memory. Defaults to 512 MiB (536870912 bytes). This parameter is only applicable to ObjectFile when the mode is “r” or “rb”.
atomic (bool) – When set to True, the file will be written atomically (rename upon close). This parameter is only applicable to PosixFile in write mode.
check_source_version (SourceVersionCheckMode) – Whether to check the source version of cached objects.
- Returns:
A file-like object (PosixFile or ObjectFile) for the specified path.
- Return type:
PosixFile | ObjectFile
- sync_from(source_client: StorageClient, source_path: str = '', target_path: str = '', delete_unmatched_files: bool = False, description: str = 'Syncing', num_worker_processes: int | None = None) None [source]¶
Syncs files from source_path on the source storage client to target_path on this client.
- Parameters:
source_client (StorageClient) – The source storage client.
source_path (str) – The path to sync from.
target_path (str) – The path to sync to.
delete_unmatched_files (bool) – Whether to delete files at the target that are not present at the source.
description (str) – Description of sync process for logging purposes.
num_worker_processes (int | None) – The number of worker processes to use.
- Return type:
None
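The effect of delete_unmatched_files can be sketched as a set difference over keys. This is an illustration only: plan_sync is a hypothetical helper, it looks at key presence alone, and it does not model how the library compares files that exist on both sides.

```python
from typing import Iterable


def plan_sync(
    source_keys: Iterable[str],
    target_keys: Iterable[str],
    delete_unmatched_files: bool = False,
) -> tuple[list[str], list[str]]:
    """Return (keys to copy, keys to delete) for a one-way sync."""
    source, target = set(source_keys), set(target_keys)
    # Keys missing at the target need to be copied.
    to_copy = sorted(source - target)
    # Keys present only at the target are removed when requested.
    to_delete = sorted(target - source) if delete_unmatched_files else []
    return to_copy, to_delete


to_copy, to_delete = plan_sync(
    ["a.txt", "b.txt"], ["b.txt", "stale.txt"], delete_unmatched_files=True
)
print(to_copy, to_delete)
```

With the default delete_unmatched_files=False, the second list is always empty and the target keeps any extra files.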
- class multistorageclient.StorageClientConfig(profile: str, storage_provider: StorageProvider, credentials_provider: CredentialsProvider | None = None, metadata_provider: MetadataProvider | None = None, cache_config: CacheConfig | None = None, cache_manager: CacheBackend | None = None, retry_config: RetryConfig | None = None)[source]¶
Configuration class for the multistorageclient.StorageClient.
- Parameters:
profile (str)
storage_provider (StorageProvider)
credentials_provider (CredentialsProvider | None)
metadata_provider (MetadataProvider | None)
cache_config (CacheConfig | None)
cache_manager (CacheBackend | None)
retry_config (RetryConfig | None)
- cache_config: CacheConfig | None¶
- credentials_provider: CredentialsProvider | None¶
- static from_dict(config_dict: dict[str, Any], profile: str = 'default', skip_validation: bool = False, telemetry: Telemetry | None = None) StorageClientConfig [source]¶
- static from_file(profile: str = 'default', telemetry: Telemetry | None = None) StorageClientConfig [source]¶
- Parameters:
profile (str)
telemetry (Telemetry | None)
- Return type:
StorageClientConfig
- static from_json(config_json: str, profile: str = 'default', telemetry: Telemetry | None = None) StorageClientConfig [source]¶
- Parameters:
config_json (str)
profile (str)
telemetry (Telemetry | None)
- Return type:
StorageClientConfig
- static from_provider_bundle(config_dict: dict[str, Any], provider_bundle: ProviderBundle, telemetry: Telemetry | None = None) StorageClientConfig [source]¶
- Parameters:
config_dict (dict[str, Any])
provider_bundle (ProviderBundle)
telemetry (Telemetry | None)
- Return type:
StorageClientConfig
- static from_yaml(config_yaml: str, profile: str = 'default', telemetry: Telemetry | None = None) StorageClientConfig [source]¶
- Parameters:
config_yaml (str)
profile (str)
telemetry (Telemetry | None)
- Return type:
StorageClientConfig
- metadata_provider: MetadataProvider | None¶
- static read_path_mapping() PathMapping [source]¶
Get the path mapping defined in the MSC configuration.
Path mappings create a nested structure of protocol -> bucket -> [(prefix, profile)] where entries are sorted by prefix length (longest first) for optimal matching. Longer paths take precedence when matching.
- Returns:
A PathMapping instance with translation mappings
- Return type:
PathMapping
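The nested protocol -> bucket -> [(prefix, profile)] structure and its longest-prefix-first matching can be sketched as follows. This is not the library's PathMapping class: build_mapping, find_profile, and the input dictionary format are hypothetical and only illustrate the ordering rule described above.

```python
from typing import Optional
from urllib.parse import urlparse

# protocol -> bucket -> [(prefix, profile)]
MappingTable = dict[str, dict[str, list[tuple[str, str]]]]


def build_mapping(entries: dict[str, str]) -> MappingTable:
    """Group source URLs by protocol and bucket, longest prefix first."""
    mapping: MappingTable = {}
    for url, profile in entries.items():
        parsed = urlparse(url)
        prefix = parsed.path.lstrip("/")
        mapping.setdefault(parsed.scheme, {}).setdefault(parsed.netloc, []).append((prefix, profile))
    for buckets in mapping.values():
        for pairs in buckets.values():
            pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    return mapping


def find_profile(mapping: MappingTable, url: str) -> Optional[tuple[str, str]]:
    """Return (profile, remaining path) for the first, i.e. longest, matching prefix."""
    parsed = urlparse(url)
    path = parsed.path.lstrip("/")
    for prefix, profile in mapping.get(parsed.scheme, {}).get(parsed.netloc, []):
        if path.startswith(prefix):
            return profile, path[len(prefix):]
    return None


mapping = build_mapping({
    "s3://bucket1/datasets/": "profile-a",          # hypothetical profile names
    "s3://bucket1/datasets/images/": "profile-b",
})
print(find_profile(mapping, "s3://bucket1/datasets/images/cat.png"))
```

Because the entries are sorted by prefix length, the more specific datasets/images/ mapping wins over datasets/ without any extra bookkeeping at lookup time.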
- retry_config: RetryConfig | None¶
- storage_provider: StorageProvider¶
- multistorageclient.commit_metadata(url: str) None [source]¶
Commits the metadata updates for the specified storage client profile.
- Parameters:
url (str) – The URL of the path to commit metadata for.
- Return type:
None
- multistorageclient.delete(url: str, recursive: bool = False) None [source]¶
Deletes the specified object(s) from the storage provider.
This function retrieves the corresponding
multistorageclient.StorageClient
for the given URL and deletes the object(s) at the specified path.
- multistorageclient.download_file(url: str, local_path: str) None [source]¶
Downloads the file at the given URL to a local path.
The function uses the multistorageclient.StorageClient to download a file (object) at the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.
- Parameters:
url (str)
local_path (str)
- Raises:
ValueError – If the URL’s protocol does not match the expected protocol msc.
- Return type:
None
- multistorageclient.get_telemetry() Telemetry | None [source]¶
Get the Telemetry instance to use for storage clients created by shortcuts.
- Returns:
A telemetry instance.
- Return type:
Telemetry | None
- multistorageclient.glob(pattern: str) list[str] [source]¶
Return a list of files matching a pattern.
This function supports glob-style patterns for matching multiple files within a storage system. The pattern is parsed, and the associated multistorageclient.StorageClient is used to retrieve the list of matching files.
- Parameters:
pattern (str) – The glob-style pattern to match files (example: msc://profile/prefix/**/*.tar).
- Returns:
A list of file paths matching the pattern.
- Raises:
ValueError – If the URL’s protocol does not match the expected protocol msc.
- Return type:
list[str]
- multistorageclient.is_empty(url: str) bool [source]¶
Checks whether the specified URL contains any objects.
- Parameters:
url (str) – The URL to check, typically pointing to a storage location.
- Returns:
True if there are no objects/files under this URL, False otherwise.
- Raises:
ValueError – If the URL’s protocol does not match the expected protocol msc.
- Return type:
bool
- multistorageclient.is_file(url: str) bool [source]¶
Checks whether the specified url points to a file (rather than a directory or folder).
The function uses the multistorageclient.StorageClient to check if a file (object) exists at the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.
- multistorageclient.list(url: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False) Iterator[ObjectMetadata] [source]¶
Lists the contents of the specified URL prefix.
This function retrieves the corresponding multistorageclient.StorageClient for the given URL and returns an iterator of objects (files or directories) stored under the provided prefix.
- Parameters:
url (str) – The prefix to list objects under.
start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.
end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.
include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.
- Returns:
An iterator of ObjectMetadata objects representing the files (and optionally directories) accessible under the specified URL prefix. The returned keys will always be prefixed with msc://.
- Return type:
Iterator[ObjectMetadata]
- multistorageclient.open(url: str, mode: str = 'rb', **kwargs: Any) PosixFile | ObjectFile [source]¶
Open a file at the given URL using the specified mode.
The function uses the multistorageclient.StorageClient to open a file at the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.
- Parameters:
url (str)
mode (str)
kwargs (Any)
- Returns:
A file-like object that allows interaction with the file.
- Raises:
ValueError – If the URL’s protocol does not match the expected protocol msc.
- Return type:
PosixFile | ObjectFile
- multistorageclient.resolve_storage_client(url: str) tuple[StorageClient, str] [source]¶
Build and return a multistorageclient.StorageClient instance based on the provided URL or path.
This function parses the given URL or path and determines the appropriate storage profile and path. It supports URLs with the protocol msc://, as well as POSIX paths or file:// URLs for local file system access. If the profile has already been instantiated, it returns the cached client. Otherwise, it creates a new StorageClient and caches it.
The function also supports implicit profiles for non-MSC URLs. When a non-MSC URL is provided (like s3://, gs://, ais://, file://), MSC will infer the storage provider based on the URL protocol and create an implicit profile with the naming convention “_protocol-bucket” (e.g., “_s3-bucket1”, “_gs-bucket1”).
Path mappings defined in the MSC configuration are also applied before creating implicit profiles. This allows for explicit mappings between source paths and destination MSC profiles.
- Parameters:
url (str) – The storage location, which can be: a URL in the format msc://profile/path for object storage; a local file system path (absolute POSIX path) or a file:// URL; or a non-MSC URL with a supported protocol (s3://, gs://, ais://).
- Returns:
A tuple containing the multistorageclient.StorageClient instance and the parsed path.
- Raises:
ValueError – If the URL is not an msc URL, a valid local file system path, or a URL with a supported non-MSC protocol.
- Return type:
tuple[StorageClient, str]
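The “_protocol-bucket” naming convention for implicit profiles can be sketched with the standard library alone. implicit_profile is a hypothetical helper, not part of the library, and it only covers bucket-style URLs; how file:// URLs and path mappings are handled is not modeled here.

```python
from urllib.parse import urlparse


def implicit_profile(url: str) -> tuple[str, str]:
    """Derive an implicit profile name and object path from a bucket-style URL."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"expected a URL of the form protocol://bucket/path, got {url!r}")
    # Naming convention from the reference: "_" + protocol + "-" + bucket.
    return f"_{parsed.scheme}-{parsed.netloc}", parsed.path.lstrip("/")


print(implicit_profile("s3://bucket1/datasets/train.tar"))
```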
- multistorageclient.set_telemetry(telemetry: Telemetry | None) None [source]¶
Set the Telemetry instance to use for storage clients created by shortcuts.
- Parameters:
telemetry (Telemetry | None) – A telemetry instance.
- Return type:
None
- multistorageclient.sync(source_url: str, target_url: str, delete_unmatched_files: bool = False) None [source]¶
Syncs files from the source storage to the target storage.
- multistorageclient.upload_file(url: str, local_path: str) None [source]¶
Upload a file to the given URL from a local path.
The function uses the multistorageclient.StorageClient to upload a file (object) to the provided path. The URL is parsed, and the corresponding multistorageclient.StorageClient is retrieved or built.
- Parameters:
url (str)
local_path (str)
- Raises:
ValueError – If the URL’s protocol does not match the expected protocol msc.
- Return type:
None
Types¶
- class multistorageclient.types.Credentials(access_key: str, secret_key: str, token: str | None, expiration: str | None, custom_fields: dict[str, ~typing.Any] = <factory>)[source]¶
A data class representing the credentials needed to access a storage provider.
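- Parameters:
access_key (str)
secret_key (str)
token (str | None)
expiration (str | None)
custom_fields (dict[str, Any])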
- get_custom_field(key: str, default: Any | None = None) Any [source]¶
Retrieves a value from custom fields by its key.
- class multistorageclient.types.CredentialsProvider[source]¶
Abstract base class for providing credentials to access a storage provider.
- abstract get_credentials() Credentials [source]¶
Retrieves the current credentials.
- Returns:
The current credentials used for authentication.
- Return type:
Credentials
- class multistorageclient.types.MetadataProvider[source]¶
Abstract base class for accessing file metadata.
- abstract add_file(path: str, metadata: ObjectMetadata) None [source]¶
Add a file to be tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).
- Parameters:
path (str) – User-supplied virtual path
metadata (ObjectMetadata) – physical file metadata from StorageProvider
- Return type:
None
- abstract commit_updates() None [source]¶
Commit any newly added files, used in conjunction with MetadataProvider.add_file(). MetadataProvider will persistently record any metadata changes.
- Return type:
None
- abstract get_object_metadata(path: str, include_pending: bool = False) ObjectMetadata [source]¶
Retrieves metadata or information about an object stored in the provider.
- Parameters:
path (str)
include_pending (bool)
- Returns:
A metadata object containing the information about the object.
- Return type:
ObjectMetadata
- abstract glob(pattern: str) list[str] [source]¶
Matches and retrieves a list of object keys in the storage provider that match the specified pattern.
- abstract is_writable() bool [source]¶
Returns True if the MetadataProvider supports writes, else False.
- Return type:
bool
- abstract list_objects(prefix: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False) Iterator[ObjectMetadata] [source]¶
Lists objects in the storage provider under the specified prefix.
- Parameters:
prefix (str) – The prefix or path to list objects under.
start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.
end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.
include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.
- Returns:
An iterator over object metadata under the specified prefix.
- Return type:
Iterator[ObjectMetadata]
- abstract realpath(path: str) tuple[str, bool] [source]¶
Returns the canonical, full real physical path for use by a StorageProvider. This provides translation from user-visible paths to the canonical paths needed by a StorageProvider.
- abstract remove_file(path: str) None [source]¶
Remove a file tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).
- Parameters:
path (str) – User-supplied virtual path
- Return type:
None
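The contract described for add_file, remove_file, and commit_updates (idempotent calls, changes visible only after a commit) can be illustrated with a small in-memory sketch. InMemoryMetadataTracker and its visible_paths method are hypothetical; a real MetadataProvider implements the full abstract interface and persists its state.

```python
from typing import Any


class InMemoryMetadataTracker:
    """Illustrative stand-in that stages changes until commit_updates() is called."""

    def __init__(self) -> None:
        self._committed: dict[str, Any] = {}
        self._pending_adds: dict[str, Any] = {}
        self._pending_removes: set[str] = set()

    def add_file(self, path: str, metadata: Any) -> None:
        # Repeating the same call leaves the staged state unchanged.
        self._pending_removes.discard(path)
        self._pending_adds[path] = metadata

    def remove_file(self, path: str) -> None:
        # Removing twice, or removing an unknown path, is harmless.
        self._pending_adds.pop(path, None)
        if path in self._committed:
            self._pending_removes.add(path)

    def commit_updates(self) -> None:
        self._committed.update(self._pending_adds)
        for path in self._pending_removes:
            self._committed.pop(path, None)
        self._pending_adds.clear()
        self._pending_removes.clear()

    def visible_paths(self, include_pending: bool = False) -> list[str]:
        paths = set(self._committed)
        if include_pending:
            paths |= set(self._pending_adds)
            paths -= self._pending_removes
        return sorted(paths)


tracker = InMemoryMetadataTracker()
tracker.add_file("data/a.bin", {"content_length": 3})
tracker.add_file("data/a.bin", {"content_length": 3})  # duplicate call
before_commit = tracker.visible_paths()
tracker.commit_updates()
after_commit = tracker.visible_paths()
print(before_commit, after_commit)
```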
- exception multistorageclient.types.NotModifiedError[source]¶
Raised when a conditional operation fails because the resource has not been modified.
This typically occurs when using if-none-match with a specific generation/etag and the resource’s current generation/etag matches the specified one.
- class multistorageclient.types.ObjectMetadata(key: str, content_length: int, last_modified: datetime, type: str = 'file', content_type: str | None = None, etag: str | None = None, storage_class: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
A data class that represents the metadata associated with an object stored in a cloud storage service. This metadata includes both required and optional information about the object.
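- Parameters:
key (str)
content_length (int)
last_modified (datetime)
type (str)
content_type (str | None)
etag (str | None)
storage_class (str | None)
metadata (dict[str, Any] | None)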
- static from_dict(data: dict) ObjectMetadata [source]¶
Creates an ObjectMetadata instance from a dictionary (parsed from JSON).
- Parameters:
data (dict)
- Return type:
ObjectMetadata
- exception multistorageclient.types.PreconditionFailedError[source]¶
Exception raised when a precondition (e.g. if-match or if-none-match) fails.
- class multistorageclient.types.ProviderBundle[source]¶
Abstract base class that serves as a container for various providers (storage, credentials, and metadata) that interact with a storage service. The ProviderBundle abstracts access to these providers, allowing for flexible implementations of cloud storage solutions.
- abstract property credentials_provider: CredentialsProvider | None¶
- Returns:
The credentials provider responsible for managing authentication credentials required to access the storage service.
- abstract property metadata_provider: MetadataProvider | None¶
- Returns:
The metadata provider responsible for retrieving metadata about objects in the storage service.
- abstract property storage_provider_config: StorageProviderConfig¶
- Returns:
The configuration for the storage provider, which includes the provider name/type and additional options.
- class multistorageclient.types.RetryConfig(attempts: int = 3, delay: float = 1.0)[source]¶
A data class that represents the configuration for retry strategy.
- exception multistorageclient.types.RetryableError[source]¶
Exception raised for errors that should trigger a retry.
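How attempts and delay might drive a retry loop can be sketched as below. This is an assumption-laden illustration: the reference does not say whether the library waits a fixed delay or backs off, so the sketch uses a fixed delay, and both RetryableError and call_with_retry here are local stand-ins rather than the library's own objects.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Local stand-in for multistorageclient.types.RetryableError."""


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on RetryableError up to attempts times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)  # fixed delay between attempts (assumption)
    raise AssertionError("unreachable")


# Simulated operation that fails twice before succeeding.
outcomes = iter([RetryableError("busy"), RetryableError("busy"), "ok"])


def flaky() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_retry(flaky, attempts=3, delay=0.0)
print(result)
```

Errors other than RetryableError propagate immediately, which matches the idea that only errors flagged as retryable should trigger another attempt.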
- class multistorageclient.types.SourceVersionCheckMode(value)[source]¶
Enum for controlling source version checking behavior.
- DISABLE = 'disable'¶
- ENABLE = 'enable'¶
- INHERIT = 'inherit'¶
- class multistorageclient.types.StorageProvider[source]¶
Abstract base class for interacting with a storage provider.
- abstract copy_object(src_path: str, dest_path: str) None [source]¶
Copies an object from source to destination in the storage provider.
- abstract delete_object(path: str, if_match: str | None = None) None [source]¶
Deletes an object from the storage provider.
- abstract download_file(remote_path: str, f: str | IO, metadata: ObjectMetadata | None = None) None [source]¶
Downloads a file from the storage provider to the local file system.
- Parameters:
remote_path (str) – The path of the file to download.
f (str | IO) – The destination for the downloaded file. This can either be a string representing the local file path where the file will be saved, or a file-like object to write the downloaded content into.
metadata (ObjectMetadata | None) – Metadata about the object to download.
- Return type:
None
- abstract get_object(path: str, byte_range: Range | None = None) bytes [source]¶
Retrieves an object from the storage provider.
- abstract get_object_metadata(path: str, strict: bool = True) ObjectMetadata [source]¶
Retrieves metadata or information about an object stored in the provider.
- Parameters:
path (str)
strict (bool)
- Returns:
A metadata object containing the information about the object.
- Return type:
ObjectMetadata
- abstract glob(pattern: str) list[str] [source]¶
Matches and retrieves a list of object keys in the storage provider that match the specified pattern.
- abstract is_file(path: str) bool [source]¶
Checks whether the specified key in the storage provider points to a file (as opposed to a folder or directory).
- abstract list_objects(prefix: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False) Iterator[ObjectMetadata] [source]¶
Lists objects in the storage provider under the specified prefix.
- Parameters:
prefix (str) – The prefix or path to list objects under.
start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.
end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.
include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.
- Returns:
An iterator over object metadata under the specified prefix.
- Return type:
Iterator[ObjectMetadata]
- abstract put_object(path: str, body: bytes, metadata: dict[str, str] | None = None, if_match: str | None = None, if_none_match: str | None = None) None [source]¶
Uploads an object to the storage provider.
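The if_match and if_none_match parameters follow the pattern of HTTP conditional requests. The sketch below models that pattern with an in-memory store. It is an assumption, not the library's behavior: InMemoryStore is hypothetical, it returns the new etag for convenience (the real put_object returns None), only if_none_match="*" is modeled, and the exception class is a local stand-in.

```python
import hashlib
from typing import Optional


class PreconditionFailedError(Exception):
    """Local stand-in for multistorageclient.types.PreconditionFailedError."""


class InMemoryStore:
    """Toy object store that enforces etag preconditions on writes."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[bytes, str]] = {}

    def put_object(
        self,
        path: str,
        body: bytes,
        if_match: Optional[str] = None,
        if_none_match: Optional[str] = None,
    ) -> str:
        current = self._objects.get(path)
        current_etag = current[1] if current is not None else None
        # if_match: only overwrite the version the caller has seen.
        if if_match is not None and if_match != current_etag:
            raise PreconditionFailedError(f"etag mismatch for {path}")
        # if_none_match="*": only create, never overwrite.
        if if_none_match == "*" and current is not None:
            raise PreconditionFailedError(f"{path} already exists")
        etag = hashlib.sha256(body).hexdigest()
        self._objects[path] = (body, etag)
        return etag


store = InMemoryStore()
first_etag = store.put_object("a.txt", b"v1", if_none_match="*")
second_etag = store.put_object("a.txt", b"v2", if_match=first_etag)
print(first_etag != second_etag)
```

Passing the etag from an earlier read as if_match gives optimistic concurrency: a writer that raced with another update fails instead of silently overwriting it.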
- class multistorageclient.types.StorageProviderConfig(type: str, options: dict[str, Any] | None = None)[source]¶
A data class that represents the configuration needed to initialize a storage provider.
Providers¶
- class multistorageclient.providers.posix_file.PosixFileStorageProvider(base_path: str, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]¶
A concrete implementation of the multistorageclient.types.StorageProvider for interacting with POSIX file systems.
- Parameters:
base_path (str) – The root prefix path within the POSIX file system where all operations will be scoped.
metric_counters (dict[CounterName, Counter]) – Metric counters.
metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.
metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.
kwargs (Any)
- glob(pattern: str) list[str] [source]¶
Matches and retrieves a list of object keys in the storage provider that match the specified pattern.
- multistorageclient.providers.posix_file.atomic_write(source: str | IO, destination: str)[source]¶
Writes the contents of a file to the specified destination path.
This function ensures that the file write operation is atomic, meaning the output file is either fully written or not modified at all. This is achieved by writing to a temporary file first and then renaming it to the destination path.
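The write-to-temporary-then-rename approach described above can be sketched with the standard library. atomic_write_bytes is a hypothetical helper that takes bytes rather than the str | IO source accepted by atomic_write, and it is not the library's implementation.

```python
import contextlib
import os
import tempfile


def atomic_write_bytes(data: bytes, destination: str) -> None:
    """Write data so that destination is either fully replaced or left untouched."""
    # The temporary file must live on the same file system for the rename to be atomic.
    directory = os.path.dirname(os.path.abspath(destination))
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as temp_file:
            temp_file.write(data)
            temp_file.flush()
            os.fsync(temp_file.fileno())
        os.replace(temp_path, destination)
    except BaseException:
        # Clean up the partial temporary file and re-raise.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(temp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "out.bin")
    atomic_write_bytes(b"first", target)
    atomic_write_bytes(b"second", target)
    with open(target, "rb") as handle:
        content = handle.read()
    leftovers = os.listdir(workdir)
print(content, leftovers)
```

Readers that open the destination during a write see either the old content or the new content, never a partially written file.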
- class multistorageclient.providers.manifest_metadata.Manifest(version: str, parts: list[ManifestPartReference])[source]¶
A data class representing a dataset manifest.
- Parameters:
version (str)
parts (list[ManifestPartReference])
- static from_dict(data: dict) Manifest [source]¶
Creates a Manifest instance from a dictionary (parsed from JSON).
- parts: list[ManifestPartReference]¶
References to manifest parts.
- class multistorageclient.providers.manifest_metadata.ManifestMetadataProvider(storage_provider: StorageProvider, manifest_path: str, writable: bool = False)[source]¶
Creates a ManifestMetadataProvider.
- Parameters:
storage_provider (StorageProvider) – Storage provider.
manifest_path (str) – Main manifest file path.
writable (bool) – If true, allows modifications and new manifests to be written.
- add_file(path: str, metadata: ObjectMetadata) None [source]¶
Add a file to be tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).
- Parameters:
path (str) – User-supplied virtual path
metadata (ObjectMetadata) – physical file metadata from StorageProvider
- Return type:
None
- commit_updates() None [source]¶
Commit any newly added files, used in conjunction with MetadataProvider.add_file(). MetadataProvider will persistently record any metadata changes.
- Return type:
None
- get_object_metadata(path: str, include_pending: bool = False) ObjectMetadata [source]¶
Retrieves metadata or information about an object stored in the provider.
- Parameters:
path (str)
include_pending (bool)
- Returns:
A metadata object containing the information about the object.
- Return type:
ObjectMetadata
- glob(pattern: str) list[str] [source]¶
Matches and retrieves a list of object keys in the storage provider that match the specified pattern.
- is_writable() bool [source]¶
Returns True if the MetadataProvider supports writes, else False.
- Return type:
bool
- list_objects(prefix: str, start_after: str | None = None, end_at: str | None = None, include_directories: bool = False) Iterator[ObjectMetadata] [source]¶
Lists objects in the storage provider under the specified prefix.
- Parameters:
prefix (str) – The prefix or path to list objects under.
start_after (str | None) – The key to start after (i.e. exclusive). An object with this key doesn’t have to exist.
end_at (str | None) – The key to end at (i.e. inclusive). An object with this key doesn’t have to exist.
include_directories (bool) – Whether to include directories in the result. When True, directories are returned alongside objects.
- Returns:
An iterator over object metadata under the specified prefix.
- Return type:
Iterator[ObjectMetadata]
- realpath(path: str) tuple[str, bool] [source]¶
Returns the canonical, full real physical path for use by a StorageProvider. This provides translation from user-visible paths to the canonical paths needed by a StorageProvider.
- remove_file(path: str) None [source]¶
Remove a file tracked by the MetadataProvider. Does not have to be reflected in listing until a MetadataProvider.commit_updates() forces a persist. This function must tolerate duplicate calls (idempotent behavior).
- Parameters:
path (str) – User-supplied virtual path
- Return type:
None
- class multistorageclient.providers.manifest_metadata.ManifestPartReference(path: str)[source]¶
A data class representing a reference to a dataset manifest part.
- Parameters:
path (str)
- class multistorageclient.providers.ais.AIStoreStorageProvider(endpoint: str = '', provider: str = 'ais', skip_verify: bool = True, ca_cert: str | None = None, timeout: float | tuple[float, float] | None = None, retry: dict[str, Any] | None = None, base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]¶
A concrete implementation of the multistorageclient.types.StorageProvider for interacting with NVIDIA AIStore.
AIStore client for managing buckets, objects, and ETL jobs.
- Parameters:
endpoint (str) – The AIStore endpoint.
skip_verify (bool) – Whether to skip SSL certificate verification.
ca_cert (str | None) – Path to a CA certificate file for SSL verification.
timeout (float | tuple[float, float] | None) – Request timeout in seconds; a single float for both connect/read timeouts (e.g., 5.0), a tuple for separate connect/read timeouts (e.g., (3.0, 10.0)), or None to disable timeout.
retry (dict[str, Any] | None) – urllib3.util.Retry parameters.
token – Authorization token. If not provided, the AIS_AUTHN_TOKEN environment variable will be used.
base_path (str) – The root prefix path within the bucket where all operations will be scoped.
credentials_provider (CredentialsProvider | None) – The provider to retrieve AIStore credentials.
metric_counters (dict[CounterName, Counter]) – Metric counters.
metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.
metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.
provider (str)
kwargs (Any)
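The three accepted forms of the timeout parameter can be made concrete with a small normalizing helper. normalize_timeout is hypothetical and only illustrates the float, tuple, and None cases described above; it is not how the provider processes the value internally.

```python
from typing import Optional, Union

Timeout = Union[float, tuple[float, float], None]


def normalize_timeout(timeout: Timeout) -> Optional[tuple[float, float]]:
    """Return (connect, read) timeouts in seconds, or None when timeouts are disabled."""
    if timeout is None:
        return None
    if isinstance(timeout, tuple):
        connect, read = timeout
        return float(connect), float(read)
    # A single number applies to both the connect and the read phase.
    return float(timeout), float(timeout)


print(normalize_timeout(5.0), normalize_timeout((3.0, 10.0)), normalize_timeout(None))
```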
- class multistorageclient.providers.ais.StaticAISCredentialProvider(username: str | None = None, password: str | None = None, authn_endpoint: str | None = None, token: str | None = None, skip_verify: bool = True, ca_cert: str | None = None)[source]¶
A concrete implementation of the multistorageclient.types.CredentialsProvider that provides static AIStore credentials.
Initializes the StaticAISCredentialProvider with the given credentials.
- Parameters:
username (str | None) – The username for the AIStore authentication.
password (str | None) – The password for the AIStore authentication.
authn_endpoint (str | None) – The AIStore authentication endpoint.
token (str | None) – The AIStore authentication token. This is used for authentication if username, password and authn_endpoint are not provided.
skip_verify (bool) – If true, skip SSL certificate verification.
ca_cert (str | None) – Path to a CA certificate file for SSL verification.
- get_credentials() Credentials [source]¶
Retrieves the current credentials.
- Returns:
The current credentials used for authentication.
- Return type:
Credentials
- class multistorageclient.providers.azure.AzureBlobStorageProvider(endpoint_url: str, base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: dict[str, Any])[source]¶
A concrete implementation of the multistorageclient.types.StorageProvider for interacting with Azure Blob Storage.
Initializes the AzureBlobStorageProvider with the endpoint URL and optional credentials provider.
- Parameters:
endpoint_url (str) – The Azure storage account URL.
base_path (str) – The root prefix path within the container where all operations will be scoped.
credentials_provider (CredentialsProvider | None) – The provider to retrieve Azure credentials.
metric_counters (dict[CounterName, Counter]) – Metric counters.
metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.
metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.
- class multistorageclient.providers.azure.StaticAzureCredentialsProvider(connection: str)[source]¶
A concrete implementation of the multistorageclient.types.CredentialsProvider that provides static Azure credentials.
Initializes the StaticAzureCredentialsProvider with the provided connection string.
- Parameters:
connection (str) – The connection string for Azure Blob Storage authentication.
- get_credentials() Credentials [source]¶
Retrieves the current credentials.
- Returns:
The current credentials used for authentication.
- Return type:
Credentials
- class multistorageclient.providers.gcs.GoogleIdentityPoolCredentialsProvider(audience: str, token_supplier: str)[source]¶
A concrete implementation of the multistorageclient.types.CredentialsProvider that provides Google’s identity pool credentials.
Initializes the GoogleIdentityPoolCredentialsProvider with the audience and token supplier.
- Parameters:
audience (str)
token_supplier (str)
- get_credentials() Credentials [source]¶
Retrieves the current credentials.
- Returns:
The current credentials used for authentication.
- Return type:
Credentials
- class multistorageclient.providers.gcs.GoogleStorageProvider(project_id: str = '', endpoint_url: str = '', base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]¶
A concrete implementation of the
multistorageclient.types.StorageProvider
for interacting with Google Cloud Storage.
Initializes the
GoogleStorageProvider
with the project ID and optional credentials provider.
- Parameters:
project_id (str) – The Google Cloud project ID.
endpoint_url (str) – The custom endpoint URL for the GCS service.
base_path (str) – The root prefix path within the bucket where all operations will be scoped.
credentials_provider (CredentialsProvider | None) – The provider to retrieve GCS credentials.
metric_counters (dict[CounterName, Counter]) – Metric counters.
metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.
metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.
kwargs (Any)
- class multistorageclient.providers.gcs.StringTokenSupplier(token: str)[source]¶
Supply a string token to the Google Identity Pool.
- Parameters:
token (str)
- get_subject_token(context, request)[source]¶
Returns the requested subject token. The subject token must be valid.
- Args:
- context (google.auth.externalaccount.SupplierContext): The context object
containing information about the requested audience and subject token type.
- request (google.auth.transport.Request): The object used to make
HTTP requests.
- Raises:
- google.auth.exceptions.RefreshError: If an error is encountered during
subject token retrieval logic.
- Returns:
str: The requested subject token string.
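As a rough illustration of the supplier contract described above, the sketch below is a duck-typed stand-in that returns a fixed token from get_subject_token. The class name FixedTokenSupplier and the empty-token check are assumptions made for this example. The library class presumably builds on the supplier interface from google-auth, which is deliberately not imported here.

```python
class FixedTokenSupplier:
    """Hypothetical stand-in for a subject token supplier (not the library class)."""

    def __init__(self, token: str) -> None:
        # Assumption for this sketch: reject empty tokens early.
        if not token:
            raise ValueError("token must be a non-empty string")
        self._token = token

    def get_subject_token(self, context, request) -> str:
        # context and request are accepted to match the documented signature,
        # but a fixed token needs neither of them.
        return self._token


supplier = FixedTokenSupplier("example-subject-token")
print(supplier.get_subject_token(context=None, request=None))
```

A supplier that fetches tokens dynamically would use the request argument to make HTTP calls and raise a refresh error on failure, as the entry above describes.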
- class multistorageclient.providers.gcs_s3.GoogleS3StorageProvider(*args, **kwargs)[source]¶
A concrete implementation of the
multistorageclient.types.StorageProvider
for interacting with GCS via its S3 interface.
Initializes the
S3StorageProvider
with the region, endpoint URL, and optional credentials provider.
- Parameters:
region_name – The AWS region where the S3 bucket is located.
endpoint_url – The custom endpoint URL for the S3 service.
base_path – The root prefix path within the S3 bucket where all operations will be scoped.
credentials_provider – The provider to retrieve S3 credentials.
metric_counters – Metric counters.
metric_gauges – Metric gauges.
metric_attributes_providers – Metric attributes providers.
- class multistorageclient.providers.oci.OracleStorageProvider(namespace: str, base_path: str = '', credentials_provider: CredentialsProvider | None = None, retry_strategy: dict[str, Any] | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]¶
A concrete implementation of the
multistorageclient.types.StorageProvider
for interacting with Oracle Cloud Infrastructure (OCI) Object Storage.
Initializes an instance of
OracleStorageProvider
.
- Parameters:
namespace (str) – The OCI Object Storage namespace. This is a unique identifier assigned to each tenancy.
base_path (str) – The root prefix path within the bucket where all operations will be scoped.
credentials_provider (CredentialsProvider | None) – The provider to retrieve OCI credentials.
retry_strategy (dict[str, Any] | None) –
oci.retry.RetryStrategyBuilder
parameters.
metric_counters (dict[CounterName, Counter]) – Metric counters.
metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.
metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.
kwargs (Any)
- class multistorageclient.providers.s3.S3StorageProvider(region_name: str = '', endpoint_url: str = '', base_path: str = '', credentials_provider: CredentialsProvider | None = None, metric_counters: dict[CounterName, Counter] = {}, metric_gauges: dict[GaugeName, Gauge] = {}, metric_attributes_providers: Sequence[AttributesProvider] = (), **kwargs: Any)[source]¶
A concrete implementation of the
multistorageclient.types.StorageProvider
for interacting with Amazon S3 or S3-compatible object stores.
Initializes the
S3StorageProvider
with the region, endpoint URL, and optional credentials provider.
- Parameters:
region_name (str) – The AWS region where the S3 bucket is located.
endpoint_url (str) – The custom endpoint URL for the S3 service.
base_path (str) – The root prefix path within the S3 bucket where all operations will be scoped.
credentials_provider (CredentialsProvider | None) – The provider to retrieve S3 credentials.
metric_counters (dict[CounterName, Counter]) – Metric counters.
metric_gauges (dict[GaugeName, Gauge]) – Metric gauges.
metric_attributes_providers (Sequence[AttributesProvider]) – Metric attributes providers.
kwargs (Any)
- class multistorageclient.providers.s3.StaticS3CredentialsProvider(access_key: str, secret_key: str, session_token: str | None = None)[source]¶
A concrete implementation of the
multistorageclient.types.CredentialsProvider
that provides static S3 credentials.
Initializes the
StaticS3CredentialsProvider
with the provided access key, secret key, and optional session token.
- Parameters:
- get_credentials() Credentials [source]¶
Retrieves the current credentials.
- Returns:
The current credentials used for authentication.
- Return type:
Credentials
- class multistorageclient.providers.s8k.S8KStorageProvider(*args, **kwargs)[source]¶
A concrete implementation of the
multistorageclient.types.StorageProvider
for interacting with SwiftStack.
Initializes the
S3StorageProvider
with the region, endpoint URL, and optional credentials provider.
- Parameters:
region_name – The AWS region where the S3 bucket is located.
endpoint_url – The custom endpoint URL for the S3 service.
base_path – The root prefix path within the S3 bucket where all operations will be scoped.
credentials_provider – The provider to retrieve S3 credentials.
metric_counters – Metric counters.
metric_gauges – Metric gauges.
metric_attributes_providers – Metric attributes providers.
Telemetry¶
- class multistorageclient.telemetry.Telemetry[source]¶
Provides telemetry resources.
Instances shouldn’t be copied between processes. Not fork-safe or pickleable.
Instances can be shared between processes by registering with a
multiprocessing.managers.BaseManager
and using proxy objects.
- class multistorageclient.telemetry.TelemetryManager(address=None, authkey=None, serializer='pickle', ctx=None)[source]¶
A
multiprocessing.managers.BaseManager
for telemetry resources.
The OpenTelemetry Python SDK isn’t fork-safe since telemetry sample buffers can be duplicated.
In addition, Python ≤3.12 doesn’t call exit handlers for forked processes. This causes the OpenTelemetry Python SDK to not flush telemetry before exiting.
https://github.com/open-telemetry/opentelemetry-python/issues/4215
https://github.com/open-telemetry/opentelemetry-python/issues/3307
Forking is multiprocessing’s default start method for non-macOS POSIX systems until Python 3.14.
To fully support multiprocessing, resampling + publishing is handled by a single process that’s (ideally) a child of (i.e. directly under) the main process. This:
- Relieves other processes of this work.
- Avoids issues with duplicate samples when forking and unpublished samples when exiting forks.
- Allows cross-process resampling.
- Reuses a single connection pool to telemetry backends.
The downside is that this essentially reintroduces a global interpreter lock (GIL), with additional IPC overhead. Telemetry operations, however, should be lightweight, so this isn’t expected to be a problem. Remote data store latency should still be the primary throughput limiter for storage clients.
multiprocessing.managers.BaseManager
is used for this since it creates a separate server process for shared objects.
Telemetry resources are provided as proxy objects for location transparency.
The documentation isn’t particularly detailed, but others have written comprehensively on this:
By specification, metric and tracer providers must call shutdown on any underlying metric readers + span processors + exporters.
In the OpenTelemetry Python SDK, provider shutdown is called automatically by exit handlers (when they work at least). Consequently, clients should:
- Only receive proxy objects.
  - Enables metric reader + span processor + exporter re-use across processes.
- Never call shutdown on the proxy objects.
  - The shutdown exit handler is registered on the manager’s server process.
⚠️ We expect a finite number of providers (i.e. no dynamic configs) so we don’t leak them.
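The single-server-process design described above can be sketched with the standard library alone. The example below is a minimal illustration, not the library's TelemetryManager: SampleBuffer, SketchManager, worker, and demo are hypothetical names, and the "fork" start method is assumed to be available (POSIX only) so the sketch runs as a single script.

```python
import multiprocessing
from multiprocessing.managers import BaseManager


class SampleBuffer:
    """Hypothetical telemetry resource that lives only in the server process."""

    def __init__(self) -> None:
        self._samples: list[float] = []

    def record(self, value: float) -> None:
        self._samples.append(value)

    def count(self) -> int:
        return len(self._samples)


class SketchManager(BaseManager):
    """Hypothetical manager for this sketch; not the library's TelemetryManager."""


# Calls to manager.SampleBuffer() create the object in the server process
# and hand back a proxy whose public methods are forwarded over IPC.
SketchManager.register("SampleBuffer", SampleBuffer)


def worker(buffer) -> None:
    # The worker only ever sees the proxy, never the real buffer.
    buffer.record(2.0)


def demo() -> int:
    # Assumption: the "fork" start method is available (POSIX only). It is
    # chosen here only so the sketch runs as a single script.
    ctx = multiprocessing.get_context("fork")
    manager = SketchManager(ctx=ctx)
    manager.start()
    try:
        buffer = manager.SampleBuffer()
        buffer.record(1.0)
        child = ctx.Process(target=worker, args=(buffer,))
        child.start()
        child.join()
        # Samples from both processes landed in the one server-side buffer.
        return buffer.count()
    finally:
        manager.shutdown()


if __name__ == "__main__":
    print(demo())
```

Because every process talks to the same server-side object, nothing is duplicated when a worker forks, and only the server process needs to flush on exit.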
- class multistorageclient.telemetry.TelemetryMode(value)[source]¶
How to create a
Telemetry
object.
- CLIENT = 'client'¶
Connect to a telemetry IPC server.
- LOCAL = 'local'¶
Keep everything local to the process (not fork-safe).
- SERVER = 'server'¶
Start + connect to a telemetry IPC server.
- multistorageclient.telemetry.init(mode: TelemetryMode = TelemetryMode.SERVER, address: str | tuple[str, int] | None = None) Telemetry [source]¶
Create or return an existing
Telemetry
instance or
Telemetry
proxy object.
- Parameters:
mode (TelemetryMode) – How to create a
Telemetry
object.
address (str | tuple[str, int] | None) – Telemetry IPC server address. Passed directly to a
multiprocessing.managers.BaseManager
. Ignored if the mode is
TelemetryMode.LOCAL
.
- Returns:
A telemetry instance.
- Return type:
Telemetry
Attributes¶
- class multistorageclient.telemetry.attributes.base.AttributesProvider[source]¶
Provides
opentelemetry.util.types.Attributes
.
- class multistorageclient.telemetry.attributes.environment_variables.EnvironmentVariablesAttributesProvider(attributes: Mapping[str, str])[source]¶
Provides
opentelemetry.util.types.Attributes
from environment variables.
- class multistorageclient.telemetry.attributes.host.HostAttributesProvider(attributes: Mapping[str, str])[source]¶
Provides
opentelemetry.util.types.Attributes
from host information.
- class multistorageclient.telemetry.attributes.msc_config.MSCConfigAttributesProvider(attributes: Mapping[str, AttributeValueOptions], config_dict: Mapping[str, Any])[source]¶
Provides
opentelemetry.util.types.Attributes
from a multi-storage client configuration.
- Parameters:
- class multistorageclient.telemetry.attributes.process.ProcessAttributesProvider(attributes: Mapping[str, str])[source]¶
Provides
opentelemetry.util.types.Attributes
from current process information.
- class multistorageclient.telemetry.attributes.static.StaticAttributesProvider(attributes: Mapping[str, str | bool | int | float | Sequence[str] | Sequence[bool] | Sequence[int] | Sequence[float]] | None)[source]¶
Provides
opentelemetry.util.types.Attributes
from static attributes.
- class multistorageclient.telemetry.attributes.thread.ThreadAttributesProvider(attributes: Mapping[str, str])[source]¶
Provides
opentelemetry.util.types.Attributes
from current thread information.
Metrics¶
Readers¶
- class multistorageclient.telemetry.metrics.readers.diperiodic_exporting.DiperiodicExportingMetricReader(exporter: MetricExporter, collect_interval_millis: float | None = None, collect_timeout_millis: float | None = None, export_interval_millis: float | None = None, export_timeout_millis: float | None = None)[source]¶
opentelemetry.sdk.metrics.export.MetricReader
that collects + exports metrics on separate user-configurable time intervals. This is in contrast with
opentelemetry.sdk.metrics.export.PeriodicExportingMetricReader
which couples them with a 1 minute default.
The metrics collection interval limits the temporal resolution. Most metric backends have 1 millisecond or finer temporal resolution.
- Parameters:
exporter (MetricExporter) – Metrics exporter.
collect_interval_millis (float | None) – Collect interval in milliseconds.
collect_timeout_millis (float | None) – Collect timeout in milliseconds.
export_interval_millis (float | None) – Export interval in milliseconds.
export_timeout_millis (float | None) – Export timeout in milliseconds.
- shutdown(timeout_millis: float = 40000, **kwargs) None [source]¶
Shuts down the MetricReader. This method provides a way for the MetricReader to do any cleanup required. A metric reader can only be shut down once; any subsequent calls are ignored and return a failure status.
When a MetricReader is registered on a
MeterProvider
,
shutdown()
will invoke this automatically.
- Parameters:
timeout_millis (float)
- Return type:
None
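To show why separate collect and export intervals matter for the DiperiodicExportingMetricReader described above, the sketch below simulates a clock in milliseconds and groups collection timestamps into export batches. It is a deterministic toy model, not the reader's implementation. The function name simulate_diperiodic and its parameters are assumptions made for this example.

```python
def simulate_diperiodic(
    collect_interval_ms: int, export_interval_ms: int, duration_ms: int
) -> list[list[int]]:
    """Return, for each export, the collection timestamps it carries."""
    batches: list[list[int]] = []
    pending: list[int] = []
    for now in range(1, duration_ms + 1):
        # Collect on the (short) collection interval.
        if now % collect_interval_ms == 0:
            pending.append(now)
        # Export everything collected so far on the (long) export interval.
        if now % export_interval_ms == 0:
            batches.append(pending)
            pending = []
    return batches


batches = simulate_diperiodic(
    collect_interval_ms=100, export_interval_ms=500, duration_ms=1000
)
for index, batch in enumerate(batches):
    print(f"export {index}: {len(batch)} collected points")
```

In this model each export carries five collected points, so the data keeps a 100 ms resolution while the exporter is only invoked every 500 ms. A reader that couples the two intervals would have to export five times as often to get the same resolution.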
Generators¶
- class multistorageclient.generators.ManifestMetadataGenerator[source]¶
Generates a file metadata manifest for use with a
multistorageclient.providers.ManifestMetadataProvider
.
- static generate_and_write_manifest(data_storage_client: StorageClient, manifest_storage_client: StorageClient, partition_keys: list[str] | None = None) None [source]¶
Generates a file metadata manifest.
The data storage client’s base path should be set to the root path for data objects (e.g.
my-bucket/my-data-prefix
).
The manifest storage client’s base path should be set to the root path for manifest objects (e.g.
my-bucket/my-manifest-prefix
).
The following manifest objects will be written with the manifest storage client (with the total number of manifest parts being variable):
.msc_manifests/
├── msc_manifest_index.json
└── parts/
    ├── msc_manifest_part000001.jsonl
    ├── ...
    └── msc_manifest_part999999.jsonl
- Parameters:
data_storage_client (StorageClient) – Storage client for reading data objects.
manifest_storage_client (StorageClient) – Storage client for writing manifest objects.
partition_keys (list[str] | None) – Optional list of keys to partition the listing operation. If provided, objects will be listed concurrently using these keys as boundaries.
- Return type:
None
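The partition_keys parameter above is described as a set of boundaries for concurrent listing. One plausible reading, sketched below against an in-memory list of object keys, is that N boundaries split the key space into N + 1 half-open ranges that are listed in parallel. The helper names (partition_ranges, list_range, list_concurrently) and the exact boundary semantics are assumptions for illustration, not the generator's actual code.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

KeyRange = tuple[Optional[str], Optional[str]]


def partition_ranges(partition_keys: list[str]) -> list[KeyRange]:
    """Turn boundary keys into half-open [start, end) ranges; None is unbounded."""
    bounds: list[Optional[str]] = [None, *sorted(partition_keys), None]
    return list(zip(bounds[:-1], bounds[1:]))


def list_range(sorted_keys: list[str], start: Optional[str], end: Optional[str]) -> list[str]:
    """Stand-in for a storage listing call restricted to one range."""
    lo = 0 if start is None else bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if end is None else bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


def list_concurrently(sorted_keys: list[str], partition_keys: list[str]) -> list[str]:
    ranges = partition_ranges(partition_keys)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # One listing task per range; map() preserves the range order.
        chunks = list(pool.map(lambda r: list_range(sorted_keys, *r), ranges))
    return [key for chunk in chunks for key in chunk]


objects = ["a/1", "b/1", "c/2", "m/0", "z/9"]
print(partition_ranges(["b", "m"]))
print(list_concurrently(objects, ["b", "m"]))
```

Boundaries that split the key space into roughly equal ranges give each listing task a similar amount of work.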
Higher-Level Libraries¶
fsspec¶
- class multistorageclient.contrib.async_fs.MultiStorageAsyncFileSystem(*args, **kwargs)[source]¶
Custom
fsspec.asyn.AsyncFileSystem
implementation for MSC protocol (
msc://
). Uses
multistorageclient.StorageClient
for backend operations.
Initializes the
MultiStorageAsyncFileSystem
.
- Parameters:
kwargs – Additional arguments for the
fsspec.asyn.AsyncFileSystem
.
- static asynchronize_sync(func: Callable[[...], Any], *args: Any, **kwargs: Any) Any [source]¶
Runs a synchronous function asynchronously using asyncio.
- cp_file(path1: str, path2: str, **kwargs: Any)[source]¶
Copies a file from the source path to the destination path.
- Parameters:
- Raises:
AttributeError – If the source and destination paths are associated with different profiles.
- get_file(rpath: str, lpath: str, **kwargs: Any) None [source]¶
Downloads a file from the remote path to the local path.
- ls(path: str, detail: bool = True, **kwargs: Any) list[dict[str, Any]] | list[str] [source]¶
Lists the contents of a directory.
- Parameters:
- Returns:
A list of file names or detailed information depending on the ‘detail’ argument.
- Return type:
list[dict[str, Any]] | list[str]
- open(path: str, mode: str = 'rb', **kwargs: Any) PosixFile | ObjectFile [source]¶
Opens a file at the given path.
- pipe_file(path: str, value: bytes, **kwargs: Any) None [source]¶
Writes a value (bytes) directly to a file at the given path.
- put_file(lpath: str, rpath: str, **kwargs: Any) None [source]¶
Uploads a local file to the remote path.
- resolve_path_and_storage_client(path: str | PathLike) tuple[StorageClient, str] [source]¶
Resolves the path and retrieves the associated
multistorageclient.StorageClient
.
- Parameters:
- Returns:
A tuple containing the
multistorageclient.StorageClient
and the resolved path.
- Return type:
tuple[StorageClient, str]
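The asynchronize_sync entry above runs a synchronous function asynchronously using asyncio. A common way to do this, shown here as a sketch, is to hand the call to the event loop's default thread pool executor. The helper name run_sync is hypothetical, and the library may use a different mechanism internally.

```python
import asyncio
import functools
from typing import Any, Callable


async def run_sync(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a blocking callable in the loop's default thread pool executor."""
    loop = asyncio.get_running_loop()
    # run_in_executor only forwards positional arguments, so bind the
    # keyword arguments with functools.partial first.
    call = functools.partial(func, *args, **kwargs)
    return await loop.run_in_executor(None, call)


async def main() -> list[int]:
    # sorted() stands in for a blocking storage call.
    return await run_sync(sorted, [3, 1, 2], reverse=True)


result = asyncio.run(main())
print(result)
```

Offloading to a thread keeps the event loop responsive while a blocking storage call is in flight.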
NumPy¶
- multistorageclient.contrib.numpy.load(*args: Any, **kwargs: Any) ndarray | dict[str, ndarray] | NpzFile [source]¶
Adapt
numpy.load
.
PyTorch¶
- class multistorageclient.contrib.torch.MultiStorageFileSystem[source]¶
A filesystem implementation that uses the MultiStoragePath class to handle paths.
- class multistorageclient.contrib.torch.MultiStorageFileSystemReader(path: str | PathLike, thread_count: int = 1)[source]¶
A reader implementation that uses the MultiStorageFileSystem class to handle file system operations.
Initialize the MultiStorageFileSystemReader with the MultiStorageFileSystem.
- Parameters:
- class multistorageclient.contrib.torch.MultiStorageFileSystemWriter(path: str | PathLike, single_file_per_rank: bool = True, sync_files: bool = True, thread_count: int = 1, per_thread_copy_ahead: int = 10000000, cache_staged_state_dict: bool = False, overwrite: bool = True)[source]¶
A writer implementation that uses the MultiStorageFileSystem class to handle file system operations.
Initialize the MultiStorageFileSystemWriter with the MultiStorageFileSystem.
- Parameters:
Xarray¶
- multistorageclient.contrib.xarray.open_zarr(*args: Any, **kwargs: Any) Dataset [source]¶
Adapt
xarray.open_zarr
to use
multistorageclient.contrib.zarr.LazyZarrStore
when path matches the
msc
protocol.
If the path starts with the MSC protocol, it uses
multistorageclient.contrib.zarr.LazyZarrStore
with a resolved storage client and prefix, passing
msc_max_workers
if provided. Otherwise, it directly calls
xarray.open_zarr
.
Zarr¶
- class multistorageclient.contrib.zarr.LazyZarrStore(storage_client: StorageClient, prefix: str = '', msc_max_workers: int | None = None)[source]¶
- Parameters:
storage_client (StorageClient)
prefix (str)
msc_max_workers (int | None)
- getitems(keys: Sequence[str], *, contexts: Any) Mapping[str, Any] [source]¶
Retrieve data from multiple keys.
Parameters¶
- keys: Iterable[str]
The keys to retrieve
- contexts: Mapping[str, Context]
A mapping of keys to their context. Each context is a mapping of store specific information. E.g. a context could be a dict telling the store the preferred output array type: {"meta_array": cupy.empty(())}
Returns¶
- Mapping
A collection mapping the input keys to their results.
Notes¶
This default implementation uses __getitem__() to read each key sequentially and ignores contexts. Override this method to implement concurrent reads of multiple keys and/or to utilize the contexts.
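The note above suggests overriding getitems to read several keys concurrently. The sketch below shows one way to do that with a thread pool over a dict-backed store. ConcurrentStoreSketch is a hypothetical class, not LazyZarrStore, and omitting missing keys from the result is an assumption made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Mapping, Optional, Sequence


class ConcurrentStoreSketch:
    """Hypothetical dict-backed store that reads multiple keys in parallel."""

    def __init__(self, data: dict[str, bytes], max_workers: int = 4) -> None:
        self._data = data
        self._max_workers = max_workers

    def __getitem__(self, key: str) -> bytes:
        # In a real store this would be a remote read.
        return self._data[key]

    def getitems(self, keys: Sequence[str], *, contexts: Any = None) -> Mapping[str, Any]:
        def fetch(key: str) -> tuple[str, Optional[bytes]]:
            try:
                return key, self[key]
            except KeyError:
                return key, None

        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            results = list(pool.map(fetch, keys))
        # Assumption: keys that are not present are left out of the result.
        return {key: value for key, value in results if value is not None}


store = ConcurrentStoreSketch({"a": b"1", "b": b"2"})
print(store.getitems(["a", "missing", "b"], contexts={}))
```

A thread pool suits this case because each read mostly waits on I/O. A worker limit such as msc_max_workers would bound how many reads are in flight at once.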
- multistorageclient.contrib.zarr.open_consolidated(*args: Any, **kwargs: Any) Group [source]¶
Adapt
zarr.open_consolidated
to use
LazyZarrStore
when path matches the
msc
protocol.
If the path starts with the MSC protocol, it uses
LazyZarrStore
with a resolved storage client and prefix, passing
msc_max_workers
if provided. Otherwise, it directly calls
zarr.open_consolidated
.
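The dispatch rule shared by the open_zarr and open_consolidated adapters can be sketched as follows. This is an illustration under stated assumptions: open_with_dispatch, fake_msc_open, and fake_fallback_open are hypothetical names, and splitting the URL into a profile (the host part) and a prefix (the path part) is an assumed layout for msc:// paths rather than something this section specifies.

```python
from typing import Any, Callable
from urllib.parse import urlparse

MSC_PROTOCOL = "msc://"


def open_with_dispatch(
    path: str,
    msc_open: Callable[..., Any],
    fallback_open: Callable[..., Any],
    **kwargs: Any,
) -> Any:
    """Route msc:// paths to msc_open and everything else to fallback_open."""
    if path.startswith(MSC_PROTOCOL):
        # msc_max_workers is only meaningful on the MSC branch.
        max_workers = kwargs.pop("msc_max_workers", None)
        parsed = urlparse(path)
        return msc_open(
            profile=parsed.netloc,  # assumed: the profile is the URL host part
            prefix=parsed.path.lstrip("/"),
            msc_max_workers=max_workers,
            **kwargs,
        )
    # Non-MSC paths are passed straight through to the wrapped function.
    return fallback_open(path, **kwargs)


def fake_msc_open(**kwargs: Any) -> tuple:
    return ("msc", kwargs)


def fake_fallback_open(path: str, **kwargs: Any) -> tuple:
    return ("fallback", path, kwargs)


print(open_with_dispatch("msc://my-profile/data/array.zarr", fake_msc_open, fake_fallback_open, msc_max_workers=8))
print(open_with_dispatch("/tmp/array.zarr", fake_msc_open, fake_fallback_open))
```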