User Guide

Concepts

MSC has 3 main concepts:

storage service

A service that stores objects/files such as AWS S3, Azure Blob Storage, Google Cloud Storage (GCS), NVIDIA AIStore, Oracle Cloud Infrastructure (OCI) Object Storage, POSIX file systems, and more.

provider

A provider implements generic object/file operations such as create, read, update, delete, and list, or supplies credentials for a specific storage service.

Providers are further subdivided into storage providers, metadata providers, and credentials providers.

Storage providers operate on a storage service directly.

Metadata providers operate on manifest files to accelerate object/file enumeration and metadata retrieval.

Credentials providers supply credentials for accessing objects/files.

client

The client exposes generic object and file operations such as create, read, update, delete, and list. It does validation and path translation before calling a provider. A client may bundle several providers together.

Installation

MSC is vended as the multi-storage-client package on PyPI.

The base client supports POSIX file systems by default, but there are extras for each storage service which provide the necessary package dependencies for its corresponding storage provider.

Install MSC with storage provider dependencies.
# POSIX file systems.
pip install multi-storage-client

# NVIDIA AIStore.
pip install "multi-storage-client[aistore]"

# Azure Blob Storage.
pip install "multi-storage-client[azure-storage-blob]"

# AWS S3 and S3-compatible object stores.
pip install "multi-storage-client[boto3]"

# Google Cloud Storage (GCS).
pip install "multi-storage-client[google-cloud-storage]"

# Oracle Cloud Infrastructure (OCI) Object Storage.
pip install "multi-storage-client[oci]"

MSC also implements adapters that let higher-level libraries like fsspec or PyTorch work with MSC. Likewise, there are extras for each higher-level library.

Install MSC with higher-level library adapter dependencies.
# fsspec.
pip install "multi-storage-client[fsspec]"

# PyTorch.
pip install "multi-storage-client[torch]"

# Xarray.
pip install "multi-storage-client[xarray]"

# Zarr.
pip install "multi-storage-client[zarr]"

Usage

Configuration

Before using the MSC, we need to create an MSC configuration. This configuration defines profiles which define provider configurations.

MSC configurations can be file or dictionary-based.

File-Based

File-based configurations are YAML or JSON-based.

YAML-based configuration.
profiles:
  default:
    storage_provider:
      type: file
      options:
        base_path: /
  my-profile:
    storage_provider:
      type: s3
      options:
        base_path: my-bucket
    metadata_provider:
      type: manifest
      options:
        manifest_path: .msc_manifests

JSON-based configuration.
{
  "profiles": {
    "default": {
      "storage_provider": {
        "type": "file",
        "options": {
          "base_path": "/"
        }
      }
    },
    "my-profile": {
      "storage_provider": {
        "type": "s3",
        "options": {
          "base_path": "my-bucket"
        }
      },
      "metadata_provider": {
        "type": "manifest",
        "options": {
          "manifest_path": ".msc_manifests"
        }
      }
    }
  }
}

The schema for each profile object is the constructor keyword arguments for multistorageclient.StorageClientConfig with these additions:

  • A type field for each provider set to a keyword (e.g. file, s3) or fully-qualified Python class name (e.g. my_module.providers.CustomProvider) to indicate which provider to use.

  • A provider_bundle field set to a fully-qualified Python class name (e.g. my_module.providers.CustomProviderBundle) which implements multistorageclient.types.ProviderBundle to indicate which provider bundle to use.

    • This takes precedence over the other provider fields.

Note

The default profile can only use file as the storage provider type.

You must create non-default profiles to use other storage providers.
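
The constraint on the default profile can be checked before a configuration dictionary is handed to MSC. The following is a minimal sketch; check_default_profile is a hypothetical helper and not part of MSC, and MSC's own validation may behave differently.

```python
def check_default_profile(config: dict) -> None:
    """Raise ValueError if the default profile uses a non-file storage provider.

    Hypothetical helper for illustration only; not an MSC API.
    """
    default = config.get("profiles", {}).get("default")
    if default is None:
        return  # No default profile declared, so there is nothing to check.
    provider_type = default.get("storage_provider", {}).get("type")
    if provider_type != "file":
        raise ValueError(
            f"default profile must use the 'file' storage provider, got {provider_type!r}"
        )


config = {
    "profiles": {
        "default": {
            "storage_provider": {"type": "file", "options": {"base_path": "/"}}
        },
        "my-profile": {
            "storage_provider": {"type": "s3", "options": {"base_path": "my-bucket"}}
        },
    }
}

# Passes silently because the default profile uses the file provider.
check_default_profile(config)
```

Running a check like this early gives a clear error message instead of a failure later, when the client is constructed.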

Note

The credentials_provider field is optional.

If omitted, the client used by the storage provider will use its default credentials sourcing mechanism (e.g. environment variables, configuration files, environment metadata services).

Omitting this field is recommended if you plan on storing your MSC configuration file in source control (e.g. Git).

The options field for provider objects is passed as arguments to multistorageclient.providers class constructors.

MSC checks for file-based configurations with the following priority:

  1. /etc/msc_config.yaml

  2. ~/.config/msc/config.yaml

  3. ~/.msc_config.yaml

  4. /etc/msc_config.json

  5. ~/.config/msc/config.json

  6. ~/.msc_config.json
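
The priority list above amounts to a first-match search. The sketch below illustrates that idea in plain Python; find_config is a hypothetical helper, and MSC's real lookup code may differ in details such as environment variable overrides.

```python
import os
from typing import Iterable, Optional

# Search order copied from the priority list in this guide.
DEFAULT_CANDIDATES = (
    "/etc/msc_config.yaml",
    "~/.config/msc/config.yaml",
    "~/.msc_config.yaml",
    "/etc/msc_config.json",
    "~/.config/msc/config.json",
    "~/.msc_config.json",
)


def find_config(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the first candidate path that exists as a file, or None.

    Hypothetical helper for illustration only; not an MSC API.
    """
    for candidate in candidates:
        path = os.path.expanduser(candidate)  # Expand "~" to the home directory.
        if os.path.isfile(path):
            return path
    return None


# Prints the winning path on this machine, or None if no file exists.
print(find_config())
```

Because the search stops at the first match, a YAML file in /etc shadows any configuration in your home directory.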

Dictionary-Based

Note

This option can only be used if you create multistorageclient.StorageClient instances directly. See Object/File Operations for the different ways to interact with MSC.

Dictionary-based configurations use Python dictionaries with multistorageclient.StorageClientConfig.from_dict().

The schema is the same as file-based configurations.

from multistorageclient import StorageClient, StorageClientConfig

config = StorageClientConfig.from_dict(
    config_dict={
        "profiles": {
            "default": {
                "storage_provider": {
                    "type": "file",
                    "options": {
                        "base_path": "/"
                    }
                }
            }
        }
    }
)

client = StorageClient(config=config)

Rclone-Based

MSC also supports using an rclone configuration file as the source for MSC profiles. This is particularly useful if you already have an rclone configuration file and want to leverage the same profiles for MSC.

In an rclone configuration file, profiles are defined as INI sections, and the keys follow rclone’s naming conventions. MSC will parse these files to create the corresponding provider configurations.

Rclone-based configuration.
[my-profile]
type = s3
base_path = my-bucket
access_key_id = my-access-key-id
secret_key_id = my-secret-key-id
endpoint = https://my-endpoint
region = us-east-1
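
To see how INI sections map to profiles, the sketch below reads an rclone-style configuration with Python's standard configparser module. It only illustrates the file structure; read_rclone_profiles is a hypothetical helper, and the way MSC itself parses rclone files is not shown here.

```python
import configparser

# Sample rclone-style configuration (credentials omitted for brevity).
RCLONE_CONF = """\
[my-profile]
type = s3
base_path = my-bucket
endpoint = https://my-endpoint
region = us-east-1
"""


def read_rclone_profiles(text: str) -> dict:
    """Map each INI section name to a dict of its key/value pairs.

    Hypothetical helper for illustration only; not an MSC API.
    """
    # Disable interpolation so "%" characters in values are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


profiles = read_rclone_profiles(RCLONE_CONF)
print(profiles["my-profile"]["type"])
print(profiles["my-profile"]["endpoint"])
```

Each section name becomes a profile name, and the section's keys become that profile's flat set of options.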

MSC checks for rclone-based configurations with the following priority:

  1. The same directory as the rclone executable (if found in PATH).

  2. XDG_CONFIG_HOME/rclone/rclone.conf (if XDG_CONFIG_HOME is set).

  3. /etc/rclone.conf

  4. ~/.config/rclone/rclone.conf

  5. ~/.rclone.conf

Note

MSC file-based configurations use different configuration keys than rclone. For example, MSC uses endpoint_url for multistorageclient.providers.S3StorageProvider, whereas rclone expects endpoint. When reading an rclone configuration, MSC follows rclone's key names and defaults, so an existing rclone configuration works with MSC without modifying its keys.

Note

Rclone configurations primarily focus on storage access. Some MSC features, such as caching and observability, cannot be enabled through an rclone configuration. MSC therefore lets you use an rclone-based configuration for storage access alongside a file-based configuration for the additional features. You can also use the file-based configuration to add extra parameters, such as metadata_provider, to an individual profile.

Object/File Operations

There are three ways to interact with MSC:

Shortcuts

Shortcuts automatically create and manage multistorageclient.StorageClient instances for you. They only support file-based configuration.

from multistorageclient import open, download_file

# Create a client for the default profile and open a file.
file = open(url="msc://default/animal-photos/giant-panda.png")

# Reuse the client for the default profile and download a file.
download_file(
    url="msc://default/animal-photos/red-panda.png",
    local_path="/tmp/animal-photos/red-panda.png"
)

Shortcuts use msc://{profile name}/{file/object path relative to the storage provider's base path} URLs for file/object paths.
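
The URL layout can be illustrated with the standard library. This is a minimal sketch: split_msc_url and join_base_path are hypothetical helpers, and MSC's internal path translation may handle more cases (normalization, validation) than shown.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlparse


def split_msc_url(url: str) -> Tuple[str, str]:
    """Split msc://{profile}/{path} into (profile name, relative path).

    Hypothetical helper for illustration only; not an MSC API.
    """
    parsed = urlparse(url)
    if parsed.scheme != "msc" or not parsed.netloc:
        raise ValueError(f"not an msc:// URL: {url!r}")
    # The profile name sits where a hostname normally would.
    return parsed.netloc, parsed.path.lstrip("/")


def join_base_path(base_path: str, relative_path: str) -> str:
    """Resolve a relative path against a storage provider's base path."""
    return posixpath.join(base_path, relative_path)


profile, path = split_msc_url("msc://default/animal-photos/giant-panda.png")
print(profile)                    # Profile name taken from the URL.
print(join_base_path("/", path))  # Path resolved against base_path "/".
```

With the default profile's base_path of /, the URL above resolves to /animal-photos/giant-panda.png on the local file system.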

See multistorageclient for all shortcut methods.

Clients

There may be times when you want to create and manage clients by yourself for programmatic configuration or manual lifecycle control instead of using shortcuts.

You can create multistorageclient.StorageClientConfig and multistorageclient.StorageClient instances directly.

from multistorageclient import StorageClient, StorageClientConfig

# Use a file-based configuration.
config = StorageClientConfig.from_file()

# Use a dictionary-based configuration.
config = StorageClientConfig.from_dict(
    config_dict={
        "profiles": {
            "default": {
                "storage_provider": {
                    "type": "file",
                    "options": {
                        "base_path": "/"
                    }
                }
            }
        }
    }
)

# Create a client for the default profile.
client = StorageClient(config=config)

# Open a file.
file = client.open("tmp/animal-photos/red-panda.png")

Clients use file/object paths relative to the storage provider’s base path.

Higher-Level Libraries

The MSC adapters for higher-level libraries use shortcuts under the hood.

fsspec

multistorageclient.async_fs aliases the multistorageclient.contrib.async_fs module.

This module provides the multistorageclient.contrib.async_fs.MultiAsyncFileSystem class which implements fsspec’s AsyncFileSystem class.

Note: The msc:// protocol is automatically registered with fsspec when you install the multi-storage-client package.

import multistorageclient as msc

# Create an MSC-based AsyncFileSystem instance.
fs = msc.async_fs.MultiAsyncFileSystem()

# Create a client for the default profile and open a file.
file = fs.open("msc://default/animal-photos/red-panda.png")

# Reuse the client for the default profile and download a file.
fs.get_file(
    rpath="msc://default/animal-photos/red-panda.png",
    lpath="/tmp/animal-photos/red-panda.png"
)

NumPy

multistorageclient.numpy aliases the multistorageclient.contrib.numpy module.

This module provides load, memmap, and save methods for loading and saving NumPy arrays.

import multistorageclient as msc
import numpy

# Create a client for the default profile and load an array.
array = msc.numpy.load("msc://default/numpy-arrays/ndarray-1.npz")

# Reuse the client for the default profile and load a memory-mapped array.
mmarray = msc.numpy.memmap("msc://default/numpy-arrays/ndarray-1.bin")

# Reuse the client for the default profile and save an array.
msc.numpy.save(numpy.array([1, 2, 3, 4, 5], dtype=numpy.int32), "msc://default/numpy-arrays/ndarray-2.npz")

PyTorch

multistorageclient.torch aliases the multistorageclient.contrib.torch module.

This module provides load and save methods for loading and saving PyTorch data.

import multistorageclient as msc
import torch

# Create a client for the default profile and load a tensor.
tensor = msc.torch.load("msc://default/pytorch-tensors/tensor-1.pt")

# Reuse the client for the default profile and save a tensor.
msc.torch.save(torch.tensor([1, 2, 3, 4]), "msc://default/pytorch-tensors/tensor-2.pt")

Xarray

multistorageclient.xz aliases the multistorageclient.contrib.xarray module.

This module provides open_zarr for reading Xarray datasets from Zarr files/objects.

import multistorageclient as msc

# Create a client for the default profile and load a Zarr array into an Xarray dataset.
xarray_dataset = msc.xz.open_zarr("msc://default/abc.zarr")

Note: Xarray supports fsspec URLs natively, so you can use the standard Xarray interface with msc:// URLs.

import xarray

# Use the native Xarray interface to load a Zarr array into an Xarray dataset.
xarray_dataset = xarray.open_zarr("msc://default/abc.zarr")

Zarr

multistorageclient.zarr aliases the multistorageclient.contrib.zarr module.

This module provides open_consolidated for reading Zarr groups from files/objects.

import multistorageclient as msc

# Create a client for the default profile and load a Zarr array.
z = msc.zarr.open_consolidated("msc://default/abc.zarr")

Note: Zarr supports fsspec URLs natively, so you can use the standard Zarr interface with msc:// URLs.

import zarr

# Use the native Zarr interface to load a Zarr array.
z = zarr.open("msc://default/abc.zarr")

Manifests

Overview

A manifest is a file (or group of files) describing the objects in a dataset, such as their names, sizes, last-modified timestamps, and custom metadata tags. Manifests are optional, but they can greatly accelerate object listing and metadata retrieval for large datasets in object stores, which in turn speeds up data loading and parallel processing. By reading a manifest, MSC can quickly discover (list) or filter (glob) objects without having to iterate over every object in the bucket or prefix.
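
The benefit is easiest to see with a small example. The sketch below filters in-memory manifest entries with a glob pattern instead of listing a bucket. The entries are made-up sample data using the key and size_bytes fields from the parts manifest format in this guide, and glob_manifest is a hypothetical helper, not how MSC implements globbing.

```python
from fnmatch import fnmatchcase

# Made-up sample entries; a real manifest would be read from storage.
entries = [
    {"key": "train/cat-pic001.jpg", "size_bytes": 1048576},
    {"key": "train/cat-pic002.jpg", "size_bytes": 2097152},
    {"key": "val/dog-pic001.jpg", "size_bytes": 524288},
]


def glob_manifest(entries, pattern):
    """Return the keys of all entries whose key matches the glob pattern.

    Hypothetical helper for illustration only; not an MSC API.
    """
    # fnmatchcase matches case-sensitively on every platform.
    return [entry["key"] for entry in entries if fnmatchcase(entry["key"], pattern)]


matches = glob_manifest(entries, "train/*.jpg")
print(matches)

# Metadata such as total size is available without touching the object store.
total_bytes = sum(e["size_bytes"] for e in entries if e["key"] in matches)
print(total_bytes)
```

No request to the object store is needed for either the match or the size total, which is where the speedup comes from.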

Manifest Format

The MSC supports a manifest index (JSON) that references one or more parts manifests (JSONL). The main manifest or manifest index:

  • Declares a version.

  • Lists each part manifest, including its path.

The parts manifests are stored in JSON Lines (.jsonl) format, where each line is a separate object’s metadata. JSONL is more scalable than a single JSON array for large manifests because each line can be processed incrementally, avoiding excessive memory usage.

Example Main Manifest (JSON)
{
  "version": "1.0",
  "parts": [
    {
      "path": "parts/msc_manifest_part000001.jsonl"
    },
    {
      "path": "parts/msc_manifest_part000002.jsonl"
    }
  ]
}
Example Parts Manifest (JSONL)
{"key": "train/cat-pic001.jpg", "size_bytes": 1048576, "last_modified": "2024-09-05T15:45:00Z"}
{"key": "train/cat-pic002.jpg", "size_bytes": 2097152, "last_modified": "2024-09-05T15:46:00Z"}
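
Reading this format incrementally can be sketched with the standard json module. The sketch assumes local files, one complete JSON object per line in each parts manifest, and part paths that are relative to the directory holding the main manifest. iter_manifest_entries is a hypothetical helper; in MSC this work is done by the manifest metadata provider.

```python
import json
import os
from typing import Iterator


def iter_manifest_entries(index_path: str) -> Iterator[dict]:
    """Yield one metadata dict per object, reading parts line by line.

    Hypothetical helper for illustration only; not an MSC API.
    """
    with open(index_path, encoding="utf-8") as index_file:
        index = json.load(index_file)

    # Assumption: part paths are relative to the main manifest's directory.
    base_dir = os.path.dirname(index_path)
    for part in index["parts"]:
        part_path = os.path.join(base_dir, part["path"])
        with open(part_path, encoding="utf-8") as part_file:
            for line in part_file:
                if line.strip():  # Skip blank lines.
                    yield json.loads(line)
```

Because the function is a generator, only one line is held in memory at a time, so for example sum(entry["size_bytes"] for entry in iter_manifest_entries(path)) totals a dataset's size without loading the whole manifest.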

Manifest Storage Organization

This example demonstrates how manifests are organized. Here, we assume that manifests are stored alongside the data in the same bucket. However, this is not strictly required, as MSC also supports placing manifests in a different location.

s3://bucketA/
    └── .msc_manifests/
        ├── 2024-09-06T14:55:29Z/
        │   ├── msc_manifest_index.json                   # Main manifest file
        │   └── parts/
        │       ├── msc_manifest_part000001.jsonl         # Split part of the manifest
        │       ├── msc_manifest_part000002.jsonl
        │       └── msc_manifest_part000003.jsonl
        └── 2024-10-01T10:21:42Z/                         # New version of the manifest
            ├── msc_manifest_index.json
            └── parts/
                ├── msc_manifest_part000001.jsonl
                ├── msc_manifest_part000002.jsonl
                └── msc_manifest_part000003.jsonl
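
Since each manifest version lives in a directory named after a timestamp, choosing a version can be reduced to comparing directory names. The sketch below assumes the newest timestamp is the version to use, which this guide does not state explicitly; latest_manifest_version is a hypothetical helper, not an MSC API.

```python
from datetime import datetime
from typing import Iterable, Optional

# Directory name format used in the layout above, e.g. 2024-09-06T14:55:29Z.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def latest_manifest_version(names: Iterable[str]) -> Optional[str]:
    """Return the directory name with the most recent timestamp, or None.

    Assumption for illustration: the newest version is the one to read.
    """
    names = list(names)
    if not names:
        return None
    return max(names, key=lambda name: datetime.strptime(name, TIMESTAMP_FORMAT))


versions = ["2024-09-06T14:55:29Z", "2024-10-01T10:21:42Z"]
print(latest_manifest_version(versions))
```

Parsing the names into datetime values, rather than comparing raw strings, also rejects directory names that do not follow the timestamp format.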

Writing and Using Manifests Programmatically

MSC provides a multistorageclient.providers.ManifestMetadataProvider to read from and write to manifests, and a multistorageclient.providers.manifest_metadata.ManifestMetadataGenerator to generate the manifests. When manifests are configured as a “metadata provider,” MSC can utilize them for efficient object metadata retrieval.

Generating Manifests

Using the ManifestMetadataGenerator is straightforward. For example:

from multistorageclient import StorageClient
from multistorageclient.providers.manifest_metadata import ManifestMetadataGenerator

# Suppose we have two clients:
# data_storage_client: Reads the data files we want to include in the manifest.
# manifest_storage_client: Writes the manifest to the desired path (bucket/folder).

# This code enumerates all objects from data_storage_client, then writes out
# a main manifest + parts manifest(s) using manifest_storage_client.

ManifestMetadataGenerator.generate_and_write_manifest(
    data_storage_client=data_storage_client,
    manifest_storage_client=manifest_storage_client
)

Referencing Manifests in Configuration

When you set a profile’s metadata_provider to type: manifest, you must also provide the manifest_path option, which is the manifest path relative to the storage provider’s base_path. For example:

profiles:
  my-profile:
    storage_provider:
      type: s3
      options:
        base_path: "my-bucket"
    metadata_provider:
      type: manifest
      options:
        manifest_path: ".msc_manifests"

You can also store manifests in a different profile than your data. In that case, the metadata_provider refers to the profile that stores the manifests through the storage_provider_profile option. Here’s an example:

profiles:
  my-manifest-profile:
    storage_provider:
      type: s3
      options:
        base_path: "manifest-bucket"

  my-profile:
    storage_provider:
      type: s3
      options:
        base_path: "my-bucket"
    metadata_provider:
      type: manifest
      options:
        # Refer to the storage profile for the manifests.
        storage_provider_profile: "my-manifest-profile"
        # The resolved manifest path is manifest-bucket/.msc_manifests.
        manifest_path: ".msc_manifests"

Once configured, MSC automatically uses the manifests to speed up listing or retrieving metadata for objects whenever you perform MSC operations on that profile.