Manifests

A manifest is a file (or group of files) that describes the objects in a dataset: their names, sizes, last-modified timestamps, and custom metadata tags. Manifests are optional, but they can greatly accelerate object listing and metadata retrieval for large datasets in object stores.

A common approach is to prepare a manifest ahead of time that records per-object metadata (e.g. object/file paths, sizes, custom tags), so that data loading and parallel processing of very large datasets do not depend on listing the object store. By reading a manifest, MSC can discover (list) or filter (glob) objects without iterating over every object in the bucket or prefix.
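
To make the idea concrete, the sketch below treats a manifest as a plain list of records and answers list and glob queries locally, without touching the object store. It is an illustration only, not MSC's implementation: the `entries` data and the helper names `list_keys` and `glob_keys` are hypothetical, and the field names mirror the parts manifest example later in this document.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory manifest entries (not read from a real bucket).
entries = [
    {"key": "train/cat-pic001.jpg", "size_bytes": 1048576},
    {"key": "train/cat-pic002.jpg", "size_bytes": 2097152},
    {"key": "val/dog-pic001.png", "size_bytes": 524288},
]


def list_keys(entries, prefix=""):
    """Return the keys that start with the given prefix."""
    return [e["key"] for e in entries if e["key"].startswith(prefix)]


def glob_keys(entries, pattern):
    """Return the keys that match a shell-style pattern.

    Note: fnmatch's "*" also matches "/", so this is looser than a
    strict path glob.
    """
    return [e["key"] for e in entries if fnmatchcase(e["key"], pattern)]


train_keys = list_keys(entries, "train/")
png_keys = glob_keys(entries, "*.png")
```

Both queries run against local data only, which is why a manifest helps most when the bucket or prefix holds a very large number of objects.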

Manifest Format

MSC supports a manifest index (JSON) that references one or more parts manifests. The manifest index, also called the main manifest:

  • Declares a version.

  • Lists each part manifest, including its path.

  • Specifies the format of the manifest parts (JSONL or Parquet).

The parts manifests can be stored in either:

  • JSON Lines (JSONL) format (.jsonl): Each line is a separate object’s metadata. JSONL is more scalable than a single JSON array for large manifests because each line can be processed incrementally, avoiding excessive memory usage.

  • Parquet format (.parquet): A columnar storage format that provides efficient compression and faster read performance for large datasets. Requires the pyarrow package.

Example Main Manifest (JSON)
{
  "version": "1.0",
  "format": "jsonl",
  "parts": [
    {
      "path": "parts/msc_manifest_part000001.jsonl"
    },
    {
      "path": "parts/msc_manifest_part000002.jsonl"
    }
  ]
}
Example Parts Manifest (JSONL)
{"key": "train/cat-pic001.jpg", "size_bytes": 1048576, "last_modified": "2024-09-05T15:45:00Z"}
{"key": "train/cat-pic002.jpg", "size_bytes": 2097152, "last_modified": "2024-09-05T15:46:00Z"}
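
The two files above can be consumed with the standard library alone. The following is a minimal sketch, not MSC's reader: the function name `iter_manifest_entries` is hypothetical, it handles only JSONL parts, and it assumes that each part `path` is relative to the directory holding the index file and that each line of a part file is one complete JSON object.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_manifest_entries(index_path: Path) -> Iterator[dict]:
    """Yield one metadata dict per object, reading JSONL parts line by line."""
    index = json.loads(index_path.read_text(encoding="utf-8"))
    if index.get("format", "jsonl") != "jsonl":
        raise ValueError("this sketch only handles JSONL parts")
    for part in index["parts"]:
        # Assumption: part paths are relative to the index file's directory.
        part_path = index_path.parent / part["path"]
        with part_path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

Because the function is a generator, only one line is held in memory at a time, which is the property that makes JSONL suitable for large manifests. A caller could, for example, sum `size_bytes` over `iter_manifest_entries(Path("msc_manifest_index.json"))` to get the dataset size.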

Manifest Storage Organization

This example demonstrates how manifests are organized. Here, we assume that manifests are stored alongside the data in the same bucket. However, this is not strictly required, as MSC also supports placing manifests in a different location.

s3://bucketA/
    └── .msc_manifests/
        ├── 2024-09-06T14:55:29Z/
        │   ├── msc_manifest_index.json                   # Main manifest file
        │   └── parts/
        │       ├── msc_manifest_part000001.jsonl         # Split part of the manifest
        │       ├── msc_manifest_part000002.jsonl
        │       └── msc_manifest_part000003.jsonl
        └── 2024-10-01T10:21:42Z/                         # New version of the manifest
            ├── msc_manifest_index.json
            └── parts/
                ├── msc_manifest_part000001.jsonl
                ├── msc_manifest_part000002.jsonl
                └── msc_manifest_part000003.jsonl
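
Each manifest version lives in a directory named after a UTC timestamp, so choosing a version amounts to comparing those names. This document does not state how MSC selects among versions; the sketch below simply assumes the newest timestamp is wanted. The function name `latest_manifest_version` is hypothetical.

```python
from datetime import datetime, timezone


def latest_manifest_version(version_dirs):
    """Return the directory name with the newest timestamp, or None if empty."""

    def parse(name):
        # Directory names follow the layout above, e.g. 2024-09-06T14:55:29Z.
        stamp = datetime.strptime(name.rstrip("/"), "%Y-%m-%dT%H:%M:%SZ")
        return stamp.replace(tzinfo=timezone.utc)

    return max(version_dirs, key=parse, default=None)


newest = latest_manifest_version(["2024-09-06T14:55:29Z", "2024-10-01T10:21:42Z"])
```

Parsing the names rather than sorting them as strings makes a malformed directory name fail loudly with a `ValueError` instead of being silently ordered.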

Writing and Using Manifests Programmatically

MSC provides a multistorageclient.providers.ManifestMetadataProvider to read from and write to manifests, and a multistorageclient.generators.ManifestMetadataGenerator to generate the manifests. When manifests are configured as a “metadata provider,” MSC can utilize them for efficient object metadata retrieval.

Generating Manifests

Using the multistorageclient.generators.ManifestMetadataGenerator is straightforward. For example:

from multistorageclient import StorageClient
from multistorageclient.generators import ManifestMetadataGenerator
from multistorageclient.providers.manifest_formats import ManifestFormat

# Suppose we have two StorageClient instances:
# data_storage_client: Reads the data files we want to include in the manifest.
# manifest_storage_client: Writes the manifest to the desired path (bucket/folder).

# Generate a JSONL manifest (default)
ManifestMetadataGenerator.generate_and_write_manifest(
    data_storage_client=data_storage_client,
    manifest_storage_client=manifest_storage_client
)

# Generate a Parquet manifest (requires pyarrow)
ManifestMetadataGenerator.generate_and_write_manifest(
    data_storage_client=data_storage_client,
    manifest_storage_client=manifest_storage_client,
    manifest_format=ManifestFormat.PARQUET
)

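To use the Parquet format, install the pyarrow package through the parquet extra. The quotes keep shells such as zsh from interpreting the square brackets:

pip install "multi-storage-client[parquet]"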

Referencing Manifests in Configuration

When you set a profile’s metadata_provider to type: manifest, you must also provide the manifest_path option, which is the path to the manifests relative to the storage profile’s base_path.

You can also specify the format option to control the format used when writing new manifests (defaults to jsonl):

profiles:
  my-profile:
    storage_provider:
      type: s3
      options:
        base_path: my-bucket
    metadata_provider:
      type: manifest
      options:
        manifest_path: .msc_manifests
        format: parquet  # Optional: jsonl (default) or parquet

You can also store manifests in a different profile than your data. In that case, the metadata_provider refers to the storage profile that holds the manifests through the storage_provider_profile option. Here’s an example:

profiles:
  my-manifest-profile:
    storage_provider:
      type: s3
      options:
        base_path: manifest-bucket

  my-profile:
    storage_provider:
      type: s3
      options:
        base_path: my-bucket
    metadata_provider:
      type: manifest
      options:
        # Refer to the storage profile for the manifests
        storage_provider_profile: my-manifest-profile
        # The real path of the manifests in this profile will be manifest-bucket/.msc_manifests
        manifest_path: .msc_manifests
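
The path resolution described in the comments above can be spelled out in a few lines. This is a simplified sketch, not MSC's configuration loader: the profiles are flattened into plain dicts, and the function name `resolve_manifest_location` is hypothetical.

```python
import posixpath

# Hypothetical, simplified view of the two profiles above as plain dicts.
profiles = {
    "my-manifest-profile": {"base_path": "manifest-bucket"},
    "my-profile": {
        "base_path": "my-bucket",
        "metadata_provider": {
            "storage_provider_profile": "my-manifest-profile",
            "manifest_path": ".msc_manifests",
        },
    },
}


def resolve_manifest_location(profiles, name):
    """Join the manifest_path onto the base_path of the profile that stores manifests."""
    options = profiles[name]["metadata_provider"]
    # Fall back to the profile's own storage when no separate profile is named.
    storage_name = options.get("storage_provider_profile", name)
    return posixpath.join(profiles[storage_name]["base_path"], options["manifest_path"])


location = resolve_manifest_location(profiles, "my-profile")
```

With the storage_provider_profile option removed, the same function would return a path under my-bucket instead, which matches the single-profile configuration shown earlier.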

Once configured, MSC automatically uses the manifests to speed up listing or retrieving metadata for objects whenever you perform MSC operations on that profile.