Remote Dataset

Megatron Energon supports remote datasets. In versions later than 5.2.0, Energon file access is based on the Multi Storage Client (MSC). This means you can train or validate directly on data in any storage backend that MSC supports, simply by swapping the dataset path for a so-called MSC URL.

Prerequisites

To use a remote dataset, install Energon with one or more of the following extras:

  • s3

  • aistore

  • azure-blob-storage

  • google-cloud-storage

  • oci

like this:

pip install megatron-energon[s3,oci]

Set up the MSC config as described in the Multi Storage Client documentation. You can also use an rclone config with MSC, as described in the Energon documentation prior to 5.2.0.

For fast data loading, we recommend activating MSC local caching:

cache:
  size: 500G
  use_etag: true
  eviction_policy:
    policy: "fifo"
    refresh_interval: 3600
  cache_backend:
    cache_path: /tmp/msc_cache # prefer local NVMe; a Lustre path also works
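The eviction_policy above is set to "fifo". As a rough illustration of what that implies, here is a toy model of a size-bounded cache with first-in, first-out eviction. It assumes that "fifo" means the oldest cached entries are dropped first once the size limit is reached. The class FifoCacheModel is hypothetical and only illustrates the general technique; it is not MSC's implementation.

```python
from collections import OrderedDict
from typing import List


class FifoCacheModel:
    """Toy model of a size-bounded cache with first-in, first-out eviction.

    Hypothetical helper for illustration only, not part of MSC or Energon.
    """

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.entries: "OrderedDict[str, int]" = OrderedDict()  # key -> size
        self.used_bytes = 0

    def put(self, key: str, size: int) -> List[str]:
        """Insert an entry and return the keys evicted to make room."""
        evicted: List[str] = []
        if key in self.entries or size > self.capacity_bytes:
            # Already cached (FIFO does not refresh position) or too large.
            return evicted
        while self.used_bytes + size > self.capacity_bytes:
            # Evict the entry that was inserted earliest.
            old_key, old_size = self.entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_key)
        self.entries[key] = size
        self.used_bytes += size
        return evicted


cache = FifoCacheModel(capacity_bytes=10)
cache.put("shard_a.tar", 4)
cache.put("shard_b.tar", 4)
evicted_keys = cache.put("shard_c.tar", 4)  # does not fit without eviction
```

In this model the third insert evicts shard_a.tar, the entry that entered the cache first, regardless of how recently it was read. If your training loop revisits the same shards often, that is worth keeping in mind when choosing the cache size.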

Then point MSC to the config with:

export MSC_CONFIG=/path/to/msc_config.yaml
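A missing or mistyped MSC_CONFIG is easy to overlook until the first remote read fails. The following sketch checks the variable before any dataset is opened. The function resolve_msc_config is a hypothetical helper, not part of Energon or MSC, and it only verifies that the variable is set and points to an existing file; it does not validate the config contents.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_msc_config(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the path stored in MSC_CONFIG, or raise if it is unusable.

    Hypothetical helper for illustration only. `env` defaults to os.environ
    and can be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    value = env.get("MSC_CONFIG")
    if not value:
        raise RuntimeError("MSC_CONFIG is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"MSC_CONFIG points to a missing file: {path}")
    return path
```

Calling resolve_msc_config() at the top of a training script turns a late, backend-specific failure into an early and explicit one.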

The URL syntax

The syntax is as simple as

msc://CONFIG_NAME/PATH

For example:

msc://coolstore/mainbucket/datasets/somedata
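To make the two parts of the URL explicit, the sketch below splits an MSC URL into CONFIG_NAME and PATH using only the Python standard library. The names MscLocation and split_msc_url are hypothetical and exist only for this illustration; Energon and MSC parse these URLs internally, and you do not need to do this yourself.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class MscLocation(NamedTuple):
    """The two components of an msc://CONFIG_NAME/PATH URL."""

    config_name: str
    path: str


def split_msc_url(url: str) -> MscLocation:
    """Split an MSC URL into its config name and path (illustrative only)."""
    parts = urlsplit(url)
    if parts.scheme != "msc":
        raise ValueError(f"not an MSC URL: {url!r}")
    if not parts.netloc:
        raise ValueError(f"MSC URL has no CONFIG_NAME: {url!r}")
    # The part after 'msc://' up to the first '/' is CONFIG_NAME;
    # everything after that slash is PATH.
    return MscLocation(config_name=parts.netloc, path=parts.path.lstrip("/"))


location = split_msc_url("msc://coolstore/mainbucket/datasets/somedata")
# location.config_name is "coolstore"
# location.path is "mainbucket/datasets/somedata"
```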

You can use this URL in place of a dataset path:

  • In functions like get_train_dataset and get_val_dataset

  • Inside metadataset specifications

  • As an argument to energon prepare or energon lint. Note that these commands may be slow for remote locations.

  • As the path given to energon mount, to locally inspect your remote dataset 😎

Example usage:

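from megatron.energon import WorkerConfig, get_train_dataset

# The MSC URL takes the place of a local dataset path.
ds = get_train_dataset(
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)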