Remote Dataset

Megatron Energon has supported remote datasets since version 2.0.0. In versions after 5.2.0, Energon file access is based on Multi Storage Client (MSC). This means you can train or validate directly on data in any storage backend supported by MSC by swapping the dataset path for a so-called MSC URL.

Prerequisites

To use a remote dataset, install one or more of the following extras:

  • s3

  • aistore

  • azure-blob-storage

  • google-cloud-storage

  • oci

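For example, to install the s3 and oci extras (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "megatron-energon[s3,oci]"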

Set up the MSC configuration as described in the Multi Storage Client documentation.

You can also use an rclone configuration with MSC, as was described for remote datasets prior to 5.2.0.

The URL syntax

The syntax is as simple as:

msc://CONFIG_NAME/PATH

For example:

msc://coolstore/mainbucket/datasets/somedata
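To make the two parts of the URL concrete, here is a minimal sketch that splits an MSC URL into CONFIG_NAME and PATH using only the Python standard library. The helper name split_msc_url is hypothetical and is not part of Energon or MSC; it only illustrates the syntax above and is not how either library parses URLs internally.

```python
from urllib.parse import urlsplit


def split_msc_url(url: str) -> tuple[str, str]:
    """Split an MSC URL into (config_name, path).

    Hypothetical helper for illustration only; not an Energon or MSC API.
    """
    parts = urlsplit(url)
    if parts.scheme != "msc" or not parts.netloc:
        raise ValueError(f"not an MSC URL: {url!r}")
    # The netloc component holds CONFIG_NAME; the rest (without the
    # leading slash) is PATH as understood by the configured storage.
    return parts.netloc, parts.path.lstrip("/")


config_name, path = split_msc_url("msc://coolstore/mainbucket/datasets/somedata")
print(config_name)  # coolstore
print(path)         # mainbucket/datasets/somedata
```

In this example, coolstore would be the name of an entry in your MSC configuration, and mainbucket/datasets/somedata the location of the dataset within that storage.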

You can use this URL in place of a dataset path:

  • In functions like get_train_dataset and get_val_dataset

  • Inside metadataset specifications

  • As an argument to energon prepare or energon lint. Note that these commands may be slow for remote locations.

Example usage:

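from megatron.energon import WorkerConfig, get_train_dataset

ds = get_train_dataset(
    # MSC URL in place of a local dataset path
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    # Recent Energon versions expect a worker configuration; the default is used here
    worker_config=WorkerConfig.default_worker_config(),
)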