Remote Dataset

Since version 2.0.0, Megatron Energon supports remote datasets stored in an object store bucket with an S3-compatible interface. You can train or validate directly from that object store by replacing the dataset path with a so-called rclone URL.

Prerequisites

To use this feature, you need to set up an Rclone configuration. Rclone is open-source software for managing files on cloud storage. Energon does not depend on the Rclone software itself, but it relies on the same configuration mechanism.

If you prefer not to install or use Rclone, that’s fine, but you will still need to set up a compatible config file. We recommend using Rclone anyway, since it’s a great tool.

Once you have set up your config at ~/.config/rclone/rclone.conf, it might look like this:

[coolstore]
type = s3
provider = Other
access_key_id = MY_ACCESS_KEY_ID
secret_access_key = MY_SECRET_ACCESS_KEY
region = us-east-1
endpoint = pdx.s8k.io
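
Because the config file uses an INI-style layout, you can sanity-check a hand-written one without installing Rclone. The following is a minimal sketch using Python's standard configparser module; it is not how Energon itself loads the config, and the helper name read_rclone_remote is hypothetical, chosen only for illustration.

```python
import configparser


def read_rclone_remote(config_text: str, remote_name: str) -> dict:
    """Return the settings of one remote from rclone-style config text.

    Hypothetical helper for illustration; not part of Energon or Rclone.
    """
    # Disable interpolation so a '%' inside a secret is kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section(remote_name):
        raise KeyError(f"remote {remote_name!r} not found in config")
    return dict(parser[remote_name])


# Same content as the example config above.
example_config = """
[coolstore]
type = s3
provider = Other
access_key_id = MY_ACCESS_KEY_ID
secret_access_key = MY_SECRET_ACCESS_KEY
region = us-east-1
endpoint = pdx.s8k.io
"""

remote = read_rclone_remote(example_config, "coolstore")
print(remote["type"], remote["endpoint"])
```

To check a real file, you could pass in the text read from ~/.config/rclone/rclone.conf instead of the inline string.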

The URL syntax

The syntax is as simple as:

rclone://RCLONE_NAME/BUCKET/PATH

For example:

rclone://coolstore/mainbucket/datasets/somedata
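
To make the three parts of the URL concrete, here is a small sketch that splits such a URL into the remote name, bucket, and path using the standard urllib.parse module. The names RcloneUrl and split_rclone_url are hypothetical and only illustrate the syntax; Energon's own parsing may differ.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class RcloneUrl(NamedTuple):
    """Parts of an rclone://RCLONE_NAME/BUCKET/PATH URL (illustrative only)."""

    remote: str  # section name in rclone.conf, e.g. "coolstore"
    bucket: str  # bucket on the object store
    path: str    # path inside the bucket, may be empty


def split_rclone_url(url: str) -> RcloneUrl:
    """Split an rclone URL into remote name, bucket, and path."""
    parts = urlsplit(url)
    if parts.scheme != "rclone" or not parts.netloc:
        raise ValueError(f"not an rclone URL: {url!r}")
    # The first path component is the bucket, the rest is the path within it.
    bucket, _, path = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in rclone URL: {url!r}")
    return RcloneUrl(parts.netloc, bucket, path)


parsed = split_rclone_url("rclone://coolstore/mainbucket/datasets/somedata")
print(parsed.remote, parsed.bucket, parsed.path)
```

Here the remote name must match a section in the config file, such as [coolstore] in the example config.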

You can use this URL in place of a dataset path:

  • In functions like get_train_dataset and get_val_dataset

  • Inside metadataset specifications

  • As an argument to energon prepare or energon lint. Note that these commands may be slow for remote locations.

Example usage:

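from megatron.energon import get_train_dataset

# Load the training split directly from the remote object store.
ds = get_train_dataset(
    'rclone://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)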