Remote Dataset
Megatron Energon supports remote datasets. In versions later than 5.2.0, Energon file access is based on Multi Storage Client (MSC). This means you can train or validate directly from any storage backend that MSC supports by swapping the dataset path for a so-called MSC URL.
Prerequisites
To use a remote dataset, install Energon with one or more of the following extras:
s3
aistore
azure-blob-storage
google-cloud-storage
oci
like this:
pip install megatron-energon[s3,oci]
Set up the MSC config as described in the Multi Storage Client documentation. You can also use an rclone config with MSC, as was documented for versions prior to 5.2.0.
For fast data loading, we recommend activating MSC local caching:
cache:
  size: 500G
  use_etag: true
  eviction_policy:
    policy: "fifo"
    refresh_interval: 3600
  cache_backend:
    cache_path: /tmp/msc_cache  # prefer local NVMe, but a Lustre path also works
Then point MSC to the config file with:
export MSC_CONFIG=/path/to/msc_config.yaml
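Before launching a job, it can help to confirm that the variable actually points at a readable file. The following is a minimal sketch using only the Python standard library. The helper name `resolve_msc_config` is hypothetical and is not part of Energon or MSC, and the sketch treats an unset variable as an error, which may be stricter than you need if your setup provides the config some other way.

```python
import os
from pathlib import Path


def resolve_msc_config(env=None):
    """Return the config path named by MSC_CONFIG, or raise if it is unusable.

    Hypothetical helper for illustration; not part of Energon or MSC.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    value = env.get("MSC_CONFIG")
    if not value:
        raise RuntimeError("MSC_CONFIG is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"MSC_CONFIG points to a missing file: {path}")
    return path
```

Calling `resolve_msc_config()` at the start of a training script fails early with a clear message instead of a later storage error.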
The URL syntax
The syntax is as simple as
msc://CONFIG_NAME/PATH
For example:
msc://coolstore/mainbucket/datasets/somedata
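To make the two parts of the URL concrete, here is a small sketch that splits an MSC URL into its config name and path using the Python standard library. The function `split_msc_url` is a hypothetical helper for illustration only. Energon and MSC do their own URL handling, and this is not their implementation.

```python
from urllib.parse import urlsplit


def split_msc_url(url):
    """Split 'msc://CONFIG_NAME/PATH' into (config_name, path).

    Hypothetical helper for illustration; not part of Energon or MSC.
    """
    parts = urlsplit(url)
    # The config name sits where a hostname would be in an HTTP URL.
    if parts.scheme != "msc" or not parts.netloc:
        raise ValueError(f"not an MSC URL: {url!r}")
    return parts.netloc, parts.path.lstrip("/")


config_name, path = split_msc_url("msc://coolstore/mainbucket/datasets/somedata")
print(config_name)  # coolstore
print(path)         # mainbucket/datasets/somedata
```

Here `coolstore` is the name that must match an entry in your MSC config, and the remainder is the location of the dataset within that storage.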
You can use this URL in place of a dataset path:
In functions like get_train_dataset and get_val_dataset
Inside metadataset specifications
As an argument to energon prepare or energon lint (note that these may be slow for remote locations)
As the path given to energon mount, to inspect your remote dataset locally 😎
Example usage:
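from megatron.energon import get_train_dataset

# Pass the MSC URL where a local dataset path would otherwise go.
ds = get_train_dataset(
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)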