Remote Dataset
Since version 2.0.0, Megatron Energon supports remote datasets. In versions later than 5.2.0, Energon file access is based on Multi Storage Client (MSC). This means you can train or validate on your data directly from remote storage by swapping the dataset path for a so-called MSC URL.
Prerequisites
To use a remote dataset, install one or more of the extras:
s3
aistore
azure-blob-storage
google-cloud-storage
oci
like this:
pip install megatron-energon[s3,oci]
Set up the MSC config as described in the Multi Storage Client documentation.
You can also use the rclone config with MSC, as described in the documentation prior to 5.2.0.
The URL syntax
The syntax is as simple as
msc://CONFIG_NAME/PATH
For example:
msc://coolstore/mainbucket/datasets/somedata
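To make the two parts of the URL concrete, here is a minimal Python sketch that splits an MSC URL into CONFIG_NAME and PATH using only the standard library. The helper name `split_msc_url` is hypothetical and chosen for illustration; it is not part of Energon or MSC, which do their own parsing internally.

```python
from urllib.parse import urlsplit


def split_msc_url(url: str) -> tuple[str, str]:
    """Split 'msc://CONFIG_NAME/PATH' into (CONFIG_NAME, PATH).

    Hypothetical helper for illustration only; not an Energon or MSC API.
    """
    parts = urlsplit(url)
    if parts.scheme != "msc":
        raise ValueError(f"Not an MSC URL: {url!r}")
    if not parts.netloc:
        raise ValueError(f"Missing CONFIG_NAME in: {url!r}")
    # The URL path starts with '/', which separates it from CONFIG_NAME.
    return parts.netloc, parts.path.lstrip("/")


config_name, path = split_msc_url("msc://coolstore/mainbucket/datasets/somedata")
print(config_name)  # coolstore
print(path)         # mainbucket/datasets/somedata
```

In the example URL, `coolstore` would be the name of an entry in your MSC config, and the rest is the location of the dataset inside that storage.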
You can use this URL instead of paths to datasets in:
Functions like get_train_dataset, get_val_dataset
Inside metadataset specifications
As arguments to energon prepare or energon lint. Note that those may be slow for remote locations.
Example usage:
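from megatron.energon import get_train_dataset

# Pass the MSC URL where a local dataset path would otherwise go.
ds = get_train_dataset(
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)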
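Because the URL is a drop-in replacement for the path, one way to switch between a local copy and the remote dataset without touching the training code is to read the location from the environment. The following is a minimal sketch under assumptions: the function `dataset_location`, the variable name `DATASET_URL`, and the local path `/data/somedata` are all hypothetical and not part of Energon.

```python
import os


def dataset_location(default_path: str, env_var: str = "DATASET_URL") -> str:
    """Return the value of env_var if it is set and non-empty, else default_path.

    'DATASET_URL' is a hypothetical variable name chosen for this sketch.
    """
    return os.environ.get(env_var) or default_path


# Local run: the variable is unset, so the local path is used.
os.environ.pop("DATASET_URL", None)
local_choice = dataset_location("/data/somedata")
print(local_choice)

# Remote run: point the same code at the remote copy via an MSC URL.
os.environ["DATASET_URL"] = "msc://coolstore/mainbucket/datasets/somedata"
remote_choice = dataset_location("/data/somedata")
print(remote_choice)
```

The returned string can then be passed as the first argument to get_train_dataset in place of the hard-coded URL.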