Dataset Subsets

Dataset subsets allow restricting a dataset (or parts of a metadataset hierarchy) to a specific portion of the available samples. This is useful for rapid prototyping, ablation studies, different training stages, or constructing disjoint train/validation/test splits that differ from the original dataset configuration.

A subset is defined by a two-element range list [start, end], where start is inclusive and end is exclusive. Each element can be either:

  • a percentage string (e.g. "0%", "12.5%", "100%") – interpreted relative to each inner dataset size, or

  • an absolute integer – interpreted as a sample index. Absolute indices are only allowed for leaf datasets (path to a prepared dataset containing .nv-meta).
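
To make the range semantics concrete, here is a minimal Python sketch of how a [start, end] pair could be turned into sample indices for a single dataset. It is an illustration only, not the library's implementation: the helper names resolve_element and resolve_range are hypothetical, and rounding percentages down is an assumption of this sketch.

```python
from typing import List, Union

RangeElement = Union[int, str]


def resolve_element(value: RangeElement, size: int) -> int:
    """Map one range element to a sample index for a dataset of `size` samples."""
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError(f"expected a percentage string, got {value!r}")
        # Percentage: relative to the dataset size.
        # Rounding down is an assumption made for this sketch.
        return int(float(value[:-1]) * size / 100)
    # Absolute integer: already a sample index.
    return value


def resolve_range(subset_range: List[RangeElement], size: int) -> range:
    """Return the selected indices; start is inclusive, end is exclusive."""
    start, end = subset_range
    return range(resolve_element(start, size), resolve_element(end, size))


print(list(resolve_range(["0%", "80%"], 10)))   # indices 0..7
print(len(resolve_range([0, 1000], 5000)))      # 1000 samples
```

Because end is exclusive, two subsets such as [0%, 80%] and [80%, 100%] do not share any sample.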

Basic example

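The snippet below keeps the first 80% of the COYO train split (as defined in the split.yaml) for training and evaluates on the remaining 20% of the same train split. Note how the subset key is placed directly next to the corresponding path. The val entry sets split: train so that it draws from the train split rather than from the dataset's own validation split.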

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [0%, 80%]}
  val:
    path: ./coyo
    split: train
    subset: {range: [80%, 100%]}

Nested subsets and merging rules

Subsets can appear at any level that ultimately yields samples (direct path reference to a prepared dataset containing .nv-meta, join, blend, blend_epochized). When multiple subsets are nested, the inner subset is applied first, then the portion selected by the outer subset is applied within the already selected range. For percentages the ranges are composed multiplicatively.

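Example: an inner subset [25%, 75%] combined with an outer subset [0%, 50%] results in the final range [25%, 50%] of the original dataset. The inner subset spans 50% of the data (from 25% to 75%), and the outer subset keeps the first half of that span.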
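
The composition rule for percentage ranges can be sketched in a few lines of Python. This is an illustration of the arithmetic described above, not the library's code; the function name compose is hypothetical, and ranges are expressed as fractions of the original dataset (25% becomes 1/4).

```python
from fractions import Fraction
from typing import Tuple

FractionRange = Tuple[Fraction, Fraction]


def compose(inner: FractionRange, outer: FractionRange) -> FractionRange:
    """Apply `outer` within the portion already selected by `inner`.

    The inner subset is applied first, so the outer range is rescaled
    to the span that the inner range leaves over.
    """
    inner_start, inner_end = inner
    outer_start, outer_end = outer
    span = inner_end - inner_start
    return (inner_start + outer_start * span, inner_start + outer_end * span)


inner = (Fraction(25, 100), Fraction(75, 100))  # [25%, 75%]
outer = (Fraction(0), Fraction(50, 100))        # [0%, 50%]
start, end = compose(inner, outer)
print(f"[{float(start):.0%}, {float(end):.0%}]")
```

Using Fraction avoids floating-point rounding when several levels are nested; plain floats would work equally well for a quick check.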

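Absolute indices restrict merging: they can only be specified at the leaf level and must not be combined with another absolute range farther up the hierarchy. An outer percentage range can still be applied on top of an absolute leaf range, as the advanced example below shows.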

Absolute ranges

Absolute indices are handy when exact sample counts are required. For example, the range [0, 1000] selects exactly 1000 samples (indices 0 to 999).

Advanced examples

The following configuration combines absolute ranges with the nesting rules. The inner subset takes the first 1000 samples of the COCO train split and mixes them with the full COYO train split using weight-based blending. The outer subset then reduces each inner range to its first 50%, so the result effectively blends the first 500 samples of COCO with the first 50% of COYO.

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # The subset is applied to each blended dataset separately.
    # I.e. for the first, the sample range is [0, 500], for the second the range is [0%, 50%]
    subset: {range: [0%, 50%]}
    blend:
      - weight: 1.0
        path: ./coco
        subset:
          # Take exactly 1000 samples (indices 0-999)
          range: [0, 1000]
      - weight: 1.0
        path: ./coyo
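
The sample range stated in the comments of this configuration can be checked with a short calculation. The sketch below is illustrative only: the function apply_outer_percentage is hypothetical, and rounding down is an assumption. It narrows an absolute leaf range by an outer percentage range.

```python
from typing import Tuple


def apply_outer_percentage(
    inner: Tuple[int, int], outer_percent: Tuple[float, float]
) -> Tuple[int, int]:
    """Narrow an absolute [start, end) range by an outer percentage range."""
    inner_start, inner_end = inner
    outer_start, outer_end = outer_percent
    length = inner_end - inner_start
    # The outer percentages are relative to the already selected samples.
    # Rounding down is an assumption made for this sketch.
    new_start = inner_start + int(length * outer_start / 100)
    new_end = inner_start + int(length * outer_end / 100)
    return new_start, new_end


# COCO: inner range [0, 1000], outer range [0%, 50%]
coco_range = apply_outer_percentage((0, 1000), (0, 50))
print(coco_range, coco_range[1] - coco_range[0])
```

For the COYO entry no inner subset is given, so the outer range [0%, 50%] applies to the whole train split directly.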

Absolute ranges can also be specified to run up to the end of the dataset using the end keyword:

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [1422, end]}

Python usage

No API changes are required on the Python side, since subsets are fully specified in the YAML. Simply load the metadataset with the regular helpers, such as get_train_dataset and get_val_dataset.