# Dataset Subsets

Dataset subsets allow restricting a dataset (or parts of a metadataset hierarchy) to a specific portion of the available samples.
This is useful for rapid prototyping, ablation studies, different training stages, or constructing disjoint train/validation/test splits that differ from the original dataset configuration.

A subset is defined by a two-element `range` list of the form `[start, end]`, where `start` is inclusive and `end` is exclusive.
Each element can be either

* a **percentage** string (e.g. `"0%"`, `"12.5%"`, `"100%"`), interpreted relative to the size of each inner dataset, or
* an **absolute** integer, interpreted as a sample index.

Absolute indices are only allowed for *leaf* datasets (a `path` to a prepared dataset containing `.nv-meta`).

## Basic example

The snippet below keeps the first 80% of the *COYO* `train` split (as defined in its `split.yaml`) for training and evaluates on the remaining 20% of the same `train` split.
Note how the `subset` key is placed directly next to the corresponding `path`.

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [0%, 80%]}
  val:
    path: ./coyo
    split: train
    subset: {range: [80%, 100%]}
```

## Nested subsets and merging rules

Subsets can appear at any level that ultimately yields samples: a direct `path` reference to a prepared dataset containing `.nv-meta`, `join`, `blend`, or `blend_epochized`.
When multiple subsets are nested, the *inner* subset is applied first; the *outer* subset then selects its portion *within* the range that the inner subset already selected.

For percentages, the ranges are composed multiplicatively.
Example: an inner subset `[25%, 75%]` selects the middle half of the dataset, and an outer subset `[0%, 50%]` keeps the first half of that selection, which results in the final range `[25%, 50%]` of the original dataset.

Absolute indices short-circuit merging: they can **only** be specified at the leaf level and must not be combined with another absolute range farther up the hierarchy.
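
The multiplicative composition rule can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic described above, not energon's implementation; the helper names `parse_percent` and `compose_ranges` are hypothetical.

```python
def parse_percent(text):
    """Convert a percentage string such as "12.5%" into the fraction 0.125."""
    return float(text.rstrip("%")) / 100.0


def compose_ranges(inner, outer):
    """Apply `outer` within the span already selected by `inner`.

    Both arguments are (start, end) fractions in [0, 1]; `inner` is relative
    to the original dataset, `outer` is relative to the span `inner` selects.
    The result is relative to the original dataset.
    """
    inner_start, inner_end = inner
    outer_start, outer_end = outer
    width = inner_end - inner_start
    return (inner_start + outer_start * width, inner_start + outer_end * width)


# Worked example from the text: inner [25%, 75%], outer [0%, 50%].
inner = (parse_percent("25%"), parse_percent("75%"))
outer = (parse_percent("0%"), parse_percent("50%"))
print(compose_ranges(inner, outer))  # (0.25, 0.5), i.e. [25%, 50%]
```

Because the outer range is always scaled by the width of the inner selection, nesting can only narrow the selected portion, never widen it.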
## Absolute ranges

Absolute indices are handy when exact sample counts are required.

## Advanced examples

The following configuration combines absolute ranges with the nesting rules.
The inner subset takes the first **1000** samples from the *COCO* `train` split and mixes them with the full *COYO* `train` split using weight-based blending.
The outer subset then reduces each inner range to its first 50%, so the result effectively blends the first **500** samples of *COCO* with the first **50%** of *COYO*.

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # The subset is applied to each blended dataset separately.
    # I.e. for the first, the sample range is [0, 500], for the second the range is [0%, 50%]
    subset: {range: [0%, 50%]}
    blend:
      - weight: 1.0
        path: ./coco
        subset:
          # Take exactly 1000 samples (indices 0-999)
          range: [0, 1000]
      - weight: 1.0
        path: ./coyo
```

An absolute range can also run up to the end of the dataset by using the `end` keyword:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [1422, end]}
```

## Python usage

No API changes are required on the Python side: subsets are fully specified in the YAML.
Simply load the dataset with the regular helpers.
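
To sanity-check a configuration before loading it, the range semantics described on this page can be reproduced in plain Python. The sketch below is not part of the energon API: `resolve_bound`, `resolve_range`, and `nest` are hypothetical helpers, the dataset sizes are made up, and truncating percentages towards zero is an assumption, so energon's actual rounding may differ.

```python
def resolve_bound(bound, size):
    """Turn one range element into an absolute sample index.

    Accepts an int (absolute index), the keyword "end", or a percentage
    string such as "50%". Assumption: percentages are truncated towards zero.
    """
    if bound == "end":
        return size
    if isinstance(bound, str) and bound.endswith("%"):
        return int(size * float(bound[:-1]) / 100.0)
    return int(bound)


def resolve_range(range_, size):
    """Resolve a two-element [start, end) range against a dataset of `size` samples."""
    start, end = (resolve_bound(bound, size) for bound in range_)
    if not 0 <= start <= end <= size:
        raise ValueError(f"range {range_!r} does not fit a dataset of size {size}")
    return start, end


def nest(inner, outer, size):
    """Resolve `inner` against the dataset, then `outer` within the selected span.

    `outer` is expected to be a percentage range, since absolute indices are
    only allowed at the leaf level.
    """
    inner_start, inner_end = resolve_range(inner, size)
    outer_start, outer_end = resolve_range(outer, inner_end - inner_start)
    return inner_start + outer_start, inner_start + outer_end


# Hypothetical dataset sizes, chosen only for illustration.
print(resolve_range([1422, "end"], 2000))    # (1422, 2000)
print(nest([0, 1000], ["0%", "50%"], 5000))  # (0, 500)
```

The second call mirrors the blended *COCO* entry from the advanced example: the leaf selects indices 0 to 999, and the outer `[0%, 50%]` keeps the first 500 of them.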