dataloaders

DataLoader utilities for language model training and validation.

Functions

create_validation_dataloader

create_padded_tensor

create_padded_tensor(tensor, desired_shape, padding_value=0)
Parameters:
  • tensor (TensorT)
  • desired_shape (Sequence[int])
  • padding_value (float)

Return type: TensorT
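
The signature above suggests that the function grows a tensor to desired_shape and fills the new positions with padding_value. The sketch below illustrates that idea only; it is not the library's implementation. It uses nested Python lists as a stand-in for a tensor so that it runs without dependencies. The helper names pad_nested and _full are hypothetical, and padding on the trailing side of each dimension and raising when the input is already larger than the target are both assumptions.

```python
from typing import Sequence


def _full(shape: Sequence[int], value):
    """Build a nested list of the given shape filled with `value`."""
    if not shape:
        return value
    return [_full(shape[1:], value) for _ in range(shape[0])]


def pad_nested(values, desired_shape: Sequence[int], padding_value=0):
    """Pad a nested list up to `desired_shape` (hypothetical stand-in).

    Assumption: padding is appended at the end of every dimension.
    Assumption: an input larger than the target is an error.
    """
    if not desired_shape:
        # No dimensions left, so `values` is a scalar element.
        return values
    target, rest = desired_shape[0], tuple(desired_shape[1:])
    if len(values) > target:
        raise ValueError(f"dimension of size {len(values)} exceeds target {target}")
    # Pad each existing entry along the remaining dimensions.
    padded = [pad_nested(v, rest, padding_value) for v in values]
    # Append fully padded entries until this dimension reaches `target`.
    padded.extend(_full(rest, padding_value) for _ in range(target - len(values)))
    return padded


print(pad_nested([[1, 2], [3, 4]], (3, 4)))
# [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 0, 0]]
```

The real function accepts and returns the same tensor type (TensorT), so the caller gets back an object of the kind that was passed in.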

create_validation_dataloader(accelerator, seed, tokenizer, block_size, dataset, content_field, fim_rate, fim_spm_rate, micro_batch_size, eval_samples=None, load_dataset_fn=load_from_disk_fn, dataset_name='__auto__', keep_in_memory=False, source_datasets_to_discard=(), bos_rate=1.0, varlen=True, shuffle_seed=None)
Parameters:
  • accelerator (Accelerator | None)
  • seed (int)
  • tokenizer (PreTrainedTokenizerBase)
  • block_size (int)
  • dataset (str | Mapping[str, Dataset])
  • content_field (str)
  • fim_rate (float)
  • fim_spm_rate (float)
  • micro_batch_size (int)
  • eval_samples (int | None)
  • load_dataset_fn (LoadDatasetFn)
  • dataset_name (str)
  • keep_in_memory (bool)
  • source_datasets_to_discard (Sequence[str])
  • bos_rate (float)
  • varlen (bool)
  • shuffle_seed (int | None)
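
Because the function takes many arguments, it can help to assemble them as keyword arguments and check the plainly typed ones before the call. The sketch below does that for the int, str, float, and bool parameters listed above. The helper check_simple_arguments and every value in example_kwargs are hypothetical; only the parameter names, their documented types, and the defaults shown in the signature come from this page. The call itself is left out because it needs a real tokenizer and dataset.

```python
# Documented types for the parameters that use plain built-in types.
DOCUMENTED_TYPES = {
    "seed": int,
    "block_size": int,
    "content_field": str,
    "fim_rate": float,
    "fim_spm_rate": float,
    "micro_batch_size": int,
    "dataset_name": str,
    "keep_in_memory": bool,
    "bos_rate": float,
    "varlen": bool,
}


def check_simple_arguments(kwargs):
    """Return a list of type problems (hypothetical helper, not part of the module)."""
    problems = []
    for name, value in kwargs.items():
        expected = DOCUMENTED_TYPES.get(name)
        if expected is None:
            continue  # parameter is not covered by this simple check
        if expected is float:
            # Accept ints where a float is documented, but never bools.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected is int:
            # bool is a subclass of int, so exclude it explicitly.
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Illustrative values only; the last four repeat the defaults from the signature.
example_kwargs = {
    "seed": 42,
    "block_size": 2048,
    "content_field": "text",
    "fim_rate": 0.0,
    "fim_spm_rate": 0.0,
    "micro_batch_size": 4,
    "dataset_name": "__auto__",
    "keep_in_memory": False,
    "bos_rate": 1.0,
    "varlen": True,
}

print(check_simple_arguments(example_kwargs))
```

Once the check returns an empty list, the dictionary could be unpacked into the call together with the accelerator, tokenizer, and dataset arguments.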