Datamodule utils
float_or_int_or_none(value)
Converts a given value into a float, int, or None.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value | Union[str, float, int, None] | A value that can be either a string, float, int, or None. | required |
Returns:
Type | Description |
---|---|
Union[float, int, None] | A float, int, or None based on the input value. |
If the input value is None or "None", it returns None. If the input value is an int or float, it returns the same value. If the input value is a string, it tries to convert it into an int if possible, otherwise into a float.
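The conversion rules described above can be illustrated with a minimal sketch. This is not the library's source: `float_or_int_or_none_sketch` is a hypothetical stand-in written only from the documented behavior, and it assumes that a string which parses as neither an int nor a float raises `ValueError`.

```python
from typing import Union


def float_or_int_or_none_sketch(
    value: Union[str, float, int, None],
) -> Union[float, int, None]:
    """Hypothetical re-creation of the documented rules (not the bionemo source)."""
    # None and the literal string "None" both map to None.
    if value is None or value == "None":
        return None
    # Numeric inputs are returned unchanged.
    if isinstance(value, (int, float)):
        return value
    # Strings: try int first, then fall back to float.
    try:
        return int(value)
    except ValueError:
        return float(value)


print(float_or_int_or_none_sketch("None"))  # None
print(float_or_int_or_none_sketch("3"))     # 3
print(float_or_int_or_none_sketch("0.5"))   # 0.5
```

A helper of this shape is useful when a value such as `limit_batches` arrives from the command line as a string but must be treated as a count (int) or a fraction (float).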
Source code in bionemo/llm/utils/datamodule_utils.py
infer_global_batch_size(micro_batch_size, num_nodes, devices, accumulate_grad_batches=1, tensor_model_parallel_size=1, pipeline_model_parallel_size=1)
Infers the global batch size from the micro batch size, the number of nodes and devices, the number of gradient accumulation batches, and the tensor and pipeline model parallel sizes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
micro_batch_size | int | The micro batch size. | required |
num_nodes | int | The number of nodes. | required |
devices | int | The number of devices. | required |
accumulate_grad_batches | int | The number of batches over which gradients are accumulated. Defaults to 1. | 1 |
tensor_model_parallel_size | int | The tensor model parallel size. Defaults to 1. | 1 |
pipeline_model_parallel_size | int | The pipeline model parallel size. Defaults to 1. | 1 |
Returns:
Type | Description |
---|---|
int | The global batch size. |
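
The page does not spell out the arithmetic, so the following is a sketch under stated assumptions rather than the library's implementation. It assumes that `devices` is the device count per node, that the data parallel size is the world size divided by the product of the tensor and pipeline parallel sizes (the usual Megatron-style relation), and that an indivisible world size is an error. The name `infer_global_batch_size_sketch` is hypothetical.

```python
def infer_global_batch_size_sketch(
    micro_batch_size: int,
    num_nodes: int,
    devices: int,
    accumulate_grad_batches: int = 1,
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
) -> int:
    """Hypothetical sketch; the relation used here is an assumption."""
    # Assumption: `devices` counts devices per node.
    world_size = num_nodes * devices
    model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size
    # Assumption: the world size must split evenly into model parallel groups.
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"model parallel size {model_parallel_size}"
        )
    data_parallel_size = world_size // model_parallel_size
    # Each data parallel replica sees micro_batch_size samples per step,
    # repeated accumulate_grad_batches times before an optimizer update.
    return micro_batch_size * data_parallel_size * accumulate_grad_batches


# 2 nodes x 8 devices = 16 ranks; TP=2 and PP=2 leave 4 data parallel replicas.
print(infer_global_batch_size_sketch(4, 2, 8, accumulate_grad_batches=2,
                                     tensor_model_parallel_size=2,
                                     pipeline_model_parallel_size=2))  # 32
```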
Source code in bionemo/llm/utils/datamodule_utils.py
infer_num_samples(limit_batches, num_samples_in_dataset, global_batch_size, stage)
Infers the number of samples based on the limit_batches parameter, the length of the dataset, and the global batch size.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
limit_batches | Union[float, int, str, None] | The limit on the number of batches. Can be a float between 0 and 1, an integer, a string, or None. If None, defaults to 1.0. | required |
num_samples_in_dataset | int | The number of samples in the dataset. | required |
global_batch_size | int | The global batch size. | required |
stage | str | The stage of the training. | required |
Returns:
Type | Description |
---|---|
int | The number of samples after applying the limit. |
Raises:
Type | Description |
---|---|
ValueError | If the limited number of samples is less than the global batch size, or if the limit_batches parameter is invalid. |
If limit_batches is a float between 0 and 1, the number of samples is inferred as a fraction of the number of samples in the dataset. If limit_batches is an integer greater than or equal to 1, the number of limited samples is inferred as the product of limit_batches and global batch size. If limit_batches is None, it defaults to 1.0, indicating that all dataset samples should be used.
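The three cases above can be sketched as follows. This is a hypothetical re-creation from the description, not the library's code. It assumes that the fractional range is 0 < limit_batches <= 1.0, that the fractional sample count is truncated toward zero, that string inputs are parsed as an int first and a float otherwise, and that `stage` is only used to label error messages.

```python
from typing import Union


def infer_num_samples_sketch(
    limit_batches: Union[float, int, str, None],
    num_samples_in_dataset: int,
    global_batch_size: int,
    stage: str,
) -> int:
    """Hypothetical sketch of the documented rules (not the bionemo source)."""
    # None means "use the whole dataset".
    if limit_batches is None:
        limit_batches = 1.0
    # Assumption: strings are parsed as int if possible, otherwise as float.
    if isinstance(limit_batches, str):
        try:
            limit_batches = int(limit_batches)
        except ValueError:
            limit_batches = float(limit_batches)

    if isinstance(limit_batches, float) and 0 < limit_batches <= 1.0:
        # Fraction of the dataset (assumption: truncated toward zero).
        num_limited = int(num_samples_in_dataset * limit_batches)
        if num_limited < global_batch_size:
            raise ValueError(
                f"{stage}: {num_limited} samples is less than "
                f"the global batch size {global_batch_size}"
            )
        return num_limited
    if isinstance(limit_batches, int) and limit_batches >= 1:
        # Whole number of batches, each of global_batch_size samples.
        return limit_batches * global_batch_size
    raise ValueError(f"{stage}: invalid limit_batches value {limit_batches!r}")


print(infer_num_samples_sketch(0.5, 1000, 32, "val"))     # 500
print(infer_num_samples_sketch(10, 1000, 32, "val"))      # 320
print(infer_num_samples_sketch(None, 1000, 32, "train"))  # 1000
```

Under these assumptions, a fraction such as 0.01 of a 1000-sample dataset with a global batch size of 32 raises `ValueError`, because 10 samples cannot fill a single global batch.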
Source code in bionemo/llm/utils/datamodule_utils.py
parse_kwargs_to_arglist(kwargs)
Converts a dictionary of keyword arguments into a list of command-line arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kwargs | Dict[str, Any] | A dictionary where keys are argument names and values are argument values. | required |
Returns:
Type | Description |
---|---|
List[str] | A list of strings, where each string is a command-line argument in the format '--argument-name value'. |
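
As an illustration of the idea, here is a minimal sketch. It is not the library's implementation, and the exact output layout is an assumption: the sketch emits the flag and its value as separate list items (the form `argparse` accepts directly) and replaces underscores in keys with hyphens. The name `parse_kwargs_to_arglist_sketch` and the example flags are hypothetical.

```python
import argparse
from typing import Any, Dict, List


def parse_kwargs_to_arglist_sketch(kwargs: Dict[str, Any]) -> List[str]:
    """Hypothetical sketch: turn {"some_key": 1} into ["--some-key", "1"]."""
    arglist: List[str] = []
    for name, value in kwargs.items():
        # Assumption: underscores become hyphens, values are stringified,
        # and flag and value are kept as separate tokens.
        arglist.extend([f"--{name.replace('_', '-')}", str(value)])
    return arglist


arglist = parse_kwargs_to_arglist_sketch(
    {"micro_batch_size": 4, "limit_val_batches": 0.5}
)
print(arglist)  # ['--micro-batch-size', '4', '--limit-val-batches', '0.5']

# Round trip through argparse to show the list is usable as argv.
parser = argparse.ArgumentParser()
parser.add_argument("--micro-batch-size", type=int)
parser.add_argument("--limit-val-batches", type=float)
args = parser.parse_args(arglist)
print(args.micro_batch_size, args.limit_val_batches)  # 4 0.5
```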
Source code in bionemo/llm/utils/datamodule_utils.py