Utils
```python
pickles_to_tars(
    dir_input,
    input_prefix_subset,
    input_suffix,
    dir_output,
    output_prefix,
    func_output_data=lambda prefix, suffix_to_data: {"__key__": prefix, None: suffix_to_data},
    min_num_shards=None,
)
```

Convert a subset of pickle files from a directory to WebDataset tar files.
Input path and name pattern for sample 0:

- `f"{dir_input}/{input_prefix_subset[0]}.{input_suffix[0]}"`
- `f"{dir_input}/{input_prefix_subset[0]}.{input_suffix[1]}"`

Input path and name pattern for sample 1:

- `f"{dir_input}/{input_prefix_subset[1]}.{input_suffix[0]}"`
- `f"{dir_input}/{input_prefix_subset[1]}.{input_suffix[1]}"`

...

Output path and name pattern: `f"{dir_output}/{output_prefix}-%06d.tar"`
The WebDataset tar archive is specified by the dictionary:

```
{
    "__key__": sample_filename_prefix,
    sample_filename_suffix_1: data_1,
    sample_filename_suffix_2: data_2,
    ...
}
```

so that parsing the tar archive is equivalent to reading `{sample_filename_prefix}.{sample_filename_suffix_1}`, etc.
Here, each sample's data gets its name prefix from one element of `input_prefix_subset` and its name suffixes from the list `input_suffix`. Per the WebDataset file format specification, the sample filename prefix cannot contain dots (`.`), so this function removes them for the user by calling `.replace(".", "-")` on the elements of `input_prefix_subset`.
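To illustrate the `func_output_data` contract, the sketch below shows a custom callable that merges the suffix-to-data mapping directly into the sample dictionary. The function name `my_func_output_data` is hypothetical; the only documented requirement is that the callable accept the name prefix and a suffix-to-data dict and return a WebDataset sample dict.

```python
# Hypothetical custom func_output_data callable: receives the (already
# dot-sanitized) sample name prefix and a dict mapping each suffix in
# input_suffix to its unpickled data object, and returns a WebDataset
# sample dict keyed by "__key__" plus one entry per file suffix.
def my_func_output_data(prefix, suffix_to_data):
    sample = {"__key__": prefix}
    sample.update(suffix_to_data)  # one tar archive member per suffix
    return sample

print(my_func_output_data("sample-0", {"tensor.pickle": [1, 2, 3]}))
# → {'__key__': 'sample-0', 'tensor.pickle': [1, 2, 3]}
```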
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dir_input` | `str` | Input directory | required |
| `input_prefix_subset` | `List[str]` | Input subset of pickle files' prefixes | required |
| `input_suffix` | `Union[str, Iterable[str]]` | Input pickle file name suffixes, each for one type of data object, for all the samples | required |
| `dir_output` | `str` | Output directory | required |
| `output_prefix` | `str` | Output tar file name prefix | required |
| `func_output_data` | `Callable[[str, Dict[str, Any]], Dict[str, Any]]` | Function that maps the name prefix, name suffixes, and data objects to a WebDataset tar archive dictionary. Refer to the WebDataset GitHub repo for the archive file format specification. | `lambda prefix, suffix_to_data: {'__key__': prefix, None: suffix_to_data}` |
| `min_num_shards` | | Create at least this number of tar files. WebDataset has bugs when reading a small number of tar files in a multi-node Lightning + DDP setting, so this option can be used to guarantee the tar file count. | `None` |
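To make the naming scheme concrete, the following self-contained sketch builds one shard by hand with the standard library, mimicking the layout described above. The directory names, prefixes, and suffixes are illustrative assumptions; `pickles_to_tars` itself handles sharding and serialization for you.

```python
import os
import pickle
import tarfile
import tempfile

# Illustrative inputs (not real data): note the dots in the prefixes,
# which the docstring says get replaced by '-' in the sample keys.
dir_input = tempfile.mkdtemp()
dir_output = tempfile.mkdtemp()
input_prefix_subset = ["sample.0", "sample.1"]
input_suffix = ["tensor.pickle", "meta.pickle"]

# Create the input pickle files: {dir_input}/{prefix}.{suffix}
for prefix in input_prefix_subset:
    for suffix in input_suffix:
        with open(os.path.join(dir_input, f"{prefix}.{suffix}"), "wb") as f:
            pickle.dump({"prefix": prefix, "suffix": suffix}, f)

# Write one shard by hand, following the WebDataset member-naming
# convention: each tar member is named f"{key}.{suffix}".
tar_path = os.path.join(dir_output, "myshards-000000.tar")
with tarfile.open(tar_path, "w") as tar:
    for prefix in input_prefix_subset:
        key = prefix.replace(".", "-")  # WebDataset keys must not contain dots
        for suffix in input_suffix:
            tar.add(
                os.path.join(dir_input, f"{prefix}.{suffix}"),
                arcname=f"{key}.{suffix}",
            )

with tarfile.open(tar_path) as tar:
    names = sorted(tar.getnames())
print(names)
# → ['sample-0.meta.pickle', 'sample-0.tensor.pickle',
#    'sample-1.meta.pickle', 'sample-1.tensor.pickle']
```

Parsing this tar archive with WebDataset would then yield two samples, keyed `sample-0` and `sample-1`, each carrying a `tensor.pickle` and a `meta.pickle` entry.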
Source code in bionemo/webdatamodule/utils.py