Preprocess
GeneformerPreprocess
Source code in bionemo/geneformer/data/singlecell/preprocess.py
__init__(download_directory, medians_file_path, tokenizer_vocab_path)
Downloads HGNC symbols
preproc_dir (str): Directory to store the reference preproc in.
tokenizer_vocab_path (str): Filepath to store the tokenizer vocab.
dataset_conf (OmegaConf): Has 'train', 'val', 'test' keys containing the names of preprocessed train/val/test files to use for training.
build_and_save_tokenizer(median_dict, gene_to_ens, vocab_output_name)
Builds the GeneTokenizer using the median dictionary then serializes and saves the dictionary to disk.
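As a rough illustration of the "build a vocabulary from the median dictionary, then serialize it" idea, the sketch below uses only the standard library. It is not the BioNeMo implementation: the function name `build_and_save_vocab_sketch`, the special tokens, the sorted ordering, and the JSON layout are all assumptions made for this example.

```python
import json
from pathlib import Path

# Assumed special tokens, for illustration only; the real GeneTokenizer may differ.
SPECIAL_TOKENS = ["[PAD]", "[MASK]"]


def build_and_save_vocab_sketch(median_dict, gene_to_ens, vocab_output_path):
    """Hypothetical stand-in: build a token vocabulary and write it to disk as JSON."""
    # Reserve the lowest ids for the special tokens.
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    # Give every gene id in the median dictionary the next free id (sorted for determinism).
    for ens_id in sorted(median_dict):
        vocab[ens_id] = len(vocab)
    # Store the vocabulary together with the gene-name-to-id mapping.
    payload = {"vocab": vocab, "gene_to_ens": gene_to_ens}
    path = Path(vocab_output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return vocab
```

Calling it with a small median dictionary, for example `{"ENSG2": 1.0, "ENSG1": 2.0}`, assigns ids after the special tokens and leaves a JSON file at the given path that can be reloaded with `json.loads`.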
preprocess()
Runs preprocessing for the Geneformer model.
GeneformerResourcePreprocessor
dataclass
Bases: ResourcePreprocessor
ResourcePreprocessor for the Geneformer model. Downloads the gene_name_id_dict.pkl and gene_median_dictionary.pkl files.
prepare_resource(resource)
Logs and downloads the passed resource.
resource: RemoteResource - Resource to be prepared.
Returns: The absolute destination path for the downloaded resource.
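The "log, download, return the absolute destination path" contract can be sketched with the standard library alone. This is a hedged illustration, not the BioNeMo code: `ExampleResource` is a hypothetical stand-in for `RemoteResource`, and its fields (`url`, `dest_directory`, `dest_filename`) as well as the function name `prepare_resource_sketch` are assumptions.

```python
import logging
import urllib.request
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExampleResource:
    """Hypothetical stand-in for RemoteResource; field names are assumptions."""

    url: str
    dest_directory: str
    dest_filename: str


def prepare_resource_sketch(resource: ExampleResource) -> str:
    """Log the download, fetch the resource, and return its absolute destination path."""
    dest_dir = Path(resource.dest_directory)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # resolve() makes the destination absolute, matching the documented return value.
    dest = (dest_dir / resource.dest_filename).resolve()
    logging.info("Downloading %s to %s", resource.url, dest)
    urllib.request.urlretrieve(resource.url, str(dest))
    return str(dest)
```

Because `urllib.request.urlretrieve` also accepts `file://` URLs, the sketch can be tried offline by pointing `url` at a local file.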