Evo2 Data Preparation
Data Preprocessing
To streamline the process of preparing and building datasets for training Evo2 on DNA sequences, we provide a configurable preprocessing script (preprocess.py) that can preprocess and tokenize a collection of .fasta files and convert them into a Megatron-compatible IndexedDataset:

preprocess_evo2 -c <CONFIG_PATH>

Alternatively, if you have a local copy of bionemo-evo2, then you can run the script directly:

python sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py -c <CONFIG_PATH>
Configuration YAML parameters for the script can be found in utils/config.py:
class Evo2PreprocessingConfig(BaseModel):
    """Pydantic model class specifying the configuration schema for a preprocessed IndexedDataset (.bin, .idx)."""

    # Collection of FASTA files to preprocess and wrap into a single IndexedDataset.
    datapaths: list[Path] = []
    # Output directory for the preprocessed dataset .bin/.idx.
    output_dir: None | Path = None
    # Output file prefix for identifying your datasets.
    output_prefix: None | str = None
    # Random Sequence-Level Data Split
    train_split: float = 0.7
    valid_split: float = 0.2
    test_split: float = 0.1
    # Overwrite existing binaries. Otherwise, skip already preprocessed datasets.
    overwrite: bool = False
    # Raw Preprocessing Transforms
    # For every sequence, include a reverse-complemented copy of that sequence in the dataset. Doubles the size of the dataset.
    embed_reverse_complement: bool = False
    # For every sequence, randomly reverse complement the sequence with the specified probability instead of using the original sequence.
    random_reverse_complement: float = 0.0
    # For sequences associated with taxonomic lineages specified in `taxonomy_data`, randomly drop out nodes of the lineage with the specified probability. For instance: |d__KINGDOM;p__None;c__CLASS;o__None;f__None;g__None;s__None|
    random_lineage_dropout: float = 0.0
    # Transcribe (DNA -> RNA) or Back-Transcribe (RNA -> DNA) the sequence before tokenization.
    transcribe: None | Literal["transcribe", "back_transcribe"] = None
    # Force upper-case alphabetical characters in the `.fasta` sequences.
    force_uppercase: bool = False
    # Data type of the IndexedDataset. When using the byte-level tokenizer, uint8 is more than sufficient with a vocabulary size of 255 for ASCII.
    indexed_dataset_dtype: str = "uint8"
    # Tokenization Transforms
    # Append an end-of-document token to the end of each sequence.
    append_eod: bool = False
    # Enforce a fixed sample length by padding shorter sequences and raising an exception when the length is exceeded.
    enforce_sample_length: None | int = None
    # Run ftfy on the sequence characters prior to tokenization to fix encoding issues.
    ftfy: bool = False
    # Tokenizer
    tokenizer_type: Literal[
        "Byte-Level",
        "HuggingFace",
        "SentencePiece",
        "Regex",
        "Megatron",
        "Tiktoken",
    ] = "Byte-Level"  # Recommended for DNA / RNA sequences. The other tokenizers are untested and only supported here for experimentation!
    # For more information on the behavior of the following parameters, refer to NeMo:
    # https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/tokenizer_utils.py
    vocab_file: None | Path = None
    vocab_size: None | int = 512
    merges_file: None | Path = None
    tokenizer_model_name: None | str = None
    pretrained_tokenizer_model: None | str = None
    special_tokens: None | dict[str, str] = {}
    fast_hf_tokenizer: bool = False
    # Compute Configuration
    # NOTE: If preprocessing a large number of short individual sequences (< 1000 bp), do NOT use
    # multiprocessing (workers > 1) because sequence-level parallel IPC will dominate the preprocessing time!
    workers: int = 1
    # Number of sequences to load into memory at any given time during preprocessing.
    # Prevents OOM during sequence-parallel preprocessing.
    preproc_concurrency: int = 100000
    chunksize: int = 1
    # Data Filters
    drop_empty_sequences: bool = False
    # If `NNN` is detected in the sequence, drop it from the preprocessed dataset.
    nnn_filter: bool = False
    # RNG
    seed: None | int = None
    # Evo2 Taxonomic Lineage Tags
    # SeqID sub-string indexing: a sequence with ID "ABC" will pick up taxonomy data keyed by "A".
    taxonomy_data: dict[str, Evo2TaxonomyLineage] = {}
    # Periodicity of injecting phylogenetic lineage tags into the sequence prior to tokenization.
    prompt_spacer_length: int = 131072
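As a quick sanity check, the sketch below loads a preprocessing YAML and validates it against this schema with Pydantic. The import path (bionemo.evo2.utils.config) and the assumption that the YAML holds a list of config mappings (as in the example later in this section) are illustrative; in normal use the preprocess_evo2 entry point handles this loading for you.

# Sketch: validate a preprocessing YAML against Evo2PreprocessingConfig.
# NOTE: the import path is an assumption based on the utils/config.py reference above.
import yaml

from bionemo.evo2.utils.config import Evo2PreprocessingConfig

with open("mmseqs_promotors_config.yaml") as f:  # hypothetical local file name
    raw = yaml.safe_load(f)

# The example config later in this section is a YAML list of mappings, so handle both shapes.
entries = raw if isinstance(raw, list) else [raw]
configs = [Evo2PreprocessingConfig(**entry) for entry in entries]
for cfg in configs:
    print(cfg.output_prefix, cfg.tokenizer_type, cfg.train_split)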
Furthermore, the taxonomy_data field contains a map from sequence ID substrings to phylogenetic lineage data of the form:
class Evo2TaxonomyLineage(BaseModel):
    """Pydantic model class that defines the source lineage of a DNA sequence."""

    kingdom: None | str = None
    phylum: None | str = None
    clazz: None | str = None
    order: None | str = None
    family: None | str = None
    genus: None | str = None
    species: None | str = None
# (Example) Escherichia coli
|d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli|ATCGTACGTACATCTCTA...
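To make this concrete, a hypothetical entry that attaches the Escherichia coli lineage above to every sequence whose FASTA ID contains the substring "NC_000913" might look like the sketch below (the substring key and the import path are illustrative assumptions):

# Sketch: keying taxonomy_data by a sequence-ID substring.
# "NC_000913" is a hypothetical key; per the sub-string indexing rule above, any
# sequence whose ID contains it would receive this lineage during preprocessing.
from bionemo.evo2.utils.config import Evo2PreprocessingConfig, Evo2TaxonomyLineage

ecoli = Evo2TaxonomyLineage(
    kingdom="Bacteria",
    phylum="Pseudomonadota",
    clazz="Gammaproteobacteria",
    order="Enterobacterales",
    family="Enterobacteriaceae",
    genus="Escherichia",
    species="Escherichia coli",
)
config = Evo2PreprocessingConfig(taxonomy_data={"NC_000913": ecoli})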
Testing
To test equivalence with the reference implementation, we first downloaded a source-of-truth preprocessed Megatron IndexedDataset containing promoter data:
$ ls -lah
-rwxr-xr-x 1 bionemo bionemo 1.2M Dec 4 00:56 data_promoters_test_text_CharLevelTokenizer_document.bin
-rwxr-xr-x 1 bionemo bionemo 20K Dec 4 00:56 data_promoters_test_text_CharLevelTokenizer_document.idx
-rwxr-xr-x 1 bionemo bionemo 392M Dec 4 00:56 data_promoters_train_text_CharLevelTokenizer_document.bin
-rwxr-xr-x 1 bionemo bionemo 6.6M Dec 4 00:56 data_promoters_train_text_CharLevelTokenizer_document.idx
-rwxr-xr-x 1 bionemo bionemo 1.2M Dec 4 00:56 data_promoters_valid_text_CharLevelTokenizer_document.bin
-rwxr-xr-x 1 bionemo bionemo 20K Dec 4 00:56 data_promoters_valid_text_CharLevelTokenizer_document.idx
Next, we acquired the .fasta file that was used to generate this dataset, and configured our script to preprocess the sequence data into an equivalent Megatron IndexedDataset.
# mmseqs_promotors_config.yaml
- datapaths: ["/workspace/bionemo2/data/mmseqs_results_rep_seq_distinct.fasta"]
  output_dir: "/workspace/bionemo2/data"
  output_prefix: promoters_uint8_distinct
  train_split: 1.0  # We're just going to dump everything into a single file and compare against the union of the 3 splits in the SoT.
  valid_split: 0.0
  test_split: 0.0
  overwrite: True
  embed_reverse_complement: true
  random_reverse_complement: 0.0
  random_lineage_dropout: 0.0
  include_sequence_id: false
  transcribe: "back_transcribe"
  force_uppercase: true
  indexed_dataset_dtype: "uint8"
  tokenizer_type: "Byte-Level"
  vocab_file: null
  vocab_size: null
  merges_file: null
  pretrained_tokenizer_model: null
  special_tokens: null
  fast_hf_tokenizer: true
  append_eod: true
  enforce_sample_length: null
  ftfy: false
  workers: 1
  preproc_concurrency: 100000
  chunksize: 25
  drop_empty_sequences: true
  nnn_filter: true
  seed: null  # Not relevant because we are not using random reverse complement or lineage dropout.
We then ran the preprocessing script with the following command:
$ python preprocess.py -c mmseqs_promotors_config.yaml
To check equivalence of the two preprocessed datasets, we verify that our processed dataset contains the same elements as the original, but we do not enforce ordering of the data. (bionemo-noodles does not read the .fasta file sequentially.)
>>> from megatron.core.datasets.indexed_dataset import IndexedDataset
>>> ds_train_ref = IndexedDataset("./data_promoters_train_text_CharLevelTokenizer_document")
>>> ds_val_ref = IndexedDataset("./data_promoters_valid_text_CharLevelTokenizer_document")
>>> ds_test_ref = IndexedDataset("./data_promoters_test_text_CharLevelTokenizer_document")
>>> ds_train_ours = IndexedDataset("./promoters_uint8_distinct_byte-level_train")
>>> len(ds_train_ours) == len(ds_train_ref) + len(ds_test_ref) + len(ds_val_ref)
True
>>> # Example of what one of these set elements looks like: it's just a string representation of the token list for an
>>> # element of the training dataset. We can then compare all of these to make sure that the two datasets have the
>>> # same set of samples.
>>> ','.join([str(t) for t in ds_train_ref[0]])
'67,84,71,71,65,71,67,67,84,71,65,67,67,65,84,65,65,71,84,65,71,84,71,71,67,84,65,84,65,65,67,71,65,71,71,65,65,71,65,65,71,65,84,71,65,65,71,65,71,65,84,84,65,71,65,71,65,65,65,65,84,71,65,65,84,71,84,84,67,84,84,71,65,65,71,84,65,71,67,67,65,84,84,71,84,84,71,84,65,71,84,84,71,84,84,71,84,71,84,71,84,71,84,65,84,71,84,84,71,65,71,65,84,71,84,84,84,84,71,71,71,71,84,84,84,71,84,84,65,84,65,84,65,71,65,71,65,71,65,71,65,84,71,84,65,71,84,84,84,71,71,84,71,65,65,71,65,71,84,65,71,71,65,84,84,67,84,67,84,84,65,67,84,65,71,84,71,84,71,65,65,71,65,84,84,65,84,84,65,67,84,65,71,71,84,65,65,67,84,65,65,65,84,71,65,71,65,84,84,67,84,65,84,67,65,65,67,84,65,65,71,84,67,65,84,84,65,71,65,71,65,84,84,71,71,65,65,65,84,71,84,84,84,67,84,84,84,84,65,71,71,84,84,84,65,65,84,65,65,65,71,84,84,84,71,84,84,84,71,65,65,84,84,71,65,71,65,65,65,71,65,71,65,71,65,71,71,65,71,65,71,65,67,65,84,84,71,67,84,84,84,71,65,65,71,71,71,65,71,65,71,84,84,84,71,71,71,84,71,71,71,84,71,65,71,71,65,84,84,71,65,65,65,65,84,71,65,65,65,65,65,84,71,65,65,67,84,71,65,65,65,65,65,71,71,84,71,84,84,65,84,65,71,84,71,65,67,67,84,71,84,67,65,65,65,65,65,65,71,67,84,71,84,71,65,65,71,65,65,71,84,71,84,84,65,84,67,67,65,65,71,65,65,65,84,65,84,71,71,65,84,84,71,67,84,65,65,84,67,65,84,65,67,84,65,67,84,71,84,84,67,65,84,84,65,84,71,65,84,84,84,84,65,84,71,84,71,84,67,65,84,71,84,71,84,71,84,71,67,67,84,65,84,67,65,84,67,65,84,84,67,67,84,84,65,84,65,84,84,84,84,65,71,84,84,71,71,67,65,65,65,65,65,65,65,65,65,65,65,71,65,67,84,84,71,71,65,65,71,84,65,84,84,71,65,65,65,65,67,67,65,65,65,84,67,84,71,65,84,67,84,67,65,65,67,67,84,65,71,65,67,65,65,71,84,67,71,65,84,84,65,65,65,71,67,84,65,65,65,67,67,71,65,65,65,65,67,67,71,65,65,84,67,67,67,71,65,67,67,71,71,84,84,65,65,84,84,71,65,65,65,65,67,67,71,65,84,67,67,65,0'
>>> # Create a set of all of these elements:
>>> all_ref_data = {','.join([str(t) for t in rec]) for ds in [ds_train_ref, ds_val_ref, ds_test_ref] for rec in ds}
>>> # Verify that there is no redundancy so we can do set equality safely
>>> len(all_ref_data) == len(ds_train_ours)
True
>>> len(all_ref_data)
343504
>>> all_our_data = {','.join([str(t) for t in rec]) for ds in [ds_train_ours] for rec in ds}
>>> len(all_our_data)
343504
>>> # Verify set equality to show that we have processed an identical dataset
>>> # (ignoring shuffling order and train/test/val splits)
>>> all_our_data == all_ref_data
True
Sequence Splicing & Stitching
Evo2 has also been trained on spliced DNA and mRNA sequences, where introns are removed, leaving only the concatenated exons. In addition, "stitched" variants of spliced transcripts, which include 1024 bp of promoter sequence and 32 bp of flanking intron sequence around each exon, were introduced into Evo2's training dataset.
To perform splicing or "stitched" splicing on sequences in a FASTA file given an associated gene transfer format (GTF) file, use the splice_evo2 command:
$ splice_evo2 --help
usage: splice_evo2 [-h] --fasta-path FASTA_PATH --gtf-path GTF_PATH [--output-path OUTPUT_PATH] [--transcript-type {default,stitched}] [--stitched-promoter STITCHED_PROMOTER] [--stitched-intron STITCHED_INTRON] [--stitched-overlap] [--only-longest-transcript] [-v]
Extract spliced transcripts from a FASTA and GTF.
options:
  -h, --help            show this help message and exit
  --fasta-path FASTA_PATH
                        Path to FASTA file to extract transcripts from.
  --gtf-path GTF_PATH   Path to gene transfer format (GTF) file associated with the FASTA.
  --output-path OUTPUT_PATH
                        Path to output FASTA file.
  --transcript-type {default,stitched}
                        Type of transcript to extract from the GTF and FASTA files for splicing. 'Stitched' transcripts include 1024 bp of sequence from the promoter and 32 bp around each exon.
  --stitched-promoter STITCHED_PROMOTER
                        Number of bp to include in the promoter region when --transcript-type=stitched is used. Defaults to 1024.
  --stitched-intron STITCHED_INTRON
                        Number of bp to include from neighboring introns when --transcript-type=stitched is used. Defaults to 32.
  --stitched-overlap    Allow overlap of neighboring intron windows when --transcript-type=stitched is used. Defaults to False, i.e. prevents overlap by shortening the intron windows for a contiguous splice.
  --only-longest-transcript
                        Only extract the longest transcript per gene.
  -v, --verbose         Turn on verbose log messages.
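For example, a hypothetical invocation that extracts stitched transcripts and keeps only the longest transcript per gene (file names below are placeholders) might look like:

$ splice_evo2 --fasta-path genome.fa --gtf-path annotations.gtf --output-path stitched_transcripts.fa --transcript-type stitched --only-longest-transcript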