Evo2 Recipe
A self-contained training, inference, and checkpoint conversion recipe for Evo2 genomic foundation models built on Megatron Bridge. This recipe supports the Evo2 (Striped Hyena) architecture through a unified training and inference CLI, along with import and export tools for use in other packages.
Evo2
Evo2 is a family of long-context genomic foundation models based on the Striped Hyena (SSM + Attention) architecture, developed by the Arc Institute. Evo2 models are trained on the OpenGenome2 dataset and scale from 1B to 40B parameters with context lengths up to 1M+ nucleotides. They achieve state-of-the-art performance on gene essentiality prediction, variant effect prediction, and de novo sequence generation across prokaryotic and eukaryotic genomes.
Installation
./.ci_build.sh # build the virtualenv
source ./.ci_test_env.sh # source the virtualenv
CLI tools
All CLI tools are defined in pyproject.toml under [project.scripts].
| Command | Description |
|---|---|
| train_evo2 | Train or fine-tune Hyena models |
| infer_evo2 | Autoregressive text generation (greedy/sampling) |
| predict_evo2 | Batch log-likelihood scoring on FASTA sequences |
| preprocess_evo2 | Convert FASTA files to Megatron indexed binary format |
| splice_evo2 | Extract spliced transcripts from FASTA + GTF files |
| evo2_convert_nemo2_to_mbridge | Convert NeMo2 checkpoints to MBridge DCP format |
| evo2_convert_savanna_to_mbridge | Convert Savanna checkpoints to MBridge DCP format |
| evo2_export_mbridge_to_vortex | Export MBridge checkpoint to Vortex .pt format |
| evo2_remove_optimizer | Strip optimizer state from an MBridge checkpoint |
| bionemo_fasta_to_jsonl | Convert FASTA files to JSONL format |
Run any tool with --help for full usage details.
Quick start
Training with mock data (Hyena)
torchrun --nproc-per-node 2 --no-python \
train_evo2 \
--hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
--model-size striped_hyena_1b_nv_parallel --max-steps 12 --eval-interval 10 \
--eval-iters 3 --mock-data \
--micro-batch-size 16 --global-batch-size 32 --seq-length 1024 \
--tensor-model-parallel 1 \
--use-precision-aware-optimizer --dataset-seed 33 \
--seed 41 --spike-no-more-embedding-init \
--no-weight-decay-embeddings --cross-entropy-loss-fusion \
--align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
--decay-steps 100 --warmup-steps 10 \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
--attention-dropout 0.001 --hidden-dropout 0.001 \
--eod-pad-in-loss-mask --enable-preemption \
--log-interval 5 --debug-ddp-parity-freq 10 \
--result-dir tmpfp8 --no-renormalize-loss \
--use-subquadratic-ops
Tip: The --use-subquadratic-ops flag enables fused subquadratic-ops CUDA kernels (b2b_causal_conv1d for proj+mixer fusion in prefill, fft_causal_conv1d / causal_conv1d inside engine.parallel_fir). It applies to training, batch prediction (predict_evo2), and the prefill phase of autoregressive inference (infer_evo2); per-token decode is already in optimal recurrent form and is unaffected.
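The batch-size flags in the training command above are tied together by the usual Megatron-style identity: global batch size = micro batch size × data-parallel size × gradient-accumulation steps. The sketch below works through that arithmetic for the quick-start values. The helper name is hypothetical and not part of train_evo2, and it assumes the data-parallel size is the world size divided by the product of the tensor, context, and pipeline parallel sizes.

```python
def grad_accumulation_steps(global_batch, micro_batch, world_size, tp=1, cp=1, pp=1):
    """Hypothetical helper: micro-batches each data-parallel rank runs per optimizer step."""
    model_parallel = tp * cp * pp
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by tp * cp * pp")
    data_parallel = world_size // model_parallel
    samples_per_pass = micro_batch * data_parallel
    if global_batch % samples_per_pass:
        raise ValueError("global_batch must be divisible by micro_batch * data_parallel")
    return global_batch // samples_per_pass


# Quick-start values: 2 GPUs, micro batch 16, global batch 32, tensor parallel 1.
steps = grad_accumulation_steps(global_batch=32, micro_batch=16, world_size=2, tp=1)
print(steps)
```

With these values each of the two data-parallel ranks processes one micro-batch of 16 sequences per step, so no gradient accumulation takes place.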
Autoregressive generation (infer_evo2)
Generate DNA sequences from a prompt using an MBridge checkpoint:
torchrun --nproc_per_node 1 --no-python \
infer_evo2 \
--ckpt-dir /path/to/mbridge/checkpoint \
--prompt "ATCGATCGATCGATCG" \
--max-new-tokens 200 \
--temperature 1.0 \
--output-file generated.txt
Options:
- --ckpt-dir: path to MBridge checkpoint directory (required).
- --prompt / --prompt-file: input sequence (inline or from file).
- --max-new-tokens: number of tokens to generate (default: 100).
- --temperature: sampling temperature (default: 1.0).
- --top-k / --top-p: top-k or nucleus sampling (0 = disabled).
- --tensor-parallel-size: tensor parallelism for large models (default: 1).
- --max-seq-length: maximum sequence length (default: 8192).
- --use-subquadratic-ops: use fused subquadratic-ops kernels for prefill (b2b causal conv, FFT/causal conv1d in parallel_fir). Recommended when processing many prompts in one process.
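To make the sampling options above concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and nucleus (top-p) filtering. It illustrates the general technique, not the infer_evo2 implementation. The function names and toy logits are hypothetical, and the convention that 0 disables a filter is taken from the option list.

```python
import math
import random


def filtered_probs(logits, temperature=1.0, top_k=0, top_p=0.0):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    # Rank tokens from most to least probable.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:  # 0 disables top-k
        ranked = ranked[:top_k]
    if top_p > 0.0:  # 0 disables nucleus sampling
        kept, cumulative = [], 0.0
        for prob, idx in ranked:
            kept.append((prob, idx))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    norm = sum(prob for prob, _ in ranked)
    return {idx: prob / norm for prob, idx in ranked}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


# Toy logits over four tokens (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]
greedy_like = filtered_probs(logits, top_k=1)  # only the most likely token survives
nucleus = filtered_probs(logits, top_p=0.8)    # smallest set covering 80% of the mass
```

Setting top_k=1 makes sampling behave like greedy decoding, while lowering the temperature sharpens the distribution before any filtering is applied.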
Batch sequence scoring (predict_evo2)
Compute log-likelihoods for sequences in a FASTA file:
torchrun --nproc_per_node 1 --no-python \
predict_evo2 \
--fasta /path/to/sequences.fasta \
--ckpt-dir /path/to/mbridge/checkpoint \
--output-dir predictions/ \
--micro-batch-size 4 \
--write-interval epoch \
--use-subquadratic-ops
Options:
- --fasta: input FASTA file (required).
- --ckpt-dir: MBridge checkpoint directory (required).
- --output-dir: directory for output prediction files.
- --output-log-prob-seqs: output log probabilities instead of raw logits.
- --log-prob-collapse-option: aggregation: sum, mean, or per_token.
- --embedding-layer: extract embeddings from a specific layer instead of logits (supports negative indexing, e.g., -1 for the last layer).
- --mask-phylogenetic-tags: mask phylogenetic tags in loss computation.
- --use-subquadratic-ops: enable fused Hyena convolution kernels for faster scoring (recommended for larger datasets; has a one-time compilation cost).
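The three aggregation choices for --log-prob-collapse-option can be pictured with a small pure-Python sketch. It assumes the logits are already aligned so that logits[i] is the prediction for token_ids[i]; the function names and toy values are hypothetical, and the exact output layout of predict_evo2 is not shown here.

```python
import math


def token_log_probs(logits, token_ids):
    """Log-probability of each observed token under a softmax over its logits row."""
    out = []
    for row, token in zip(logits, token_ids):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        out.append(row[token] - log_norm)
    return out


def collapse(per_token, option):
    """Aggregate per-token log-probabilities as sum, mean, or leave them per token."""
    if option == "sum":
        return sum(per_token)
    if option == "mean":
        return sum(per_token) / len(per_token)
    if option == "per_token":
        return list(per_token)
    raise ValueError(f"unknown collapse option: {option}")


# Three positions with uniform logits over a 4-token vocabulary (toy values).
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
token_ids = [0, 2, 3]
per_token = token_log_probs(logits, token_ids)
sequence_sum = collapse(per_token, "sum")
sequence_mean = collapse(per_token, "mean")
```

The sum grows with sequence length, so the mean is usually the more comparable score across sequences of different lengths, while per_token keeps position-level detail.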
Data preprocessing (preprocess_evo2)
Convert FASTA files into Megatron's indexed binary format for training:
preprocess_evo2 --config preprocess_config.yaml
The config YAML specifies input FASTA paths, output directory, train/val/test splits,
tokenizer settings, and preprocessing options. See the fine-tuning-tutorial.ipynb
notebook in examples/ for a complete example.
Transcript extraction (splice_evo2)
Extract spliced transcripts from a genome FASTA and GTF annotation:
splice_evo2 \
--fasta-path genome.fa \
--gtf-path annotations.gtf \
--output-path transcripts.fa \
--only-longest-transcript
Options:
- --transcript-type: default or stitched (stitched includes promoter + intron context).
- --stitched-promoter: bp to include from promoter region (default: 1024).
- --stitched-intron: bp from neighboring introns (default: 32).
- --only-longest-transcript: keep only the longest transcript per gene.
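As a rough picture of what extracting a spliced transcript involves, the sketch below concatenates exon intervals given in GTF convention (1-based, inclusive coordinates) and reverse-complements the result for minus-strand genes. It is a simplified illustration, not the splice_evo2 implementation: the function names and toy sequence are hypothetical, and the stitched mode with promoter and intron context is not covered.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_transcript(chrom_seq, exons, strand="+"):
    """Join exon sequences; exons are (start, end) pairs, 1-based and inclusive."""
    ordered = sorted(exons)  # genomic order, regardless of input order
    spliced = "".join(chrom_seq[start - 1:end] for start, end in ordered)
    return reverse_complement(spliced) if strand == "-" else spliced


# Toy chromosome and two exons (illustrative values only).
chrom = "AAACCCGGGTTTACGT"
exons = [(13, 16), (4, 6)]
plus = splice_transcript(chrom, exons, "+")
minus = splice_transcript(chrom, exons, "-")
print(plus, minus)
```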
Removing optimizer state from a checkpoint
Training checkpoints include optimizer state (Adam moments, LR scheduler, RNG state)
which roughly triples checkpoint size. Use evo2_remove_optimizer to produce a
smaller weights-only checkpoint suitable for release or fine-tuning:
evo2_remove_optimizer \
--src-ckpt-dir /path/to/training/checkpoints \
--dst-ckpt-dir /path/to/weights_only_checkpoint
The tool automatically finds the latest iter_* directory, strips optimizer and
scheduler state from the DCP files, and copies model weights, tokenizer, and
config files to the destination. The resulting checkpoint is directly usable
with --finetune-ckpt-dir or the export tools.
Fine-tuning from an existing checkpoint
From NeMo2 checkpoints (NGC)
Convert the checkpoint from NeMo2 format, then fine-tune:
CKPT_NAME=evo2/1b-8k-bf16:1.0
CKPT_OUT_DIR=evo2_1b_8k_bf16_mbridge
evo2_convert_nemo2_to_mbridge \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--tokenizer-path tokenizers/nucleotide_fast_tokenizer_512 \
--model-size evo2_1b_base \
--seq-length 8192 \
--nemo2-ckpt-dir $(download_bionemo_data $CKPT_NAME) \
--mbridge-ckpt-dir $CKPT_OUT_DIR
Good checkpoint names to try are:
- evo2/1b-8k-bf16:1.0 (model_size: evo2_1b_base)
- evo2/7b-1m:1.0 (model_size: evo2_7b)
- evo2/40b-1m-fp8-bf16:1.0 (model_size: evo2_40b)
Apart from the 7b version, these checkpoints were fine-tuned by the BioNeMo team to support both FP8 and BF16
precision. The 7b version worked well on both FP8 and BF16 out of the box, so it was not fine-tuned further. If you
want to use one of the FP8-sensitive checkpoints, such as evo2/40b-1m, be sure to add the --vortex-style-fp8
option to the checkpoint conversion step. Although 8k versions of the 7b and 40b checkpoints exist,
prefer the longer-context versions: they were trained further and still run on shorter inputs.
See download_bionemo_data --list-resources for other checkpoint options and a list of available
downloadable resources.
Now fine-tune with --finetune-ckpt-dir. If you have problems with
bf16_with_fp8_current_scaling_mixed try bf16_mixed.
torchrun --nproc-per-node 2 --no-python \
train_evo2 \
--hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_512 \
--model-size evo2_1b_base --max-steps 12 --eval-interval 10 \
--eval-iters 3 --mock-data \
--micro-batch-size 16 --global-batch-size 32 --seq-length 1024 \
--tensor-model-parallel 1 \
--use-precision-aware-optimizer --dataset-seed 33 \
--seed 41 \
--cross-entropy-loss-fusion \
--align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
--decay-steps 100 --warmup-steps 10 \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
--attention-dropout 0.001 --hidden-dropout 0.001 \
--eod-pad-in-loss-mask --enable-preemption \
--log-interval 5 --debug-ddp-parity-freq 10 \
--result-dir tmpfp8-ft-example --no-renormalize-loss \
--use-subquadratic-ops \
--finetune-ckpt-dir $CKPT_OUT_DIR
From Savanna checkpoints (HuggingFace)
ARC publishes Savanna-format checkpoints on HuggingFace for fine-tuning. Convert to MBridge format first:
evo2_convert_savanna_to_mbridge \
--savanna-ckpt-path arcinstitute/savanna_evo2_7b \
--mbridge-ckpt-dir evo2_7b_mbridge \
--model-size evo2_7b \
--tokenizer-path tokenizers/nucleotide_fast_tokenizer_512 \
--seq-length 1048576
The --savanna-ckpt-path accepts either a local .pt file path or a HuggingFace
repo ID (e.g., arcinstitute/savanna_evo2_1b_base). Available Savanna checkpoints include:
| HuggingFace Repo | Model Size |
|---|---|
| arcinstitute/savanna_evo2_1b_base | evo2_1b_base |
| arcinstitute/savanna_evo2_7b_base | evo2_7b_base |
| arcinstitute/savanna_evo2_7b | evo2_7b |
| arcinstitute/savanna_evo2_20b | evo2_20b |
| arcinstitute/savanna_evo2_40b_base | evo2_40b_base |
| arcinstitute/savanna_evo2_40b | evo2_40b |
Options:
- --no-te: disable Transformer Engine fused layernorm key mapping (use if the checkpoint was saved without TE).
- --mixed-precision-recipe: precision recipe (default: bf16_mixed). Note: for checkpoints sensitive to FP8 and Hopper, run with --mixed-precision-recipe bf16_mixed and also supply the --vortex-style-fp8 option for prediction/inference. Do not use the fp8 recipe for those models, because they are sensitive to the exact FP8 configuration they were trained with in Savanna. See the table in the section on models available in NGC.
- --verbose / -v: enable debug logging.
LoRA Fine-tuning
Evo2LoRA is a LoRA variant built on top of the Megatron Bridge PEFT stack. It
freezes the entire base model and attaches low-rank adapter matrices to the
modules you specify, with an optional escape hatch to keep selected modules
fully trainable.
End-to-end example: see examples/lora-fine-tuning-tutorial.ipynb for a runnable walkthrough that fine-tunes the 1B checkpoint for splice-site classification, including a head-only baseline for comparison.
Basic usage
Add --lora-finetune to any train_evo2 command alongside a checkpoint:
torchrun --nproc-per-node 2 --no-python \
train_evo2 \
--hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_512 \
--model-size evo2_1b_base --max-steps 500 --eval-interval 100 \
--eval-iters 3 --mock-data \
--micro-batch-size 4 --global-batch-size 8 --seq-length 1024 \
--mixed-precision-recipe bf16_mixed \
--result-dir lora_run \
--finetune-ckpt-dir $CKPT_OUT_DIR \
--lora-finetune \
--lora-dim 16 \
--lora-alpha 32 \
--lora-dropout 0.1 \
--lora-target-modules "dense_projection,linear_qkv,linear_proj,linear_fc1,linear_fc2"
LoRA configuration flags
| Flag | Default | Description |
|---|---|---|
| --lora-finetune | (absent) | Presence flag. Pass to enable LoRA fine-tuning; omit for standard fine-tuning. |
| --lora-dim | 16 | Rank r of the low-rank decomposition |
| --lora-alpha | 32 | Scaling factor α; effective scale = α/r |
| --lora-dropout | 0.1 | Dropout applied to the LoRA path |
| --lora-target-modules | see below | Comma-separated list of module short-names to attach LoRA adapters to |
| --lora-skip-freeze-modules | "" | Comma-separated list of module short-names to leave fully trainable (no LoRA, no freeze) |
Default --lora-target-modules: dense_projection,dense,linear_qkv,linear_proj,linear_fc1,linear_fc2
These cover the dense projection inside each Hyena mixer (dense_projection,
dense) and the four standard transformer MLP/attention projections
(linear_qkv, linear_proj, linear_fc1, linear_fc2).
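The effect of --lora-dim and --lora-alpha can be seen in a small pure-Python sketch of the standard LoRA update, y = W x + (α/r) · B(A x). This is the textbook formulation rather than the Evo2LoRA code; the function names, the toy matrices, and the 4096 hidden size used for the parameter count are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_scale(alpha, rank):
    """Effective scale applied to the low-rank path."""
    return alpha / rank


def lora_param_count(in_features, out_features, rank):
    """Adapter parameters: A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)


def lora_forward(x, W, A, B, alpha):
    """Frozen base projection plus the scaled low-rank correction."""
    rank = len(A)  # A has one row per rank dimension
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scale(alpha, rank)
    return [b + scale * u for b, u in zip(base, update)]


# Defaults from the table above: alpha 32, rank 16.
default_scale = lora_scale(32, 16)

# Toy rank-1 example: identity base weight, alpha 2 (so the scale is 2).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
y = lora_forward([1.0, 2.0], W, A, B, alpha=2)

# Trainable parameters for a hypothetical 4096 x 4096 projection at rank 16.
adapter_params = lora_param_count(4096, 4096, 16)
dense_params = 4096 * 4096
```

In this hypothetical projection the adapter holds 131,072 parameters against 16,777,216 in the dense weight, which is where the trainable-parameter savings come from.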
Module name matching
Both --lora-target-modules and --lora-skip-freeze-modules use the same
two-level matching syntax:
- Short name: matches any module whose immediate attribute name equals the pattern, regardless of depth (e.g. "mixer" matches model.layers.3.mixer).
- Wildcard path: if the pattern contains *, it is matched against the full dotted path using * as a substring wildcard (e.g. "*.layers.0.*.mixer" matches only layer 0).
A module that matches --lora-target-modules will have its base weights frozen
and LoRA adapter matrices attached. A module that matches
--lora-skip-freeze-modules is left entirely unfrozen — its full weight is
trainable — and no LoRA adapter is applied. If a module matches both lists,
Evo2LoRA raises a ValueError at startup.
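The matching rules above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the Evo2LoRA source: it uses fnmatchcase from the standard library as a stand-in for the wildcard matcher (which also gives ? and [] special meaning), and the function names and module paths are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern, module_path):
    """Short names match the last path component; patterns with * match the full path."""
    if "*" in pattern:
        return fnmatchcase(module_path, pattern)
    return module_path.rsplit(".", 1)[-1] == pattern


def classify(module_path, target_patterns, skip_freeze_patterns):
    """Return 'lora', 'trainable', or 'frozen' for one module path."""
    is_target = any(matches(p, module_path) for p in target_patterns)
    is_skipped = any(matches(p, module_path) for p in skip_freeze_patterns)
    if is_target and is_skipped:
        raise ValueError(f"{module_path} matches both target and skip-freeze lists")
    if is_target:
        return "lora"       # base weights frozen, adapter attached
    if is_skipped:
        return "trainable"  # left unfrozen, no adapter
    return "frozen"


# Hypothetical module paths for illustration.
targets = ["linear_fc1", "*.layers.0.*.mixer"]
skipped = ["dense"]
print(classify("model.layers.3.mlp.linear_fc1", targets, skipped))
print(classify("model.layers.0.block.mixer", targets, skipped))
print(classify("model.layers.1.block.mixer", targets, skipped))
```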
Weight tying and shared embeddings
Evo2 models default to share_embeddings_and_output_weights=True. Under this
setting, the vocabulary embedding table and the output projection share the
same weight tensor: embedding.word_embeddings.weight owns the data and
output_layer allocates no weight of its own (output_layer.weight is None).
The output layer receives the embedding weight as a runtime argument during the
forward pass.
This has direct consequences when you try to apply LoRA or control freezing on these layers.
Constraint on --lora-target-modules: word_embeddings is a
VocabParallelEmbedding and does not support LoRA adapters in Megatron Bridge.
Including it in --lora-target-modules always raises a ValueError, regardless
of share_embeddings_and_output_weights. output_layer is a
ColumnParallelLinear and does support LoRA, but only when
share_embeddings_and_output_weights=False; when weight tying is enabled
output_layer.weight is None and there is no independent weight tensor to
attach an adapter to.
Design principle for --lora-skip-freeze-modules: Evo2LoRA treats weight
tying as a contract that must be honoured in full. Any configuration that would
change the trainability of only one side of a tied pair is rejected with an error
rather than silently producing asymmetric behaviour.
--lora-target-modules and weight tying
| share_embeddings_and_output_weights | --lora-target-modules includes | Behavior |
|---|---|---|
| Either | word_embeddings (alone or combined with output_layer) | Error. VocabParallelEmbedding does not support LoRA adapters. |
| True | output_layer only | Error. output_layer.weight is None when weight tying is enabled. |
| False | output_layer only | Valid: LoRA adapter on the independent output projection. |
--lora-skip-freeze-modules and weight tying
| share_embeddings_and_output_weights | --lora-skip-freeze-modules includes | Behavior |
|---|---|---|
| False | word_embeddings only | Embedding weight is fully trainable. Output projection is frozen unless also listed. |
| False | output_layer only | Output projection weight is fully trainable. Embedding is frozen unless also listed. |
| False | both | Both weights are fully trainable. |
| True | word_embeddings only | Error. Listing only one side of a tied pair breaks the weight-tying invariant. Both must be listed together. |
| True | output_layer only | Error. Listing only one side of a tied pair breaks the weight-tying invariant. Both must be listed together. |
| True | both | Accepted. The shared weight (owned by word_embeddings) is unfrozen, so both the embedding lookup and the output projection train via the same tensor. Note: because output_layer allocates no weight of its own, gradient flow through the output projection path back to the shared tensor is a TODO item and may not be fully wired in all pipeline-parallel configurations. |
Recommendations
- Default (vocabulary weights frozen, LoRA on inner layers): omit both embedding/output modules from both flags. The default --lora-target-modules does not touch either layer.
- Apply LoRA to the output projection (untied models only): list output_layer in --lora-target-modules and set share_embeddings_and_output_weights=False in the model config.
- Fully fine-tune the vocabulary weight alongside LoRA on inner layers: list both word_embeddings and output_layer in --lora-skip-freeze-modules, i.e. --lora-skip-freeze-modules "word_embeddings,output_layer".
- Never put word_embeddings in --lora-target-modules: VocabParallelEmbedding does not support LoRA adapters and will raise a ValueError.
- Never list only one of the two tied layers in --lora-skip-freeze-modules when share_embeddings_and_output_weights=True: tied weights are always treated as a unit, and any asymmetric configuration will raise an error.
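The weight-tying rules in this section can be summarised as a small validation function. The sketch below encodes only the cases listed in the two tables, compares exact short names (no wildcard handling), and uses a hypothetical function name; it is not the check that Evo2LoRA runs.

```python
TIED_PAIR = {"word_embeddings", "output_layer"}


def validate_lora_modules(share_embeddings, target_modules, skip_freeze_modules):
    """Raise ValueError for configurations the tables above mark as errors."""
    targets = set(target_modules)
    skipped = set(skip_freeze_modules)
    if "word_embeddings" in targets:
        raise ValueError("word_embeddings does not support LoRA adapters")
    if share_embeddings and "output_layer" in targets:
        raise ValueError("output_layer has no weight of its own when weights are tied")
    if share_embeddings and len(TIED_PAIR & skipped) == 1:
        raise ValueError("tied layers must be listed together in skip-freeze modules")


# Accepted: tied weights, both sides unfrozen together.
validate_lora_modules(True, ["linear_qkv"], ["word_embeddings", "output_layer"])

# Accepted: untied weights, LoRA on the independent output projection.
validate_lora_modules(False, ["output_layer"], [])
```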
Running inference on a LoRA checkpoint
A LoRA training checkpoint contains only adapter tensors — the base model weights
are not duplicated. Point --ckpt-dir at the LoRA iter_* directory as usual:
torchrun --nproc_per_node 1 --no-python \
infer_evo2 \
--ckpt-dir /path/to/lora_run/checkpoints/ \
--prompt "ATCGATCGATCGATCG" \
--max-new-tokens 200
torchrun --nproc_per_node 1 --no-python \
predict_evo2 \
--fasta path/to/fasta/sequences \
--ckpt-dir /path/to/lora_run/checkpoints/ \
--output-dir ./predictions
When infer_evo2 / predict_evo2 detect a peft section in the checkpoint's
run_config.yaml, they:
- load dense base weights from checkpoint.pretrained_checkpoint (the same value that was supplied during LoRA training),
- apply the stored PEFT config (run_config["peft"]) to graft LoRALinear wrappers onto the base modules,
- load only the adapter tensors from --ckpt-dir.
No merge step is required. The base checkpoint referenced by
pretrained_checkpoint must still exist on disk at the path recorded in
run_config.yaml.
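Because the base checkpoint must still be present at the recorded path, it can be worth checking before launching inference. The sketch below assumes run_config.yaml has already been parsed into a dictionary (for example with a YAML library) and that the dotted name checkpoint.pretrained_checkpoint corresponds to nested keys; the function name is hypothetical.

```python
from pathlib import Path


def resolve_lora_base(run_config):
    """Return the base checkpoint path for a LoRA run, or None for a non-LoRA checkpoint."""
    if "peft" not in run_config:
        return None  # no peft section: treat as a regular checkpoint
    base = run_config.get("checkpoint", {}).get("pretrained_checkpoint")
    if not base:
        raise ValueError("peft section found but no pretrained_checkpoint recorded")
    base_path = Path(base)
    if not base_path.exists():
        raise FileNotFoundError(f"base checkpoint not found: {base_path}")
    return base_path


# Toy config standing in for a parsed run_config.yaml (values are illustrative).
example_config = {
    "peft": {"dim": 16, "alpha": 32},
    "checkpoint": {"pretrained_checkpoint": "."},
}
print(resolve_lora_base(example_config))
```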
Exporting to Vortex format
Vortex is ARC Institute's inference format for Evo2 Hyena models, used by the
evo2 inference repository. Export an MBridge
checkpoint to Vortex (.pt) using:
evo2_export_mbridge_to_vortex \
--mbridge-ckpt-dir /path/to/mbridge/iter_0000001 \
--output-path /path/to/output/model_vortex.pt \
--model-size evo2_1b_base
The exporter converts MBridge distributed-checkpoint weights into the single-file Vortex format expected by ARC's inference code. It handles MLP weight splitting, Hyena filter pole/residue computation, and layer-norm key remapping.
Options:
- --model-size: one of the evo2_* or striped_hyena_* Hyena model keys listed below.
- --no-te: disable Transformer Engine fused layernorm key mapping (use if the checkpoint was saved without TE).
- --verbose / -v: enable debug logging.
Savanna → MBridge → Vortex round-trip
If you have a Savanna checkpoint and want to produce a Vortex file, chain the two converters:
# Step 1: Savanna -> MBridge
evo2_convert_savanna_to_mbridge \
--savanna-ckpt-path arcinstitute/savanna_evo2_1b_base \
--mbridge-ckpt-dir /tmp/mbridge_1b \
--model-size evo2_1b_base \
--tokenizer-path tokenizers/nucleotide_fast_tokenizer_256
# Step 2: MBridge -> Vortex
evo2_export_mbridge_to_vortex \
--mbridge-ckpt-dir /tmp/mbridge_1b/iter_0000001 \
--output-path /tmp/evo2_1b_vortex.pt \
--model-size evo2_1b_base
Model naming convention
Model sizes are specified via --model-size and follow a naming convention that
disambiguates the model architecture, origin, and context length.
Hyena (SSM) models
| Key | Description |
|---|---|
| evo2_1b_base | ARC 1B, 8K context |
| evo2_7b_base | ARC 7B, 8K context |
| evo2_7b | ARC 7B, 1M context |
| evo2_40b_base | ARC 40B, 8K context |
| evo2_40b | ARC 40B, 1M context |
| striped_hyena_1b_nv | NVIDIA-modified 1B variant |
| striped_hyena_7b_nv | NVIDIA-modified 7B variant |
| striped_hyena_40b_nv | NVIDIA-modified 40B variant |
| striped_hyena_test | Tiny test model |
| striped_hyena_test_nv | Tiny test model (NV variant) |
| striped_hyena_1b_nv_parallel | NVIDIA 1B variant (parallel) |
Models prefixed with evo2_ match the public ARC checkpoints on
Hugging Face (e.g., arcinstitute/savanna_evo2_1b_base). The _base
suffix denotes the 8K-context variant; without it, the model uses the
long (1M) context length. Models prefixed with striped_hyena_ are
NVIDIA-modified variants that do not have a corresponding public ARC
checkpoint.
Examples
The examples/ directory contains Jupyter notebooks demonstrating common workflows:
| Notebook | Description |
|---|---|
| zeroshot_brca1.ipynb | Zero-shot BRCA1 variant effect prediction with Evo2 1B |
| fine-tuning-tutorial.ipynb | Fine-tune the 1B checkpoint on human chromosomes |
| lora-fine-tuning-tutorial.ipynb | LoRA fine-tune the 1B checkpoint for splice-site classification, with a head-only baseline for trainable-param savings |
Docker build
docker build -t evo2_megatron_recipe-$(git rev-parse --short HEAD) .
Performance and accuracy comparisons
Note: This section is largely a work in progress. It reflects the most recent information available, but may not match the current state of the code base at any given time.
Training accuracy convergence
We ran a 12-hour training run on 48 H100 GPUs to compare Megatron Bridge with NeMo2. FP8 current scaling converges to the bf16 curves by around step 5,000, and bf16 is comparable with NeMo2. Interestingly, in NeMo2 the bf16 and fp8 runs followed nearly identical trajectories for the first 5k steps as well. A typical training run performs over 100k steps, so different behavior in the first 5k steps is less worrisome if the endpoints are comparable.

Training performance comparisons
FP8 current scaling, which is supposed to have better convergence properties than delayed scaling, performs nearly as well as delayed scaling in MBridge. Even with multiple transformer layers left in bf16 precision, MBridge trains faster than fp8 delayed scaling in NeMo2.
| Evo2 1B Run | Seconds per step (lower is better) | Tokens/sec/GPU | Global Batch Size | Number of GPUs | Vocab Size |
|---|---|---|---|---|---|
| MBridge BF16 | 6.10 | 26,859 | 960 | 48 | 256 |
| MBridge FP8 (delayed) | 5.38 | 30,453 | 960 | 48 | 256 |
| MBridge FP8 (current) | 5.44 | 28,755 | 960 | 48 | 512 |
| MBridge FP8 (current first/last two layers bf16) | 5.47 | 28,598 | 960 | 48 | 512 |
| Nemo2 FP8 (delayed) | 6.18 | 26,511 | 960 | 48 | 512 |
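The throughput column relates to step time through simple arithmetic: tokens per second per GPU = global batch size × sequence length / (seconds per step × number of GPUs). The table does not state the sequence length, so the sketch below assumes 8192 tokens per sequence; under that assumption it reproduces the MBridge BF16 row. Other rows may reflect different token accounting, and the helper name is hypothetical.

```python
def tokens_per_sec_per_gpu(global_batch, seq_length, seconds_per_step, num_gpus):
    """Tokens processed per second by each GPU for one optimizer step."""
    return global_batch * seq_length / (seconds_per_step * num_gpus)


# MBridge BF16 row: global batch 960, 48 GPUs, 6.10 s/step.
# ASSUMPTION: 8192 tokens per sequence (not stated in the table).
bf16_throughput = tokens_per_sec_per_gpu(960, 8192, 6.10, 48)
print(round(bf16_throughput))  # 26859
```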
Activation memory optimizations allow context parallelism to work better with Evo2-style models in our MBridge implementation than in the previous NeMo2 implementation. Since TP requires more node-to-node communication, you generally want to limit TP to your fastest interconnects, which are typically configured in nodes of 8 GPUs. Evo2 would previously OOM with these more ideal configurations, requiring much larger than typical levels of TP to handle long-context training. With our latest changes to the Evo2 forward pass, we can now handle more typical TP vs CP configurations. This enables significantly faster step times at long context and makes context lengths of up to 2M feasible. So far we have demonstrated small training runs at 2M context on 512 H100 GPUs for the 40b-parameter model.
| Configuration | Precision | TP | CP | Number of Nodes | Number of GPUs | Context Length | Global Batch Size | Seconds per Step |
|---|---|---|---|---|---|---|---|---|
| NeMo2 | fp8-delayed | 64 | 2 | 32 | 256 | 1M | 2 | 44 |
| NeMo2 | fp8-delayed | 8 | 16 | 32 | 256 | 1M | 2 | OOM |
| MBridge Optimized | bf16 | 8 | 16 | 32 | 256 | 1M | 2 | 30 |
| 2M Stress Test | bf16 | 8 | 32 | 64 | 512 | 2M | 2 | 48 |
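The TP and CP columns determine how the GPUs and the sequence are divided. The sketch below assumes pipeline parallelism of 1 (the table does not list it), takes 1M as 1,048,576 tokens (the --seq-length value used elsewhere in this README), and assumes context parallelism splits each sequence evenly across CP ranks. The helper name is hypothetical.

```python
def parallel_layout(num_gpus, tp, cp, context_length, pp=1):
    """Data-parallel replicas and tokens held by each context-parallel rank."""
    model_parallel = tp * cp * pp
    if num_gpus % model_parallel:
        raise ValueError("num_gpus must be divisible by tp * cp * pp")
    if context_length % cp:
        raise ValueError("context_length must be divisible by cp")
    return {
        "data_parallel": num_gpus // model_parallel,
        "tokens_per_cp_rank": context_length // cp,
    }


ONE_M = 2 ** 20  # 1,048,576 tokens

nemo2_high_tp = parallel_layout(256, tp=64, cp=2, context_length=ONE_M)
mbridge_1m = parallel_layout(256, tp=8, cp=16, context_length=ONE_M)
stress_2m = parallel_layout(512, tp=8, cp=32, context_length=2 * ONE_M)
print(nemo2_high_tp, mbridge_1m, stress_2m)
```

Under these assumptions every row has two data-parallel replicas, which lines up with the global batch size of 2, and both TP=8 rows keep 65,536 tokens per context-parallel rank.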
Available models in NGC (Currently NeMo format so first convert to mbridge)
Note: If you would like to use one of the checkpoints that requires FP8 and Hopper (i.e., one that does not work on Blackwell), you need to supply both --mixed-precision-recipe bf16_mixed, to disable the default Megatron FP8 recipes, and --vortex-style-fp8, which enables the custom FP8 recipe that supports these models. For the robust NVIDIA fine-tuned variants of these models, you can run with FP8 using the available Megatron recipes. The evo2_7b model size does not have these sensitivity issues, so it can be executed with Megatron-style FP8 or BF16.
| HF Model | BioNeMo Resource Name | Blackwell FP8 | Blackwell BF16 | Hopper FP8 | Hopper BF16 | Ampere | Notes |
|---|---|---|---|---|---|---|---|
| arcinstitute/savanna_evo2_1b_base | evo2/1b-8k:1.0 | ✅ | ❌ | ✅ | ❌ | ❌ | Low accuracy on bf16 (e.g. ampere) GPUs |
| | evo2/1b-8k-bf16:1.0 | ✅ | ✅ | ✅ | ✅ | ✅ | Fine-tuned variant of the 1b-8k that supports bf16 as well as fp8, enabling ampere as well as hopper/blackwell. |
| arcinstitute/savanna_evo2_7b_base | evo2/7b-8k:1.0 | ✅ | ✅ | ✅ | ✅ | ✅ | The original 7b models have good accuracy across the board at bf16 and fp8 across tested hardware. |
| arcinstitute/savanna_evo2_7b | evo2/7b-1m:1.0 | ✅ | ✅ | ✅ | ✅ | ✅ | The original 7b models have good accuracy across the board at bf16 and fp8 across tested hardware. |
| arcinstitute/savanna_evo2_20b | | ? | ? | ✅ | ❌ | ❌ | The 20b model appears to have the same FP8+Hopper support matrix as the 40b model, but we have not tested all configurations thoroughly yet. |
| arcinstitute/savanna_evo2_40b_base | | ? | ? | ? | ? | ? | Unknown, likely has the same support pattern as the 40b-1m row below since this is the same model at an earlier step of training. |
| arcinstitute/savanna_evo2_40b | | ❌ | ❌ | ✅ | ❌ | ❌ | The original 40b-1m context trained model only supports Hopper FP8 |
| | evo2/40b-1m-fp8-bf16:1.0 | ✅ | ✅ | ✅ | ✅ | ✅ | A fine-tuned variant of arcinstitute/savanna_evo2_40b with broad hardware support (fp8 or bf16 and ampere, hopper, and blackwell have all been tested). The original model only has good accuracy on hopper fp8. |
On the CLI you can access the resources in this table (and others) with:
CKPT_PATH=$(download_bionemo_data evo2/40b-1m-fp8-bf16:1.0)
In code these resources can be accessed with:
from bionemo.core.data.load import load
ckpt_path = load("evo2/40b-1m-fp8-bf16:1.0")
Or you can follow the links in the table above to the ngc registry and follow the download links from there.
Note: in the following two sections, the model described as ft1(step119) is the model that was released above as evo2/40b-1m-fp8-bf16:1.0.
Loss evaluation
| device | model_size | is_finetune | fine_tune_desc | precision | ctx_length | average_nll | Notes |
|---|---|---|---|---|---|---|---|
| a100 | 1b | FALSE | None | bf16 | 8192 | 1.242033 | 1b base model works ok on b300, but cannot handle bf16 precision (and by extension ampere) |
| h200 | 1b | FALSE | None | fp8 | 8192 | 1.076465 | |
| b300 | 1b | FALSE | None | fp8 | 8192 | 1.084777 | |
| h200 | 1b | FALSE | None | bf16 | 8192 | 1.243525 | |
| b300 | 1b | FALSE | None | bf16 | 8192 | 1.243527 | |
| a100 | 1b | TRUE | ft | bf16 | 8192 | 1.078681 | 1b base model fine-tuned for bf16 can handle both bf16 and b300. B300 accuracy is also more similar to H200 accuracy after fine-tuning to handle bf16. Ampere appears to work fine as well. |
| h200 | 1b | TRUE | ft | fp8 | 8192 | 1.078623 | |
| b300 | 1b | TRUE | ft | fp8 | 8192 | 1.07901 | |
| h200 | 1b | TRUE | ft | bf16 | 8192 | 1.078671 | |
| b300 | 1b | TRUE | ft | bf16 | 8192 | 1.078694 | |
| a100 | 7b-1m | FALSE | None | bf16 | 8192 | 0.995102 | 7b model got lucky in training and generalizes well to bf16 precision as well as to blackwell and ampere. |
| h200 | 7b-1m | FALSE | None | fp8 | 8192 | 0.995265 | |
| b300 | 7b-1m | FALSE | None | fp8 | 8192 | 0.9951 | |
| h200 | 7b-1m | FALSE | None | bf16 | 8192 | 0.995109 | |
| b300 | 7b-1m | FALSE | None | bf16 | 8192 | 0.99535 | |
| a100 | 40b-1m | FALSE | None | bf16 | 8192 | 1.702023 | 40b model got unlucky in training. It is sensitive to fp8 and, within that, appears to have memorized the known difference in hopper that leads to lower accuracy when using standard fp8 computations (see the DeepSeek-V3 paper, which points out in the "Increasing Accumulation Precision" sub-section that hopper uses 14 bits to accumulate partials rather than the typical 32 bits). It does not work well on bf16, and that seems to carry over to ampere as expected. Note that if we set use_split_accumulator=True by setting https://github.com/NVIDIA/TransformerEngine/blob/bd55e7ba5f0235a80eaa63d49adaa8fb7c6ced50/transformer_engine/pytorch/module/base.py#L56 to True, then the fp8 is more accurate, which breaks fp8 on hopper and makes it behave more like blackwell. |
| h200 | 40b-1m | FALSE | None | fp8 | 8192 | 0.922422 | |
| b300 | 40b-1m | FALSE | None | fp8 | 8192 | 1.789 | |
| h200 | 40b-1m | FALSE | None | fp8-delayed(use_split_accumulator=True) | 8192 | 1.791161 | |
| h200 | 40b-1m | FALSE | None | bf16 | 8192 | 1.70015 | |
| b300 | 40b-1m | FALSE | None | bf16 | 8192 | 1.700162 | |
| a100 | 40b-1m | TRUE | ft0 | bf16 | 8192 | 0.962564 | The first fine-tuning run used a global batch size of 4 rather than 16. The training loss curve was very unstable, which could have led to a lower quality fine-tune. This was successful in that every hardware and fp8 precision combination works to some degree. The accuracy sits between the 7b and 40b checkpoints. This is also reflected in a 1% AUC drop on the BRCA1 notebook. https://wandb.ai/nvidia/evo2_40b_finetune/runs/Alp3KXuC/overview. Note that the accuracy on hopper or blackwell bf16 seems to closely track with ampere bf16. |
| h200 | 40b-1m | TRUE | ft0 | fp8 | 8192 | 0.963434 | |
| b300 | 40b-1m | TRUE | ft0 | fp8 | 8192 | 0.95985 | |
| h200 | 40b-1m | TRUE | ft0 | fp8-delayed(use_split_accumulator=True) | 8192 | 0.959287 | |
| h200 | 40b-1m | TRUE | ft0 | bf16 | 8192 | 0.962654 | |
| b300 | 40b-1m | TRUE | ft0 | bf16 | 8192 | 0.962621 | |
| a100 | 40b-1m | TRUE | ft1(step119) | bf16 | 8192 | 0.955813 | The second fine-tuning run has the same accuracy in the BRCA notebook as the original model, and maintains similar accuracy on hopper at fp8 (0.926 vs 0.922). Unfortunately the accuracy drops somewhat on bf16 as well as blackwell, but it is marginally better than the previous fine-tuning run. Accuracy closely tracks between ampere, hopper, and blackwell at bf16. |
| h200 | 40b-1m | TRUE | ft1(step119) | fp8 | 8192 | 0.926986 | |
| b300 | 40b-1m | TRUE | ft1(step119) | fp8 | 8192 | 0.954112 | |
| h200 | 40b-1m | TRUE | ft1(step119) | fp8-delayed(use_split_accumulator=True) | 8192 | 0.953928 | |
| h200 | 40b-1m | TRUE | ft1(step119) | bf16 | 8192 | 0.955881 | |
| b300 | 40b-1m | TRUE | ft1(step119) | bf16 | 8192 | 0.955859 | |
| h200 | 40b-1m | TRUE | ft1(step279) | fp8 | 8192 | 1.379552 | Interestingly, if you keep training the model, the accuracy continues to degrade slightly on validation, but note that the model has now shifted its sensitivity away from the fp8 rounding peculiarity on hopper to requiring the more accurate FP8 implementation on blackwell. Perhaps fine-tuning at a lower learning rate (I used the final minimal learning rate from the pretraining run), with more dropout (I used 0.1% dropout), or with more weight decay (I set a very small value to nearly disable it rather than the 0.1 the model was trained with) would help. https://wandb.ai/nvidia/evo2_40b_finetune/runs/Ji2IRcrz/overview. Note: use_split_accumulator=True is enabled by setting https://github.com/NVIDIA/TransformerEngine/blob/bd55e7ba5f0235a80eaa63d49adaa8fb7c6ced50/transformer_engine/pytorch/module/base.py#L56 to True. |
| b300 | 40b-1m | TRUE | ft1(step279) | fp8 | 8192 | 0.958749 | |
| h200 | 40b-1m | TRUE | ft1(step279) | fp8-delayed(use_split_accumulator=True) | 8192 | 0.957551 | |
| h200 | 40b-1m | TRUE | ft1(step279) | bf16 | 8192 | 0.959398 | |
| b300 | 40b-1m | TRUE | ft1(step279) | bf16 | 8192 | 0.959373 |
AUC Evaluation
| device | model_size | is_finetune | fine_tune_desc | precision | BRCA1 SM AUC | BRCA1 Bal AUC | BRCA1 AUC |
|---|---|---|---|---|---|---|---|
| A100 | 40b | TRUE | ft1(step119) | BF16 | 0.86 | | |
| H200 | 40b | TRUE | ft1(step119) | BF16 | | | |
| B300 | 40b | TRUE | ft1(step119) | BF16 | | | |
| B300 | 40b | TRUE | ft1(step119) | FP8 | 0.87 | | |
| H200 | 40b | TRUE | ft1(step119) | FP8 | 0.88 | | |
| A100 | 40b | TRUE | ft1(step279) | BF16 | 0.86 | | |
| B300 | 40b | TRUE | ft1(step279) | BF16 | | | |
| B300 | 40b | TRUE | ft1(step279) | FP8 | | | |
| H200 | 40b | TRUE | ft1(step279) | FP8 | 0.5 | | |
| A100 | 7b-1m | FALSE | | BF16 | 0.88 | | |
| B300 | 7b-1m | FALSE | | FP8 | 0.88 | | |
| H200 | 7b-1m | FALSE | | FP8 | 0.88 | | |
| H200 | 40b | TRUE | ft0(step2600) | FP8 | 0.47 | | |
| B300 | 40b | TRUE | ft0(step870) | BF16 | 0.86 | | |
| B300 | 40b | TRUE | ft0(step870) | FP8 | 0.86 | | |
| H200 | 40b | TRUE | ft0(step870) | FP8 | 0.86 | 0.86 | |
| H200 | 40b | FALSE | | FP8 | 0.85 | 0.87 | |
| A100 | 40b | FALSE | | BF16 | | | |
| B300 | 40b | FALSE | | BF16 | 0.55 | | |
| H200 | 40b | FALSE | | BF16 | 0.53 | | |
| B300 | 40b | FALSE | | FP8 | 0.48 | | |