Eden Recipe

Eden is a family of genomic models developed by Basecamp Research and built on the Llama 3.1 architecture. Model sizes range from roughly 100M to 35B parameters.

Reference: Eden by Basecamp Research.

Installation

./.ci_build.sh  # build the virtualenv
source ./.ci_test_env.sh  # activate the virtualenv

CLI tools

Command                          Description
train_eden                       Train or fine-tune Eden models
infer_eden                       Autoregressive text generation (greedy/sampling)
predict_eden                     Batch log-likelihood scoring on FASTA sequences
eden_convert_nemo2_to_mbridge    Convert NeMo2 checkpoints to MBridge DCP format
eden_export_mbridge_to_hf        Export Eden MBridge checkpoint to HuggingFace Llama
eden_convert_hf_to_mbridge       Convert HuggingFace Llama checkpoint to Eden MBridge
eden_remove_optimizer            Strip optimizer state from an MBridge checkpoint

Quick start

Training with mock data

torchrun --nproc-per-node 1 --no-python \
  train_eden \
  --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
  --model-size eden_7b --num-layers 2 --max-steps 5 --eval-interval 5 \
  --eval-iters 1 --mock-data \
  --micro-batch-size 4 --global-batch-size 4 --seq-length 64 \
  --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --context-parallel-size 1 \
  --mixed-precision-recipe bf16_mixed \
  --no-activation-checkpointing \
  --decay-steps 1000 --warmup-steps 10 \
  --log-interval 1 --seed 41 --dataset-seed 33 \
  --result-dir eden_test

Note: fp32_residual_connection is automatically set to False for Eden/TE layers.
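
The batch-size and parallelism flags in these commands are tied together. As a rough sketch, assuming the usual Megatron-style convention (world size = data-parallel x tensor-parallel x pipeline-parallel x context-parallel, and global batch = micro batch x data-parallel size x gradient-accumulation steps), a flag combination can be sanity-checked before launching. The helper name is hypothetical, and the validation that train_eden itself performs may differ.

```python
def grad_accumulation_steps(world_size, tp, pp, cp, micro_batch, global_batch):
    """Micro-batches each data-parallel rank accumulates per optimizer step.

    Hypothetical helper: assumes the Megatron-style relationship between
    world size, model-parallel sizes, and batch sizes.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp * pp * cp")
    data_parallel = world_size // model_parallel
    per_step = micro_batch * data_parallel
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by micro_batch * data_parallel")
    return global_batch // per_step


# Mock-data run: 1 GPU, micro batch 4, global batch 4.
print(grad_accumulation_steps(1, 1, 1, 1, micro_batch=4, global_batch=4))   # 1
# 8 GPUs with tensor parallelism 4, micro batch 1, global batch 64.
print(grad_accumulation_steps(8, 4, 1, 1, micro_batch=1, global_batch=64))  # 32
```

Under these assumptions, a combination that raises ValueError would be rejected or silently adjusted by the trainer, so it is cheaper to catch before allocating GPUs.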

Training with sharded Eden data

For production training, use --sharded-eden-data with pre-sharded SQLite sequence databases and precomputed window databases. See src/bionemo/eden/data/sharded_eden_dataloader.md for the full data schema, directory structure, and pre-processing workflow.

torchrun --nproc-per-node 8 --no-python \
  train_eden \
  --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
  --model-size eden_7b --max-steps 100000 --eval-interval 500 \
  --eval-iters 32 \
  --sharded-eden-data \
  --sequence-db-dir /path/to/sequence_dbs \
  --train-window-db /path/to/train_windows.db \
  --val-window-db /path/to/val_windows.db \
  --test-window-db /path/to/test_windows.db \
  --micro-batch-size 1 --global-batch-size 64 --seq-length 8192 \
  --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --context-parallel-size 1 \
  --mixed-precision-recipe bf16_mixed \
  --warmup-steps 2500 --decay-steps 97500 \
  --log-interval 10 --seed 41 --dataset-seed 33 \
  --result-dir /path/to/results

The --stride (default 7992) and --window-min-length-threshold (default 0) flags control how windows are sampled. Use --rc-aug to enable reverse-complement augmentation.
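
To make the windowing flags concrete, here is a minimal sketch; it is not the ShardedEdenDataset implementation. It assumes windows of --seq-length tokens start every --stride positions, so with 8192 and 7992 consecutive windows would overlap by 200 positions, and it reads --window-min-length-threshold as a lower bound on the length of a kept window. The real window databases are precomputed and may follow different rules. Both function names are hypothetical. The second helper shows what reverse-complement augmentation does to a DNA string.

```python
def window_starts(seq_len, window=8192, stride=7992, min_length=0):
    """Start offsets of strided windows over a sequence of length seq_len.

    Illustrative only: a (possibly shorter) trailing window is kept when its
    length is at least min_length.
    """
    starts = []
    for start in range(0, seq_len, stride):
        length = min(window, seq_len - start)
        if length >= min_length:
            starts.append(start)
    return starts


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse the sequence and swap A<->T and C<->G; other symbols (e.g. N) are kept."""
    return seq.translate(_COMPLEMENT)[::-1]


print(window_starts(20000))         # [0, 7992, 15984]
print(reverse_complement("ATCGN"))  # NCGAT
```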

Fine-tuning from a checkpoint

Resume training from an existing MBridge checkpoint using --finetune-ckpt-dir:

torchrun --nproc-per-node 8 --no-python \
  train_eden \
  --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
  --model-size eden_7b --max-steps 10000 --eval-interval 500 \
  --eval-iters 32 --mock-data \
  --finetune-ckpt-dir /path/to/mbridge/checkpoint \
  --micro-batch-size 1 --global-batch-size 64 --seq-length 8192 \
  --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --context-parallel-size 1 \
  --mixed-precision-recipe bf16_mixed \
  --warmup-steps 500 --decay-steps 9500 \
  --lr 1e-4 --min-lr 1e-5 \
  --log-interval 10 --seed 41 \
  --result-dir /path/to/finetune_results

The path given to --finetune-ckpt-dir can either contain iter_* subdirectories or point directly at a single checkpoint directory that holds run_config.yaml. Run eden_remove_optimizer first if you only need the model weights.
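
The two accepted directory layouts can be told apart with a few lines of Python. This is a sketch under assumptions: the helper name is hypothetical, and picking the highest-numbered iter_* directory is a guess at the selection rule rather than a description of what train_eden does internally.

```python
import re
from pathlib import Path


def resolve_checkpoint_dir(path):
    """Return the directory that actually holds the checkpoint.

    Hypothetical helper: returns `path` itself when it contains
    run_config.yaml, otherwise the iter_* subdirectory with the highest
    iteration number.
    """
    root = Path(path)
    if (root / "run_config.yaml").is_file():
        return root
    candidates = []
    for child in root.iterdir():
        match = re.fullmatch(r"iter_(\d+)", child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        raise FileNotFoundError(f"no run_config.yaml or iter_* directory under {root}")
    # Compare on the iteration number only, never on the Path objects.
    return max(candidates, key=lambda item: item[0])[1]
```

Running this before a long job gives an early, readable error when the path points at the wrong level of the results tree.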

Convert: NeMo2 to MBridge

Convert a NeMo2 DCP checkpoint to MBridge format for use with train_eden, infer_eden, and the other MBridge-based tools:

eden_convert_nemo2_to_mbridge \
  --nemo2-ckpt-dir /path/to/nemo2/checkpoint \
  --tokenizer-path tokenizers/nucleotide_fast_tokenizer_256 \
  --mbridge-ckpt-dir /path/to/eden_mbridge \
  --model-size eden_7b \
  --seq-length 8192 \
  --mixed-precision-recipe bf16_mixed

Autoregressive generation (infer_eden)

torchrun --nproc-per-node 1 --no-python \
  infer_eden \
  --ckpt-dir /path/to/mbridge/checkpoint \
  --prompt "ATCGATCGATCGATCG" \
  --max-new-tokens 200 \
  --temperature 1.0 \
  --output-file generated.txt

Options: --ckpt-dir, --prompt/--prompt-file, --max-new-tokens, --temperature, --top-k/--top-p, --tensor-parallel-size, --max-seq-length (auto-detected by default, override with EDEN_MAX_SEQ_LEN env var).
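
For readers unfamiliar with the sampling options, the following generic sketch shows how temperature, top-k, and top-p (nucleus) filtering typically combine when choosing the next token. It is not infer_eden's code: the function name is hypothetical, and treating a temperature of 0 as greedy decoding is a convention of this sketch, not a documented behavior of the tool.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id from raw logits with temperature, top-k and top-p filtering.

    Generic sketch. temperature == 0 is treated as greedy (argmax) decoding.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    # Most probable tokens first.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= top_p:
            break
    ids = [idx for _, idx in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(ids, weights=probs, k=1)[0]
```

Lower temperatures and smaller top-k or top-p values concentrate sampling on the most likely nucleotides; higher values give more diverse generations.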

Batch sequence scoring (predict_eden)

torchrun --nproc-per-node 1 --no-python \
  predict_eden \
  --fasta /path/to/sequences.fasta \
  --ckpt-dir /path/to/mbridge/checkpoint \
  --output-dir predictions/ \
  --micro-batch-size 4 \
  --write-interval epoch
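
As background on what log-likelihood scoring means for an autoregressive model, the sketch below sums the log-probability of each token given its predecessors. It is a generic illustration with a hypothetical function name; the files predict_eden writes to --output-dir, and any normalization it applies, are not described here and are not assumed.

```python
import math


def sequence_log_likelihood(logits, token_ids):
    """Sum of log p(token_t | tokens before t) for t >= 1.

    logits[t] holds the model's scores for the token at position t + 1,
    so the first token of the sequence is not scored.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        peak = max(row)
        # log-sum-exp gives the softmax normalizer in a numerically stable way.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += row[token_ids[t + 1]] - log_norm
    return total
```

Dividing the result by the number of scored tokens gives a per-token average that is easier to compare across sequences of different lengths.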

Exporting / importing Eden (Llama) checkpoints

Export: MBridge to HuggingFace

eden_export_mbridge_to_hf \
  --mbridge-ckpt-dir /path/to/eden_mbridge/iter_0000001 \
  --hf-output-dir /path/to/eden_hf \
  --model-size eden_7b

This produces a standard HuggingFace model directory that can be loaded with LlamaForCausalLM.from_pretrained().
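
A minimal loading sketch, assuming the transformers package is installed and that the exported directory is at /path/to/eden_hf as in the command above. The wrapper function is hypothetical; whether the export also writes tokenizer files is not stated here, so the tokenizer is left out.

```python
def load_exported_eden(hf_dir):
    """Load an exported Eden checkpoint as a HuggingFace Llama model.

    transformers is imported lazily so that defining this helper does not
    require the package to be installed.
    """
    from transformers import LlamaForCausalLM

    model = LlamaForCausalLM.from_pretrained(hf_dir)
    model.eval()  # inference mode: disables dropout
    return model
```

Calling load_exported_eden("/path/to/eden_hf") and inspecting model.config is a quick way to confirm that the layer count and hidden size match the --model-size that was exported.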

Import: HuggingFace to MBridge

eden_convert_hf_to_mbridge \
  --hf-model-dir /path/to/eden_hf \
  --mbridge-ckpt-dir /path/to/eden_mbridge_reimported \
  --model-size eden_7b

Removing optimizer state from a checkpoint

Training checkpoints include optimizer state (Adam moments, LR scheduler, RNG state), which roughly triples the checkpoint size. Use eden_remove_optimizer to produce a smaller, weights-only checkpoint suitable for release or fine-tuning:

eden_remove_optimizer \
  --src-ckpt-dir /path/to/training/checkpoints \
  --dst-ckpt-dir /path/to/weights_only_checkpoint

The tool automatically finds the latest iter_* directory, strips optimizer and scheduler state from the DCP files, and copies model weights, tokenizer, and config files to the destination. The resulting checkpoint is directly usable with --finetune-ckpt-dir or the export tools.
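
To check how much space was actually saved, the on-disk sizes of the two directories can be compared. This is a small stdlib sketch with hypothetical helper names. Since the source directory may hold several iter_* checkpoints while the destination holds only one, pass the specific iter_* directory as the source for a like-for-like comparison.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def size_ratio(src_dir, dst_dir):
    """How many times larger src_dir is than dst_dir."""
    return dir_size_bytes(src_dir) / dir_size_bytes(dst_dir)
```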

Model sizes

Key          Description
eden_100m    Eden ~100M
eden_300m    Eden ~300M
eden_1b      Eden ~1B
eden_7b      Eden base (~8B params)
eden_11b     Eden ~11B
eden_18b     Eden ~18B
eden_21b     Eden ~21B
eden_24b     Eden ~24B (32K context)
eden_27b     Eden ~27B (32K context)
eden_28b     Eden ~28B
eden_35b     Eden ~35B

Data

Eden uses the ShardedEdenDataset from Basecamp Research, backed by SQLite for fast windowed access to genomic sequences. The data utilities are provided by the bionemo-recipeutils sub-package.
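
The idea of windowed access over SQLite can be illustrated with the standard library alone. The schema below (a sequences table with id and seq columns) is purely hypothetical and chosen for the example; the real schema is documented in src/bionemo/eden/data/sharded_eden_dataloader.md and should be used instead.

```python
import sqlite3


def fetch_window(conn, sequence_id, start, length):
    """Return sequence[start:start + length]; only the slice is returned to Python.

    Hypothetical schema: sequences(id TEXT PRIMARY KEY, seq TEXT).
    SQLite's substr() is 1-indexed, hence start + 1.
    """
    row = conn.execute(
        "SELECT substr(seq, ?, ?) FROM sequences WHERE id = ?",
        (start + 1, length, sequence_id),
    ).fetchone()
    if row is None:
        raise KeyError(sequence_id)
    return row[0]


# In-memory database with one toy sequence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (id TEXT PRIMARY KEY, seq TEXT)")
conn.execute("INSERT INTO sequences VALUES (?, ?)", ("contig_1", "ATCGATCGGGTTAACC"))
print(fetch_window(conn, "contig_1", start=4, length=6))  # ATCGGG
```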

Docker build

docker build -t eden_megatron_recipe-$(git rev-parse --short HEAD) .