# Pre-training ESM-2
Pre-trained checkpoints for ESM-2 are available at the 8M, 650M, and 3B model sizes. These models were trained by the `bionemo-framework` team to reproduce the original training results from Lin et al., *Science* (2023), using more recent UniProt data and the BioNeMo training infrastructure. The full pre-training dataset and train/test splits are also available.
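The training inputs referenced in the recipes below are a parquet file of cluster assignments and a SQLite database of sequences. A quick way to peek at them without assuming a particular schema (paths and column layout depend on the release):

```python
import sqlite3

import pandas as pd

# Cluster assignments used for sampling; exact columns depend on the release.
clusters = pd.read_parquet("/data/train_clusters.parquet")
print(clusters.head())

# List the tables in the sequence database rather than assuming a schema.
conn = sqlite3.connect("/data/train.db")
print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn))
conn.close()
```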
## Model Convergence
Validation perplexity evaluated on the NVIDIA validation set.
| Model Size | Perplexity at 500k updates |
| ---------- | -------------------------- |
| 8M         | 10.26                      |
| 650M       | 7.14                       |
| 3B         | 6.42                       |
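Perplexity here is the exponential of the mean cross-entropy loss over masked tokens. A minimal sketch of that computation (the standard masked-LM definition; the exact evaluation code in the framework may differ):

```python
import torch
import torch.nn.functional as F

def masked_lm_perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Perplexity over masked positions only; labels are -100 elsewhere."""
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # (batch * seq_len, vocab)
        labels.view(-1),                   # (batch * seq_len,)
        ignore_index=-100,                 # skip positions that were not masked
    )
    return torch.exp(loss).item()
```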
## Pre-training recipes

### ESM-2 8M

The 8M checkpoint can be downloaded with the `load` helper, which fetches the artifact, caches it locally, and returns its path:

```python
from bionemo.core.data.load import load

esm2_8m_ckpt_path = load("esm2/nv_8m:2.0")
```
#### Training Script

| Training Parameters     | Value |
| ----------------------- | ----- |
| # of GPUs               | 32    |
| GPU Type                | A100  |
| Batch size (per device) | 64    |
```bash
train_esm2 \
    --create-tensorboard-logger \
    --resume-if-exists \
    --wandb-project=<wandb-project-name> \
    --save-top-k=10 \
    --train-cluster-path=/data/train_clusters.parquet \ # (1)!
    --train-database-path=/data/train.db \
    --valid-cluster-path=/data/valid_clusters.parquet \
    --valid-database-path=/data/validation.db \
    --num-steps=500_000 \
    --metric-to-monitor-for-checkpoints=val_loss \
    --micro-batch-size=64 \
    --num-nodes=4 \
    --num-gpus=8 \
    --val-check-interval=10000 \
    --limit-val-batches=1.0 \
    --result-dir=/results/esm2_pretrain_8m \
    --experiment-name=esm2_pretrain_8m \
    --num-layers=6 \
    --hidden-size=320 \
    --num-attention-heads=20 \
    --ffn-hidden-size=1280;
```
1. Paths here must be mounted into the `bionemo-framework` docker image (see the example below).
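A typical way to satisfy this is to bind-mount the data and results directories when starting the container; a minimal sketch (the image tag and host paths are illustrative, use the release you have pulled):

```bash
docker run --rm -it --gpus all \
    -v /path/to/data:/data \
    -v /path/to/results:/results \
    nvcr.io/nvidia/clara/bionemo-framework:2.0 \
    bash
```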
### ESM-2 650M

To download the 650M checkpoint:

```python
from bionemo.core.data.load import load

esm2_650m_ckpt_path = load("esm2/nv_650m:2.1")
```
#### Training Script

| Training Parameters     | Value |
| ----------------------- | ----- |
| # of GPUs               | 64    |
| GPU Type                | H100  |
| Batch size (per device) | 32    |
```bash
train_esm2 \
    --create-tensorboard-logger \
    --resume-if-exists \
    --wandb-project=<wandb-project-name> \
    --save-top-k=10 \
    --train-cluster-path=/data/train_clusters.parquet \ # (1)!
    --train-database-path=/data/train.db \
    --valid-cluster-path=/data/valid_clusters.parquet \
    --valid-database-path=/data/validation.db \
    --num-steps=500_000 \
    --metric-to-monitor-for-checkpoints=val_loss \
    --micro-batch-size=32 \
    --num-nodes=8 \
    --num-gpus=8 \
    --val-check-interval=10000 \
    --limit-val-batches=1.0 \
    --result-dir=/results/esm2_pretrain_650m \
    --experiment-name=esm2_pretrain_650m \
    --min-seq-length=1024 \
    --max-seq-length=1024 \
    --num-layers=33 \
    --hidden-size=1280 \
    --num-attention-heads=20 \
    --ffn-hidden-size=5120;
```
1. Paths here must be mounted into the `bionemo-framework` docker image.
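Note that each recipe halves the per-device micro-batch as the model size grows while doubling the GPU count, so the global batch stays at 2,048 sequences throughout (the 3B recipe below included). A quick sanity check, assuming no gradient accumulation:

```python
# Global batch = nodes x GPUs per node x per-device micro-batch
recipes = {
    "8M":   dict(nodes=4,  gpus_per_node=8, micro_batch=64),
    "650M": dict(nodes=8,  gpus_per_node=8, micro_batch=32),
    "3B":   dict(nodes=16, gpus_per_node=8, micro_batch=16),
}
for name, r in recipes.items():
    print(name, r["nodes"] * r["gpus_per_node"] * r["micro_batch"])  # 2048 each
```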
### ESM-2 3B

To download the 3B checkpoint:

```python
from bionemo.core.data.load import load

esm2_3b_ckpt_path = load("esm2/nv_3b:2.1")
```
#### Training Script

| Training Parameters     | Value  |
| ----------------------- | ------ |
| # of GPUs               | 128    |
| GPU Type                | H100   |
| Batch size (per device) | 16     |
| Warmup steps            | 20,000 |
```bash
train_esm2 \
    --create-tensorboard-logger \
    --resume-if-exists \
    --wandb-project=<wandb-project-name> \
    --save-top-k=10 \
    --train-cluster-path=/data/train_clusters.parquet \ # (2)!
    --train-database-path=/data/train.db \
    --valid-cluster-path=/data/valid_clusters.parquet \
    --valid-database-path=/data/validation.db \
    --num-steps=500_000 \
    --warmup-steps=20_000 \ # (1)!
    --metric-to-monitor-for-checkpoints=val_loss \
    --micro-batch-size=16 \
    --num-nodes=16 \
    --num-gpus=8 \
    --val-check-interval=2500 \
    --limit-val-batches=1.0 \
    --result-dir=/results/esm2_pretrain_3b \
    --experiment-name=esm2_pretrain_3b \
    --min-seq-length=1024 \
    --max-seq-length=1024 \
    --num-layers=36 \
    --hidden-size=2560 \
    --num-attention-heads=40 \
    --ffn-hidden-size=10240;
```
1. We had to increase the number of warmup steps 10x over the published training recipe for ESM-2 3B, which was likely trained with fp16 precision. This gave us an overall similar initial training curve but avoided convergence issues at around 2,000 steps (a sketch of the resulting schedule follows this list).
2. Paths here must be mounted into the `bionemo-framework` docker image.
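For reference, this is roughly what a linear-warmup schedule with the longer 20,000-step ramp looks like; the peak learning rate and the post-warmup decay shape are assumptions for illustration, not values taken from `train_esm2`:

```python
def lr_at_step(step: int, peak_lr: float = 4e-4, warmup_steps: int = 20_000,
               total_steps: int = 500_000) -> float:
    """Linear warmup to peak_lr, then linear decay toward zero.

    peak_lr and the decay shape are illustrative assumptions; only the
    20,000 warmup steps come from the recipe above.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max((total_steps - step) / (total_steps - warmup_steps), 0.0)
```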