
ESM-2

Model Overview

Description

ESM-2 is a pre-trained, bidirectional encoder (BERT-style model) over amino acid sequences. ESM-2 models provide per-amino-acid embeddings that have led to state-of-the-art performance on downstream tasks such as structure and function prediction. ESM-2 has been trained at a number of model sizes; BioNeMo2 includes converted checkpoints for the 650M and 3B parameter variants. The 650M model has 33 layers, 20 attention heads, and a hidden dimension of 1,280. The 3B model has 36 layers, 40 attention heads, and a hidden dimension of 2,560.

These models are ready for commercial use.
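
As a minimal sketch, the publicly available HuggingFace checkpoints (facebook/esm2_t33_650M_UR50D for 650M, facebook/esm2_t36_3B_UR50D for 3B) can be used to illustrate the model's inputs and outputs; the BioNeMo2-converted checkpoints are instead loaded through the BioNeMo framework:

```python
# Minimal sketch: obtaining per-residue embeddings from the public HuggingFace
# ESM-2 650M checkpoint. This illustrates input/output shapes only; it does not
# use the BioNeMo2-converted checkpoints described above.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/esm2_t33_650M_UR50D"  # 33 layers, 20 heads, hidden size 1,280
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example amino acid sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings: (batch, sequence length + special tokens, hidden size)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 35, 1280])
```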

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case [1]; see the non-NVIDIA Model Cards for the ESM-2 3B model and the ESM-2 650M model.

References

[1] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos Santos Costa, A., 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), pp.1123-1130.

[2] "UniProt: the universal protein knowledgebase in 2021." Nucleic acids research 49, no. D1 (2021): D480-D489.

[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Model Architecture

Architecture Type: BERT

Network Architecture: ESM-2

Input

Input Type(s): Text (Protein Sequences)

Input Parameters: 1D

Other Properties Related to Input: Protein sequence represented as a string of canonical amino acids, of maximum length 1022. Longer sequences are automatically truncated to this length.
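
As a hedged illustration of the 1022-residue limit, the HuggingFace ESM-2 tokenizer can truncate over-length sequences; the BioNeMo2 data pipeline applies its own truncation, so this is only a sketch:

```python
# Sketch of truncating an over-length protein sequence to the 1022-residue limit
# with the HuggingFace ESM-2 tokenizer (1022 residues plus BOS and EOS = 1024 tokens).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

long_sequence = "M" * 2000  # hypothetical sequence longer than the 1022-residue limit
encoded = tokenizer(
    long_sequence,
    truncation=True,
    max_length=1024,  # 1022 amino acids plus the BOS and EOS special tokens
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 1024])
```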

Output

Output Type(s): Embeddings (Amino-acid and sequence-level)

Output Parameters: 1D

Other Properties Related to Output: Numeric vectors of floating-point values, one embedding per amino acid in the input protein sequence. The maximum output length is 1022 embeddings (one embedding vector per amino acid).
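
A sequence-level embedding is commonly derived from the per-residue embeddings by pooling. The sketch below assumes mean pooling over residues (excluding the BOS/EOS special tokens) for a single unpadded sequence; this is one common choice, not the only supported reduction:

```python
import torch

def sequence_embedding(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-residue embeddings for a single unpadded sequence.

    Drops the BOS token at position 0 and the EOS token at position -1,
    then averages over the remaining residue positions.
    """
    residue_embeddings = last_hidden_state[0, 1:-1, :]  # (num_residues, hidden_dim)
    return residue_embeddings.mean(dim=0)               # (hidden_dim,)
```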

Software Integration

Runtime Engine(s)

  • BioNeMo, NeMo, Megatron, TransformerEngine

Supported Hardware Microarchitecture Compatibility

  • Ampere
  • Hopper
  • Volta

Supported Operating System(s)

  • Linux

Model Version(s)

Training & Evaluation

Training Dataset

Original ESM-2 checkpoints from HuggingFace were trained with the UniProt 2021_04 sequence database. For more details on the training dataset, see Lin et al. 2023. The train / test splits used by the original authors were not distributed. A pre-training database compiled by NVIDIA following a similar approach is described in UniProt Dataset.

Inference

Engine: BioNeMo, NeMo

Test Hardware

  • Ampere
  • Hopper
  • Volta

License

ESM-2 is provided under the Apache 2.0 license.

Competitive Benchmarking

Accuracy

A validation set of 328,360 UniRef50 representative sequences was randomly selected from UniRef 2024_03 (see UniProt Dataset). This validation set was used to verify that the BioNeMo2-converted checkpoints produce outputs consistent with the original checkpoints when evaluated with the HuggingFace Transformers library.

Validation perplexity (lower is better):

Checkpoint   HuggingFace   BioNeMo2   Lin et al. 2023
650M         7.001         7.002      6.95
3B           6.003         6.004      6.49

Different Validation Sets

The HuggingFace and converted BioNeMo2 checkpoints were evaluated on a newly curated validation set. Perplexities from Lin et al. 2023 are reported for comparison, but the original train/test splits are not available.
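
As a rough sketch of how such masked-language-model perplexities can be computed with the HuggingFace checkpoint: mask a fraction of residue positions, score them with the MLM head, and exponentiate the mean cross-entropy. The exact masking and evaluation protocol behind the reported numbers is not reproduced here, so treat this as illustrative only:

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

torch.manual_seed(0)  # make the random masking reproducible

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D").eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example amino acid sequence
inputs = tokenizer(sequence, return_tensors="pt")
labels = inputs["input_ids"].clone()

# BERT-style masking: randomly hide ~15% of residue positions (never special tokens).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
inputs["input_ids"][mask] = tokenizer.mask_token_id
labels[~mask] = -100  # score only the masked positions

with torch.no_grad():
    loss = model(**inputs, labels=labels).loss  # mean cross-entropy over masked tokens

print(float(torch.exp(loss)))  # perplexity for this single sequence
```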

Training Performance

Single-node Training Performance

ESM-2 Single-Device Training Performance

The pure-PyTorch baseline (compiled with torch.compile()) raised an out-of-memory error for batch sizes larger than 16 at the ESM2-650M model size. The BioNeMo2 model could handle a batch size of 46, reaching a model FLOPs utilization of 59.2% on an NVIDIA A100.
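
Model FLOPs utilization (MFU) is the fraction of the accelerator's peak throughput consumed by the model's arithmetic. A minimal sketch of the calculation, with placeholder numbers rather than the benchmark measurements, follows:

```python
def model_flops_utilization(model_flops_per_step: float,
                            step_time_s: float,
                            peak_flops_per_s: float) -> float:
    """MFU = achieved FLOPs per second divided by the hardware's peak FLOPs per second."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Placeholder values only; an A100 has a peak BF16 throughput of roughly 312 TFLOP/s.
print(model_flops_utilization(
    model_flops_per_step=1.0e15,  # hypothetical FLOPs consumed by one training step
    step_time_s=4.0,              # hypothetical measured step time in seconds
    peak_flops_per_s=312e12,
))  # -> ~0.80 with these placeholder inputs
```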

Model Scaling

ESM-2 Model Scaling

Training ESM-2 at the 650M, 3B, and 15B model sizes shows improved performance with the BioNeMo2 framework over the pure-PyTorch baseline. These experiments were conducted on 16 NVIDIA A100 or 16 NVIDIA H100 GPUs split across two nodes.

Device Scaling

ESM-2 Device Scaling

Training ESM-2 3B on 256 NVIDIA A100 GPUs across 32 nodes achieved 96.85% of the theoretical linear throughput extrapolated from single-node (8 GPU) performance, corresponding to a model FLOPs utilization of 60.6% at 256 devices.
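
As a sketch of the scaling-efficiency arithmetic described above (measured multi-node throughput relative to a linear extrapolation of single-node throughput), with placeholder throughput values rather than the benchmark data:

```python
def scaling_efficiency(multi_node_throughput: float,
                       single_node_throughput: float,
                       num_nodes: int) -> float:
    """Measured throughput as a fraction of ideal linear scaling from one node."""
    ideal_throughput = single_node_throughput * num_nodes
    return multi_node_throughput / ideal_throughput

# Placeholder numbers: 32 nodes of 8 GPUs each (256 devices total).
print(scaling_efficiency(multi_node_throughput=31.0,   # hypothetical samples/s (x1000)
                         single_node_throughput=1.0,   # hypothetical samples/s (x1000)
                         num_nodes=32))                # -> 0.96875 with these inputs
```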