Skip to content

bionemo-geneformer

Geneformer is a foundational single-cell RNA (scRNA) language model using a BERT architecture trained on millions of single-cell RNA sequences. It captures gene co-expression patterns to learn cellular representations, enabling predictive tasks across biology and medicine. Geneformer is trained on a masked language model (MLM) objective, where expression rank-ordered "gene tokens" in single-cell RNA sequences are masked, replaced, or left unchanged, and the model learns to predict these masked genes based on context. This module provides Dataset classes, collators for expression rank ordering, and Config objects for constructing Geneformer-style models.

Setup

To install, execute the following from this directory (or point the install to this directory):

pip install -e .

To run unit tests, execute:

pytest -v .

Acquiring Data

Datasets are expected to be in the form of AnnData (.h5ad) objects such as those downloaded from Cell x Gene | CZI. They are then pre-processed with sub-packages/bionemo-scdl/src/bionemo/scdl/scripts/convert_h5ad_to_scdl.py.

Geneformer-nv 10M and 106M

Refer to the Dataset cards and Model cards to learn more about the pre-trained checkpoints provided for both 10M and 106M of Geneformer-nv.

See Also