
FASTA dataset

SimpleFastaDataset

Bases: Dataset

A simple dataset for Evo2 prediction.

Currently, this will not work for pre-training or fine-tuning, as that would require: 1) including "labels" in the input and 2) offsetting/rolling either the labels or input_ids to handle the off-by-one token prediction alignment.
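For reference, a minimal sketch of the offsetting the docstring refers to: in causal training each position predicts the next token, so labels are the input ids shifted left by one. The token ids below are illustrative, not part of this class.

import torch

# Hypothetical token ids: [EOS, A, C, G, T]
tokens = torch.tensor([0, 7, 8, 9, 5], dtype=torch.long)
input_ids = tokens[:-1]  # model input: [EOS, A, C, G]
labels = tokens[1:]      # targets:     [A,   C, G, T] (shifted left by one)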

Source code in bionemo/evo2/data/fasta_dataset.py
class SimpleFastaDataset(torch.utils.data.Dataset):
    """A simple dataset for Evo2 prediction.

    Currently, this will not work for pre-training or fine-tuning, as that would require:
    1) including "labels" in the input and 2) offsetting/rolling either the labels or
    input_ids to handle the off-by-one token prediction alignment.
    """

    def __init__(self, fasta_path: Path, tokenizer, prepend_bos: bool = True):
        """Initialize the dataset."""
        super().__init__()
        self.fasta = NvFaidx(fasta_path)
        self.seqids = sorted(self.fasta.keys())
        self.tokenizer = tokenizer
        self.prepend_bos = prepend_bos  # needed for getting predictions for the requested set of tokens.

    def write_idx_map(self, output_dir: Path):
        """Write the index map to the output directory."""
        with open(output_dir / "seq_idx_map.json", "w") as f:
            json.dump({seqid: idx for idx, seqid in enumerate(self.seqids)}, f)

    def __len__(self):
        """Get the length of the dataset."""
        return len(self.seqids)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get an item from the dataset."""
        sequence = self.fasta[self.seqids[idx]].sequence().upper()
        tokenized_seq = self.tokenizer.text_to_ids(sequence)
        if self.prepend_bos:  # in pretraining we use EOS to start new sequences.
            tokens: list[int] = [self.tokenizer.eod] + tokenized_seq
        else:
            tokens: list[int] = tokenized_seq
        loss_mask = torch.ones(len(tokens), dtype=torch.long)
        if self.prepend_bos:
            # Mask the EOS token, which we use for causal offsetting. Later, in predict,
            # we take the output for the first [:-1] tokens, which align with the
            # sequence starting after the EOS.
            loss_mask[0] = 0
        return {
            "tokens": torch.tensor(tokens, dtype=torch.long),
            "position_ids": torch.arange(len(tokens), dtype=torch.long),
            "seq_idx": torch.tensor(idx, dtype=torch.long),
            "loss_mask": loss_mask,
        }
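A usage sketch, assuming only what the class itself requires of the tokenizer (a text_to_ids method and an eod token id); the ByteTokenizer class and the genomes.fa path below are hypothetical stand-ins.

from pathlib import Path

from torch.utils.data import DataLoader


class ByteTokenizer:
    """Hypothetical stand-in; any object with `text_to_ids` and `eod` works here."""

    eod = 0

    def text_to_ids(self, text: str) -> list[int]:
        # Shift byte values by one so id 0 stays reserved for EOD.
        return [b + 1 for b in text.encode("utf-8")]


dataset = SimpleFastaDataset(Path("genomes.fa"), ByteTokenizer(), prepend_bos=True)
loader = DataLoader(dataset, batch_size=1)  # batch_size=1 avoids padding/collation of variable-length sequences
for batch in loader:
    print(batch["tokens"].shape, batch["loss_mask"][0, :3])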

__getitem__(idx)

Get an item from the dataset.

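Continuing the sketch above, the returned mapping can be inspected directly; with prepend_bos=True the first loss-mask entry is zero (token ids shown are illustrative):

item = dataset[0]
# e.g. tokens       = [EOS, A, C, G, T]   shape (len(sequence) + 1,)
#      position_ids = [0, 1, 2, 3, 4]
#      loss_mask    = [0, 1, 1, 1, 1]     EOS position masked out
assert item["tokens"].shape == item["loss_mask"].shape
assert item["loss_mask"][0].item() == 0  # holds when prepend_bos=True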

__init__(fasta_path, tokenizer, prepend_bos=True)

Initialize the dataset.


__len__()

Get the length of the dataset.


write_idx_map(output_dir)

Write the sequence-ID-to-index map to seq_idx_map.json in the output directory.

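A quick sketch of the output, continuing the example above; the record names "chr1" and "chr2" are hypothetical (sequence IDs are sorted at construction time, so indices follow sorted order):

from pathlib import Path

out_dir = Path("./out")
out_dir.mkdir(exist_ok=True)
dataset.write_idx_map(out_dir)
# ./out/seq_idx_map.json now contains, e.g., {"chr1": 0, "chr2": 1}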