bionemo-noodles
bionemo-noodles is a Python wrapper of noodles that extends FAIDX to support memmap-based file I/O for FASTA files.
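To illustrate the idea behind memmap-based access through a FAIDX index, here is a minimal standard-library sketch. It is not the implementation used by bionemo-noodles or noodles; the helper names `build_index` and `fetch` are hypothetical, and the sketch assumes a well-formed FASTA file in which every sequence line of a record (except possibly the last) has the same width, as the FAIDX format requires.

```python
import mmap
import tempfile
from pathlib import Path


def build_index(fasta_path: Path) -> dict[str, dict[str, int]]:
    """Collect the per-record fields of a .fai index (hypothetical helper).

    length: total bases, offset: byte position of the first base,
    linebases: bases per line, linewidth: bytes per line including the newline.
    """
    index: dict[str, dict[str, int]] = {}
    name = None
    pos = 0  # byte position of the current line within the file
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The first base of this record starts right after the header line.
                index[name] = {"length": 0, "offset": pos + len(line), "linebases": 0, "linewidth": 0}
            elif name is not None and line.strip():
                bases = len(line.rstrip(b"\r\n"))
                rec = index[name]
                if rec["linewidth"] == 0:  # the first sequence line defines the layout
                    rec["linebases"], rec["linewidth"] = bases, len(line)
                rec["length"] += bases
            pos += len(line)
    return index


def fetch(mm: mmap.mmap, rec: dict[str, int], start: int, end: int) -> str:
    """Return bases [start, end) (0-based, half-open) by slicing the memory map."""
    end = min(end, rec["length"])
    if start >= end:
        return ""

    def byte_pos(i: int) -> int:
        # Whole lines before base i, plus the position of base i within its line.
        return rec["offset"] + (i // rec["linebases"]) * rec["linewidth"] + i % rec["linebases"]

    raw = mm[byte_pos(start) : byte_pos(end - 1) + 1]
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.fa"
        path.write_bytes(b">chr1 test\nACGTAC\nGTTTGA\nCC\n>chr2\nGGGG\n")
        index = build_index(path)
        with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            print(fetch(mm, index["chr1"], 4, 9))  # prints ACGTT
```

Because the byte position of any base follows from the four index fields, a lookup touches only the pages of the file that contain the requested range, which is what makes this access pattern attractive for large FASTA files.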
Installation
To install from PyPI, execute the following command:
pip install bionemo-noodles
Compatibility
bionemo-noodles has pre-built wheels for Python/Cython 3.10, 3.11, and 3.12, and is compatible with manylinux_2_28 on x86_64.
For a custom build configuration that is not currently supported on PyPI, reach out to: bionemofeedback@nvidia.com
Usage
An example torch.utils.data.Dataset using NvFaidx / bionemo-noodles:
import json
from pathlib import Path

import torch
from bionemo.noodles.nvfaidx import NvFaidx


class SimpleFastaDataset(torch.utils.data.Dataset):
    def __init__(self, fasta_path: Path, tokenizer):
        """Initialize the dataset."""
        super().__init__()
        self.fasta = NvFaidx(fasta_path)
        # Sort the sequence IDs so that integer indices are deterministic.
        self.seqids = sorted(self.fasta.keys())
        self.tokenizer = tokenizer

    def write_idx_map(self, output_dir: Path):
        """Write the index map to the output directory."""
        with open(output_dir / "seq_idx_map.json", "w") as f:
            json.dump({seqid: idx for idx, seqid in enumerate(self.seqids)}, f)

    def __len__(self):
        """Get the length of the dataset."""
        return len(self.seqids)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get an item from the dataset."""
        sequence = self.fasta[self.seqids[idx]].sequence().upper()
        tokenized_seq = self.tokenizer.text_to_ids(sequence)
        tokens = torch.tensor(tokenized_seq, dtype=torch.long)
        # Every token contributes to the loss.
        loss_mask = torch.ones_like(tokens)
        return {
            "tokens": tokens,
            "position_ids": torch.arange(len(tokenized_seq), dtype=torch.long),
            "seq_idx": torch.tensor(idx, dtype=torch.long),
            "loss_mask": loss_mask,
        }
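The dataset above only assumes that the tokenizer object exposes a text_to_ids method returning a list of integer IDs. For quick experiments, a hypothetical stand-in such as the following could be passed in; the class name `ByteTokenizer` and its ASCII encoding scheme are illustrative assumptions, not part of bionemo-noodles.

```python
class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per ASCII character."""

    def text_to_ids(self, text: str) -> list[int]:
        # Each character maps to its ASCII code, e.g. "A" -> 65.
        return list(text.encode("ascii"))

    def ids_to_text(self, ids: list[int]) -> str:
        # Inverse mapping, useful for sanity checks.
        return bytes(ids).decode("ascii")


if __name__ == "__main__":
    tokenizer = ByteTokenizer()
    print(tokenizer.text_to_ids("ACGT"))  # prints [65, 67, 71, 84]
    # With bionemo-noodles and torch installed, and a FASTA file of your own
    # (the path below is a placeholder), the dataset could be built as:
    #   dataset = SimpleFastaDataset(Path("example.fasta"), ByteTokenizer())
```

A real training setup would normally use the tokenizer that matches the model's vocabulary; this sketch is only meant to show the interface the dataset relies on.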
BioNeMo Framework Ecosystem Development
To install this sub-package locally (with --editable):
pip install -e .
To run unit tests, execute:
pytest -v .
To build wheels for different Python versions, Linux distributions, and system architectures, run the BioNeMo Sub-Package GitHub Actions Workflow (bionemo-subpackage-ci.yml).