Skip to content

Nvfaidx

NvFaidx

NvFaidx is a rest + pyo3 replacement for PyFaidx that provides a dictionary-like interface to reference genomes.

NvFaidx is built using Noodles as a backend for Fai objects, and memory maps for backing the underlying fasta. Using a backend of Memmaps provide the following benefits: - The kernel implements this mechanism by using page faults - Each read in a mmap'd file results in a page fault: there's nothing in memory to read! - The kernel handles this page fault by going to the disk, reading the file in the specified offset + index, then returning to the user process with what it just read, preventing penalties from context switching.

Context: PyFaidx or any buffered read based index is not process safe, and therefore does not play nice with pytorch dataloaders. Due to the order of operations, the underlying file handle is shared between processes, when seek() is called to perform random lookups, this can cause unexpected behavior in the forked processes. Ref: https://github.com/mdshw5/pyfaidx/issues/211

For a good solution we need three things

1) Safe index creation, in multi-process or multi-node scenarios, this should be restricted to a single node where all workers block until it is complete (not implemented above) 2) Index object instantion must be fast. 3) Read-only use of the index object must be both thread safe and process safe with python.

Source code in bionemo/noodles/nvfaidx.py
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
class NvFaidx:
    """NvFaidx is a rest + pyo3 replacement for PyFaidx that provides a dictionary-like interface to reference genomes.

    NvFaidx is built using Noodles as a backend for Fai objects, and memory maps for backing the underlying fasta.
    Using a backend of Memmaps provide the following benefits:
        - The kernel implements this mechanism by using page faults
        - Each read in a mmap'd file results in a page fault: there's nothing in memory to read!
        - The kernel handles this page fault by going to the disk, reading the file in the specified offset + index,
            then returning to the user process with what it just read, preventing penalties from context switching.

    *Context*: PyFaidx or _any_ buffered read based index is not process safe, and therefore does not play nice with pytorch dataloaders.
    Due to the order of operations, the underlying file handle is shared between processes, when `seek()` is called to perform random lookups,
    this can cause unexpected behavior in the forked processes.
    Ref: https://github.com/mdshw5/pyfaidx/issues/211

    For a good solution we need three things:
        1) Safe index creation, in multi-process or multi-node scenarios, this should be restricted to a single node
            where all workers block until it is complete (not implemented above)
        2) Index object instantion must be fast.
        3) Read-only use of the index object must be both thread safe and process safe with python.
    """

    def __init__(self, fasta_path: str | Path, faidx_path: Optional[str | Path] = None, ignore_existing_fai=True):
        """Construct a dict-like object representing a memmapped, indexed FASTA file.

        Args:
            fasta_path (str): Path to the FASTA file.
            faidx_path (str): Path to the FAI index file. If None, one will be created.
            ignore_existing_fai (bool): If True, ignore any existing FAI file and create an in-memory index. Note that
                this will also ignore `faidx_path`.
        """
        if isinstance(fasta_path, Path):
            fasta_path = str(fasta_path)
        elif not isinstance(fasta_path, str):
            raise TypeError(f"fasta_path must be a `str` or `pathlib.Path`, got: {type(fasta_path)}")

        if isinstance(faidx_path, Path):
            faidx_path = str(faidx_path)
        elif not isinstance(faidx_path, str) and faidx_path is not None:
            raise TypeError(f"faidx_path must be a `str`, `pathlib.Path`, or None. got: {type(faidx_path)}")

        if ignore_existing_fai:
            self.reader = PyIndexedMmapFastaReader(fasta_path, ignore_existing_fai=ignore_existing_fai)
        elif faidx_path is not None:
            self.reader = PyIndexedMmapFastaReader.from_fasta_and_faidx(fasta_path, faidx_path)
        else:
            # Builds a FAIDX object in memory by default.
            self.reader = PyIndexedMmapFastaReader(fasta_path)

        self.records: Dict[str, PyFaidxRecord] = {record.name: record for record in self.reader.records()}

    def __getitem__(self, seqid: str) -> SequenceAccessor:  # noqa: D105
        if seqid not in self.records:
            raise KeyError(f"Sequence '{seqid}' not found in index.")

        # Return a SequenceAccessor for slicing access
        record_length = self.records[seqid].length
        return SequenceAccessor(self.reader, seqid, record_length)

    def __contains__(self, seqid: str) -> bool:  # noqa: D105
        return seqid in self.records

    def __len__(self) -> int:  # noqa: D105
        return len(self.records)

    def keys(self) -> set[str]:  # noqa: D102
        return set(self.records.keys())

    @staticmethod
    def create_faidx(fasta_filename: str | Path) -> str:
        """Create a FAI index for a FASTA file, the result is saved in the same location as `fasta_filename`, with a .fai extension.

        Args:
            fasta_filename (str): Path to the FASTA file to be indexed.
        """
        if isinstance(fasta_filename, Path):
            fasta_filename = str(fasta_filename)
        return PyIndexedMmapFastaReader.create_faidx(fasta_filename)

__init__(fasta_path, faidx_path=None, ignore_existing_fai=True)

Construct a dict-like object representing a memmapped, indexed FASTA file.

Parameters:

Name Type Description Default
fasta_path str

Path to the FASTA file.

required
faidx_path str

Path to the FAI index file. If None, one will be created.

None
ignore_existing_fai bool

If True, ignore any existing FAI file and create an in-memory index. Note that this will also ignore faidx_path.

True
Source code in bionemo/noodles/nvfaidx.py
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
def __init__(self, fasta_path: str | Path, faidx_path: Optional[str | Path] = None, ignore_existing_fai=True):
    """Construct a dict-like object representing a memmapped, indexed FASTA file.

    Args:
        fasta_path (str): Path to the FASTA file.
        faidx_path (str): Path to the FAI index file. If None, one will be created.
        ignore_existing_fai (bool): If True, ignore any existing FAI file and create an in-memory index. Note that
            this will also ignore `faidx_path`.
    """
    if isinstance(fasta_path, Path):
        fasta_path = str(fasta_path)
    elif not isinstance(fasta_path, str):
        raise TypeError(f"fasta_path must be a `str` or `pathlib.Path`, got: {type(fasta_path)}")

    if isinstance(faidx_path, Path):
        faidx_path = str(faidx_path)
    elif not isinstance(faidx_path, str) and faidx_path is not None:
        raise TypeError(f"faidx_path must be a `str`, `pathlib.Path`, or None. got: {type(faidx_path)}")

    if ignore_existing_fai:
        self.reader = PyIndexedMmapFastaReader(fasta_path, ignore_existing_fai=ignore_existing_fai)
    elif faidx_path is not None:
        self.reader = PyIndexedMmapFastaReader.from_fasta_and_faidx(fasta_path, faidx_path)
    else:
        # Builds a FAIDX object in memory by default.
        self.reader = PyIndexedMmapFastaReader(fasta_path)

    self.records: Dict[str, PyFaidxRecord] = {record.name: record for record in self.reader.records()}

create_faidx(fasta_filename) staticmethod

Create a FAI index for a FASTA file, the result is saved in the same location as fasta_filename, with a .fai extension.

Parameters:

Name Type Description Default
fasta_filename str

Path to the FASTA file to be indexed.

required
Source code in bionemo/noodles/nvfaidx.py
156
157
158
159
160
161
162
163
164
165
@staticmethod
def create_faidx(fasta_filename: str | Path) -> str:
    """Create a FAI index for a FASTA file, the result is saved in the same location as `fasta_filename`, with a .fai extension.

    Args:
        fasta_filename (str): Path to the FASTA file to be indexed.
    """
    if isinstance(fasta_filename, Path):
        fasta_filename = str(fasta_filename)
    return PyIndexedMmapFastaReader.create_faidx(fasta_filename)

SequenceAccessor

SequenceAccessor provides a dictionary-like interface to a single sequence in an indexed FASTA file.

This allows for random access to the sequence, either by index, or by slice.

Source code in bionemo/noodles/nvfaidx.py
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
class SequenceAccessor:
    """SequenceAccessor provides a dictionary-like interface to a single sequence in an indexed FASTA file.

    This allows for random access to the sequence, either by index, or by slice.
    """

    def __init__(self, reader: PyIndexedMmapFastaReader, seqid: str, length: int) -> None:
        """Construct a SequenceAccessor object.

        Args:
            reader (PyIndexedMmapFastaReader): The indexed reader object that provides access to the underlying FASTA file.
            seqid (str): The sequence identifier.
            length (int): The length of the sequence.
        """
        self.reader = reader
        self.seqid = seqid
        self.length = length

    def __getitem__(self, key: int | slice) -> str:  # noqa: D105
        if isinstance(key, slice):
            # Provide defaults for missing arguments in the slice.
            start = key.start if key.start is not None else 0
            stop = key.stop if key.stop is not None else self.length

            # Handle negative cases, remember, you can be arbitrarily negative in a slice.
            if start < 0:
                start += self.length
            if stop < 0:
                stop += self.length

            # Clamp normalized indices to valid range
            start = max(0, min(self.length, start))
            stop = max(0, min(self.length, stop))

            # Bounds checking after normalization
            if start > stop:
                return ""  # Return empty string for an empty slice

            # Construct region string
            region_str = f"{self.seqid}:{start + 1}-{stop}"  # +1 for 1-based indexing
            return self.reader.read_sequence_mmap(region_str)

        elif isinstance(key, int):
            # Normalize single integer for negative indexing
            if key < 0:
                key += self.length

            # Bounds checking
            if key < 0 or key >= self.length:
                raise IndexError(f"Position {key} is out of bounds for '{self.seqid}' with length {self.length}.")

            # Query single nucleotide by creating a 1-length region
            region_str = f"{self.seqid}:{key + 1}-{key + 1}"  # +1 for 1-based indexing
            return self.reader.read_sequence_mmap(region_str)

        else:
            raise TypeError("Index must be an integer or a slice.")

__init__(reader, seqid, length)

Construct a SequenceAccessor object.

Parameters:

Name Type Description Default
reader PyIndexedMmapFastaReader

The indexed reader object that provides access to the underlying FASTA file.

required
seqid str

The sequence identifier.

required
length int

The length of the sequence.

required
Source code in bionemo/noodles/nvfaidx.py
35
36
37
38
39
40
41
42
43
44
45
def __init__(self, reader: PyIndexedMmapFastaReader, seqid: str, length: int) -> None:
    """Construct a SequenceAccessor object.

    Args:
        reader (PyIndexedMmapFastaReader): The indexed reader object that provides access to the underlying FASTA file.
        seqid (str): The sequence identifier.
        length (int): The length of the sequence.
    """
    self.reader = reader
    self.seqid = seqid
    self.length = length