Types

`BertSample`

Bases: TypedDict

The type expected by NeMo/Megatron for a single dataset item.

Attributes:

Name	Type	Description
`text`	`Tensor`	The tokenized, masked input text.
`types`	`Tensor`	The token type ids, if applicable.
`attention_mask`	`Tensor`	A mask over all valid tokens, excluding padding.
`labels`	`Tensor`	The true values of the masked tokens at each position covered by loss_mask.
`loss_mask`	`Tensor`	The mask over the text indicating which tokens are masked and should be predicted.
`is_random`	`Tensor`	??

Source code in bionemo/llm/data/types.py

from typing import TypedDict

from torch import Tensor


class BertSample(TypedDict):
    """The type expected by NeMo/Megatron for a single dataset item.

    Attributes:
        text: The tokenized, masked input text.
        types: The token type ids, if applicable.
        attention_mask: A mask over all valid tokens, excluding padding.
        labels: The true values of the masked tokens at each position covered by loss_mask.
        loss_mask: The mask over the text indicating which tokens are masked and should be predicted.
        is_random: ??
    """

    text: Tensor
    types: Tensor
    attention_mask: Tensor
    labels: Tensor
    loss_mask: Tensor
    is_random: Tensor
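As a rough illustration of how the fields relate to one another, the sketch below builds one sample by hand. It assumes PyTorch is installed and redefines `BertSample` locally so the snippet is self-contained. The token ids, `MASK_ID`, and `PAD_ID` are made-up values, and the `is_random` entry is only a placeholder because its meaning is not documented here. What `labels` holds outside the positions selected by `loss_mask` is also an assumption (the original ids are simply kept).

```python
from typing import TypedDict

import torch
from torch import Tensor


# Local mirror of the documented TypedDict so the sketch runs on its own;
# in practice the class lives in bionemo/llm/data/types.py.
class BertSample(TypedDict):
    text: Tensor
    types: Tensor
    attention_mask: Tensor
    labels: Tensor
    loss_mask: Tensor
    is_random: Tensor


MASK_ID = 4  # hypothetical mask token id
PAD_ID = 0   # hypothetical padding token id

# Made-up token ids: four real tokens followed by two padding tokens.
original = torch.tensor([5, 9, 7, 6, PAD_ID, PAD_ID])

# Positions 1 and 3 are the ones the model should predict.
loss_mask = torch.tensor([False, True, False, True, False, False])

# The model input: selected positions are replaced by the mask token.
text = original.clone()
text[loss_mask] = MASK_ID

sample: BertSample = {
    "text": text,
    "types": torch.zeros_like(original),    # a single segment, so all zeros
    "attention_mask": original != PAD_ID,   # True for real tokens, False for padding
    "labels": original.clone(),             # assumption: original ids kept everywhere
    "loss_mask": loss_mask,
    "is_random": torch.zeros_like(original),  # placeholder; semantics undocumented
}

# Only the labels under loss_mask are meant to contribute to the loss.
targets = sample["labels"][sample["loss_mask"]]
```

Because `TypedDict` performs no runtime validation, a static type checker is what catches a missing or misspelled key in a dictionary like this one.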

`Tokenizer`

Bases: Protocol

Required attributes for a tokenizer provided to apply_bert_pretraining_mask.

Source code in bionemo/llm/data/types.py

from typing import Protocol


class Tokenizer(Protocol):
    """Required attributes for a tokenizer provided to apply_bert_pretraining_mask."""

    @property
    def mask_token_id(self) -> int | None:  # noqa: D102
        ...

    @property
    def all_special_ids(self) -> list[int]:  # noqa: D102
        ...
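Since `Tokenizer` is a `Protocol`, any object exposing these two read-only attributes satisfies it structurally, with no inheritance required. The sketch below illustrates this with a hypothetical `ToyTokenizer` and a hypothetical helper, `maskable_positions`. Neither is part of bionemo, and the helper is not meant to reflect how apply_bert_pretraining_mask actually uses the tokenizer. The protocol is mirrored locally so the snippet runs without bionemo installed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


# Local mirror of the documented protocol, so the sketch is self-contained.
class Tokenizer(Protocol):
    @property
    def mask_token_id(self) -> int | None: ...

    @property
    def all_special_ids(self) -> list[int]: ...


@dataclass(frozen=True)
class ToyTokenizer:
    """Hypothetical tokenizer; plain read-only fields satisfy the protocol's properties."""

    mask_token_id: int | None = 4
    all_special_ids: list[int] = field(default_factory=lambda: [0, 1, 2, 3, 4])


def maskable_positions(token_ids: list[int], tokenizer: Tokenizer) -> list[int]:
    """Return the indices of tokens that are not special tokens (hypothetical helper)."""
    special = set(tokenizer.all_special_ids)
    return [i for i, tok in enumerate(token_ids) if tok not in special]


# ToyTokenizer never subclasses Tokenizer, yet it is accepted structurally.
positions = maskable_positions([1, 5, 9, 7, 2, 0], ToyTokenizer())
```

Note that `mask_token_id` is typed as `int | None`, so code consuming this protocol has to handle a tokenizer that has no mask token.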