utils

Utils for speculative decoding.

Classes

AcceptanceRateValidation

Base acceptance rate (AR) validation class.

ResBlock

A Residual Block module.

Functions

get_default_attention_mask_and_position_ids

Compute default attention_mask and position_ids given input_ids.

right_padding

Pad zeros to the right so that the length of padded_input_ids is a multiple of tp.

tree_decode

Decode tokens using the tree.

class AcceptanceRateValidation

Bases: object

Base acceptance rate (AR) validation class.

This class is used to validate the AR within ModelOpt. self.validate is the main function to validate the AR given a prompt or input_ids. Note: currently it only supports TP.

__init__(model, tokenizer, tp)

Init function to take in the model and tokenizer.

check_draft(ground_truth, input_ids, draft_tokens, tree=None)

This function checks whether the draft tokens should be accepted, i.e. whether they match the ground truth.

If tree is None, it is eager mode.
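
In eager mode, acceptance can be thought of as counting the leading draft tokens that match the ground-truth continuation; once a mismatch occurs, the remaining drafts are rejected. A minimal sketch of that check, assuming 1-D tensors of token ids (the helper name and shapes are illustrative, not the ModelOpt implementation):

    import torch

    def count_accepted_eager(ground_truth: torch.Tensor, draft_tokens: torch.Tensor) -> int:
        """Count leading draft tokens that match the ground truth (eager mode)."""
        matches = (draft_tokens == ground_truth[: draft_tokens.numel()]).int()
        # cumprod turns [1, 1, 0, 1] into [1, 1, 0, 0]: only the leading run of matches counts.
        return int(torch.cumprod(matches, dim=0).sum().item())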

get_ground_truth(input_ids, osl)

This function returns ground truth token ids from the base model.

This function will be implemented in the plugins.

Args:

  • input_ids (torch.Tensor) – the token ids of the input

  • attention_mask (torch.Tensor) – attention mask of the input

  • osl (int) – output sequence length
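
Since the concrete behavior lives in the plugins, the following is only a hedged sketch of what an implementation might do: greedy-decode osl tokens from the base model. It assumes a Hugging Face-style model whose forward call returns an object with a .logits field:

    import torch

    @torch.no_grad()
    def greedy_ground_truth(model, input_ids: torch.Tensor, osl: int) -> torch.Tensor:
        """Greedy-decode osl tokens from the base model (illustrative only)."""
        ids = input_ids
        for _ in range(osl):
            logits = model(input_ids=ids).logits          # assumed HF-style output
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
        return ids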

tokenize(prompt)

Apply chat template to the prompt and get input_ids.

validate(osl, prompt=None, input_ids=None, ground_truth=None, tree=None, steps=1)

This function validates the AR of the model given the input sequence.
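
A hedged usage sketch; the argument values below are illustrative examples, and the exact form of the return value is not specified here:

    # Illustrative usage only; tp=1, osl=128, and the prompt are example values.
    validator = AcceptanceRateValidation(model, tokenizer, tp=1)
    result = validator.validate(osl=128, prompt="Explain speculative decoding.", steps=1)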

class ResBlock

Bases: Module

A Residual Block module.

This module performs a linear transformation followed by a SiLU activation, and then adds the result to the original input, creating a residual connection.

Parameters:

hidden_size (int) – The size of the hidden layers in the block.

__init__(hidden_size, bias=True)

Init function of ResBlock.

Args:

  • hidden_size (int) – The size of the hidden layers in the block.

forward(x)

Forward pass of the ResBlock.

Parameters:

x (torch.Tensor) – Input tensor.

Returns:

Output after the residual connection and activation.

Return type:

torch.Tensor
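
A minimal sketch of the computation described above (linear transformation, SiLU activation, then a residual add back to the input); it is not necessarily the exact ModelOpt implementation:

    import torch
    import torch.nn as nn

    class ResBlockSketch(nn.Module):
        """Linear -> SiLU, added back to the input (residual connection)."""

        def __init__(self, hidden_size: int, bias: bool = True):
            super().__init__()
            self.linear = nn.Linear(hidden_size, hidden_size, bias=bias)
            self.act = nn.SiLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.act(self.linear(x))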

get_default_attention_mask_and_position_ids(input_ids)

Compute default attention_mask and position_ids given input_ids.

Parameters:

input_ids (Tensor) –
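
A common default is a causal (lower-triangular) attention mask together with sequential position ids. The sketch below assumes a 2-D input_ids of shape (batch, seq_len); exact shapes and dtypes in ModelOpt may differ:

    import torch

    def default_mask_and_positions(input_ids: torch.Tensor):
        """Causal attention mask and 0..seq_len-1 position ids (illustrative)."""
        batch, seq_len = input_ids.shape
        attention_mask = torch.tril(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device)
        ).unsqueeze(0).expand(batch, -1, -1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch, -1)
        return attention_mask, position_ids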

right_padding(input_ids, tp, hidden_states=None)

Pad zeros to the right so that the length of padded_input_ids is a multiple of tp.

Parameters:
  • input_ids (Tensor) –

  • tp (int) –

  • hidden_states (Tensor) –
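
The padding amounts to rounding the sequence length up to the next multiple of tp and zero-filling on the right. A sketch, assuming input_ids of shape (batch, seq_len) and hidden_states of shape (batch, seq_len, hidden); the helper name is illustrative, not the ModelOpt implementation:

    import torch
    import torch.nn.functional as F

    def right_pad_to_multiple(input_ids, tp, hidden_states=None):
        """Zero-pad the sequence dimension on the right to a multiple of tp (illustrative)."""
        pad = (-input_ids.shape[-1]) % tp
        if pad:
            input_ids = F.pad(input_ids, (0, pad), value=0)
            if hidden_states is not None:
                # (0, 0) keeps the hidden dim unchanged; (0, pad) extends the sequence dim.
                hidden_states = F.pad(hidden_states, (0, 0, 0, pad), value=0.0)
        return input_ids, hidden_states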

tree_decode(draft_logits, tree)

Decode tokens using the tree.

Parameters:
  • draft_logits (List[torch.Tensor]) – a list of logits. Each logits tensor represents a future position.

  • tree (List[List[int]]) – a tree for decoding. Each sublist is a branch from the root where the number represents the topk index.
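
To make the tree convention concrete, the sketch below reads each branch as a path of top-k ranks, one per future position. It assumes each entry of draft_logits is a 1-D vocab-sized logits tensor and is illustrative rather than the ModelOpt implementation:

    import torch
    from typing import List

    def tree_decode_sketch(draft_logits: List[torch.Tensor], tree: List[List[int]]) -> List[List[int]]:
        """Pick the draft token along each tree branch (illustrative)."""
        branches = []
        for branch in tree:
            tokens = []
            for depth, rank in enumerate(branch):
                # The rank-th most likely token from the logits at this future position.
                top_ids = torch.topk(draft_logits[depth], k=rank + 1).indices
                tokens.append(int(top_ids[rank].item()))
            branches.append(tokens)
        return branches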