Packing

Packing (sometimes also called sequence packing) enables you to selectively combine multiple input samples into a single sample, for example depending on their length.

This technique is commonly used with large language models when the input samples have very different lengths, which would otherwise lead to lots of padding and hence wasted compute.

This section explains how you can pack samples together and utilize the full context length.

How to pack samples on the fly

To use packing, you need to implement the TaskEncoder methods select_samples_to_pack and pack_selected_samples. Furthermore, you need to initialize the loader with the packing_buffer_size argument set to a non-zero number.
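
For example, a loader with packing enabled could be set up as follows. This is a minimal sketch: the dataset path and buffer sizes are placeholders, and the task encoder is the one developed below.

from megatron.energon import WorkerConfig, get_loader, get_train_dataset

loader = get_loader(
    get_train_dataset(
        "/my/dataset/path",  # placeholder dataset path
        batch_size=4,
        shuffle_buffer_size=100,
        max_samples_per_sequence=100,
        packing_buffer_size=1000,  # a non-zero value enables packing
        task_encoder=PackingCaptioningTaskEncoder(),  # defined below
        worker_config=WorkerConfig.default_worker_config(),
    )
)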

The select_samples_to_pack method will receive a list of samples (its size determined by the selected packing_buffer_size) and should partition those samples into groups that shall be packed together. Hence, the function returns a list of lists of samples.

For each group, the second method pack_selected_samples will be called. You need to implement how a group of samples is mapped to a single sample. For LLMs, for example, this method might concatenate the input tokens.

Note

You can set the __restore_key__ of the packed sample to an empty tuple, since energon will set the correct restore key afterwards, based on the samples that went in.

Warning

To handle attention masks and tokenized inputs, you will want to operate on a different sample type. The pack_selected_samples method may return a different sample type, which is then expected as the input to the batch method.
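
As a concrete illustration, the two sample types below are hypothetical and made up for this sketch (they are not part of energon): encode_sample would return a TokenizedSample, while pack_selected_samples would return a PackedTokenizedSample, which the batch method then consumes.

from dataclasses import dataclass
from typing import List

import torch

from megatron.energon import Sample


@dataclass
class TokenizedSample(Sample):
    # Hypothetical type returned by encode_sample: one tokenized sequence.
    tokens: torch.Tensor


@dataclass
class PackedTokenizedSample(Sample):
    # Hypothetical type returned by pack_selected_samples and expected as the
    # input to the batch method: the concatenated tokens plus the lengths of
    # the original samples, from which e.g. a block-diagonal attention mask
    # can be constructed.
    tokens: torch.Tensor
    sample_lengths: List[int]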

It is important to mark custom methods like encode_sample and pack_selected_samples as @stateless to allow saving samples for packing. If augmentations are applied, the method should be marked with @stateless(restore_seeds=True) so that the seeds are set deterministically based on the TaskEncoder.current_sample_index. You have to make sure the methods are actually stateless, meaning that they produce the same output when invoked with the same input and random state.

Example of packing for a large language model, extending the example from the Task Encoders section:

class PackingCaptioningTaskEncoder(CaptioningTaskEncoder):
    """This class extends the CaptioningTaskEncoder and adds select_samples_to_pack and pack_selected_samples for packing samples
    efficiently on-the-fly.
    Set the `packing_buffer_size` argument of get_(train|val)_dataset to a sufficiently large number to get a
    properly sized input sample buffer with good diversity.
    """

    @stateless(restore_seeds=True)
    def encode_sample(self, ...):
        # The `stateless` decorator allows saving samples for packing. With restore_seeds=True,
        # the seeds are set deterministically based on self.current_sample_index.
        ...

    def select_samples_to_pack(self, samples: List[CaptioningSample]) -> List[List[CaptioningSample]]:
        # Do something intelligent here, e.g. sort by caption length and concat where possible.
        # This could be better, but it's just an example.
        samples.sort(key=lambda x: len(x.caption))
        groups = []
        while len(samples) > 0:
            # Start each group with one sample, so that a single over-long sample
            # cannot produce an empty group and stall the loop.
            batch = [samples.pop(0)]
            caption_len = len(batch[0].caption)
            while len(samples) > 0 and caption_len + len(samples[0].caption) <= self.max_length:
                sample = samples.pop(0)
                batch.append(sample)
                caption_len += len(sample.caption)
            groups.append(batch)
        return groups
    
    @stateless
    def pack_selected_samples(self, samples: List[CaptioningSample]) -> CaptioningSample:
        # Construct a new CaptioningSample by concatenating the captions
        ...
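
For illustration, the elided body above could be filled in with the following minimal sketch. It assumes `import dataclasses` at the top of the file and simply keeps the image of the first sample; a real implementation would also combine the images.

    @stateless
    def pack_selected_samples(self, samples: List[CaptioningSample]) -> CaptioningSample:
        # Derive the packed sample from the first sample; dataclasses.replace
        # keeps all fields that are not overridden here.
        return dataclasses.replace(
            samples[0],
            __restore_key__=(),  # energon sets the correct restore key afterwards
            caption=" ".join(sample.caption for sample in samples),
        )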