Data Decoding

When iterating your dataset, the first processing step that each sample goes through, even before Sample Loading, is the decode step.

As explained here, each sample in the underlying .tar file can have multiple entries with different extensions. Just like WebDataset itself, Energon uses WebDataset's auto decode functionality to transform the raw bytes of the tar file entries into usable objects. Images, for example, are decoded to arrays or tensors of pixel data.

In Energon, the class responsible for data decoding is DefaultDecoderWebdatasetFactory (which you usually don't need to touch directly). Its __init__ method contains the code that initializes the auto decoder and tells it which data types are decoded and how.

Decoders typically convert:

  • Text to strings (instead of bytes)

  • JSON to dictionaries and lists

  • Images, videos, and audio to pixel tensors or audio sample tensors

    • For other options see below

  • Torch PTH files to tensors

  • NPY files to numpy arrays
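Conceptually, the decode step is a dispatch on each tar entry's extension. The following is a simplified sketch of that idea, not Energon's actual code (the real implementation uses WebDataset's auto decoder and handles far more types):

```python
import json

def decode_entry(extension: str, raw: bytes):
    """Simplified sketch of per-entry auto decoding (illustrative only)."""
    if extension in ("txt", "text"):
        return raw.decode("utf-8")  # text -> str
    if extension == "json":
        return json.loads(raw)      # JSON -> dict / list
    # images, video/audio, .pth and .npy files would be handled
    # here by the real auto decoder
    return raw                      # unknown types stay as raw bytes

# A sample is a group of tar entries sharing one base name
sample = {
    "caption.txt": b"a cat on a mat",
    "meta.json": b'{"width": 640, "height": 480}',
}
decoded = {k: decode_entry(k.rsplit(".", 1)[-1], v) for k, v in sample.items()}
print(decoded["caption.txt"])         # a cat on a mat
print(decoded["meta.json"]["width"])  # 640
```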

How to control data decoding

Starting with Energon 7.0.0, the new way to configure data decoding is to specify a decoder as a class variable of your Task Encoder:

class MyTaskEncoder(DefaultTaskEncoder):
    decoder = SampleDecoder(image_decode="pilrgb")

Typically, you will just instantiate a SampleDecoder and provide the arguments to configure it, as shown above. If you do not want any automatic decoding, explicitly set decoder = None in your TaskEncoder.

Here are the different options you can pass to SampleDecoder:

  • image_decode (str)

    • Can be set to an image decoder from webdataset. Here are some examples:

      • pil: Returns the image as a PIL image

      • torchrgb: Returns the image as a torch tensor with 3 color channels.

    • For more options, check out the official documentation.

  • av_decode (str)

    • Can be one of AVDecoder, torch, pyav. The default is AVDecoder, which is explained below.

    • The option torch would decode video and audio entirely and return them as tensors.

    • The pyav option is for advanced use cases where you need direct access to the object returned by av.open().

  • video_decode_audio (bool)

    • If True, videos that have an audio track will decode both the video and the audio. Otherwise, only the video frames are decoded.

  • guess_content (bool)

    • New in Energon 7.0.0

    • Whether to guess the contents of the file using the filetype package. Useful if you have files without extensions in your data.
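The idea behind guess_content can be pictured as sniffing a file's leading magic bytes instead of trusting its extension. Below is a rough stdlib-only sketch of that idea; the real implementation uses the filetype package and recognizes many more formats:

```python
from typing import Optional

def guess_extension(raw: bytes) -> Optional[str]:
    """Guess a file type from its magic bytes (illustrative sketch only)."""
    if raw.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if raw.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if raw.startswith(b"RIFF") and raw[8:12] == b"WAVE":
        return "wav"
    return None  # unknown: the entry would be kept as raw bytes

print(guess_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # png
```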

Legacy method before Energon 7.0.0

Warning

The method of configuring auto decoding described below was deprecated in Energon 7.0.0. Please migrate to the new SampleDecoder method described above.

In older versions of Energon, decoding was configured by passing arguments to get_train_dataset or get_val_dataset. These arguments are largely identical to what can be passed to SampleDecoder above, except:

  • auto_decode (bool)

    • Set to False to disable all automatic decoding of the data; your sample loader will then receive raw bytes. The default is True.

    • Setting to False is equivalent to setting decoder = None in the new version.

  • guess_content (bool)

    • Not available in older versions
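The correspondence between the two styles can be sketched with a small helper. The function below is hypothetical and not part of Energon; it only illustrates how the legacy keyword arguments map onto the new decoder class-variable style:

```python
def legacy_to_decoder_config(auto_decode=True, **decode_kwargs):
    """Hypothetical helper: map legacy decode arguments of
    get_train_dataset / get_val_dataset to the new style."""
    if not auto_decode:
        return None  # equivalent to `decoder = None` on the TaskEncoder
    # the remaining kwargs would be passed to SampleDecoder(...) instead
    return decode_kwargs

print(legacy_to_decoder_config(auto_decode=False))      # None
print(legacy_to_decoder_config(image_decode="pilrgb"))  # {'image_decode': 'pilrgb'}
```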

AV Decoder (Audio and Video)

Energon comes with code to efficiently decode compressed video and audio files such as MP4, MP3 or WAV. It integrates a library called fastseek that allows quickly extracting sub-clips from longer videos. The actual decoding is done by PyAV, which in turn relies on FFmpeg.

When choosing AVDecoder as the option for av_decode above, an object of type AVDecoder (click to see methods) is returned. At that point the file has not yet been decoded; in your custom sample loader or in your TaskEncoder, you can read parts or all of the file using the methods of AVDecoder.

For example, to extract the first 2 seconds of a video and the 2 seconds starting at second 4:

dat = av_decoder.get_clips(
    video_clip_ranges=[(0, 2), (4, 6)],
    video_unit="seconds",
)

# Print both clips' tensor shapes
print(dat.video_clips[0].shape)
print(dat.video_clips[1].shape)

The return value of get_clips will be an object of type AVData. It may contain video tensors, audio tensors, or both, depending on what you requested. It will also contain the timestamps of the actually returned clips.

Click on AVData to see the fields and the shapes of the tensors.

You can also use helper methods to extract multiple clips at equal distances all at once:

from megatron.energon.av import get_clips_uniform

dat = get_clips_uniform(
    av_decoder=av_decoder, num_clips=5, clip_duration_seconds=3, request_audio=True
)

This would extract 5 clips, each 3 seconds long, along with the corresponding audio clips. The first clip starts at position 0 and the last clip ends at the end of the video. This is essentially a convenience wrapper around get_clips, so its code may be a good starting point if you want to write custom extraction logic.
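The clip placement described above can be sketched as follows: with n clips of duration d in a video of length T, the clip starts are spaced evenly so that the first clip begins at 0 and the last ends at T. This is a simplified illustration of the idea behind get_clips_uniform, not its actual code:

```python
def uniform_clip_ranges(video_duration, num_clips, clip_duration):
    """Evenly spaced (start, end) ranges: first clip starts at 0,
    last clip ends exactly at the end of the video."""
    step = (video_duration - clip_duration) / (num_clips - 1)
    return [
        (round(i * step, 3), round(i * step + clip_duration, 3))
        for i in range(num_clips)
    ]

print(uniform_clip_ranges(video_duration=19.0, num_clips=5, clip_duration=3.0))
# [(0.0, 3.0), (4.0, 7.0), (8.0, 11.0), (12.0, 15.0), (16.0, 19.0)]
```

Ranges computed this way are exactly the kind of (start, end) pairs that get_clips accepts via video_clip_ranges with video_unit="seconds".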

We also provide a method get_single_frames_uniform, which returns a tensor of frames directly instead of an AVData object.

The simplest case is to decode the whole video or audio or both:

dat = av_decoder.get_video()

# or
dat = av_decoder.get_audio()

# or
dat = av_decoder.get_video_with_audio()