Mel Spectrogram#

class MelExtractor#

CPU mel-spectrogram extractor.

Takes mono float32 PCM and produces a host-resident Tensor whose shape is determined by the config’s layout and padding settings. The pipeline mirrors the HF feature extractors numerically: windowing, a hand-rolled radix-2 FFT with a direct-DFT fallback for odd sizes, power spectrum, mel filter mat-mul, log, and an optional post-normalize step.
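The FFT stage described above can be sketched as follows. This is a minimal illustration under assumptions, not the extractor's actual code: the names `directDft`, `fftMixed` and `powerSpectrum` are hypothetical, double precision is used for readability, and the recursion strategy (halve while the length is even, switch to a direct DFT once it is odd) is one reading of "radix-2 with direct-DFT fallback". Under that reading, nFFT = 400 would split down to length 25 before the fallback applies.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Direct O(n^2) DFT, used once the length can no longer be halved.
inline std::vector<Complex> directDft(std::vector<Complex> const& x)
{
    std::size_t const n = x.size();
    double const pi = std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        Complex acc{0.0, 0.0};
        for (std::size_t t = 0; t < n; ++t)
        {
            // Reduce k*t modulo n so the angle stays small and accurate.
            double const angle = -2.0 * pi * static_cast<double>((k * t) % n) / static_cast<double>(n);
            acc += x[t] * std::polar(1.0, angle);
        }
        out[k] = acc;
    }
    return out;
}

// Recursive decimation-in-time FFT: split while the length is even,
// fall back to the direct DFT when the length is odd.
inline std::vector<Complex> fftMixed(std::vector<Complex> const& x)
{
    std::size_t const n = x.size();
    assert(n > 0);
    if (n == 1) { return x; }
    if (n % 2 != 0) { return directDft(x); }

    std::size_t const half = n / 2;
    std::vector<Complex> even(half), odd(half);
    for (std::size_t i = 0; i < half; ++i)
    {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    std::vector<Complex> const e = fftMixed(even);
    std::vector<Complex> const o = fftMixed(odd);

    double const pi = std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < half; ++k)
    {
        // Butterfly: combine the two half-size transforms with a twiddle factor.
        Complex const tw = std::polar(1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(n)) * o[k];
        out[k] = e[k] + tw;
        out[k + half] = e[k] - tw;
    }
    return out;
}

// Power spectrum of a real frame: |X[k]|^2 for k in [0, n/2].
inline std::vector<double> powerSpectrum(std::vector<double> const& frame)
{
    std::vector<Complex> const x(frame.begin(), frame.end());
    std::vector<Complex> const spectrum = fftMixed(x);
    std::vector<double> power(frame.size() / 2 + 1);
    for (std::size_t k = 0; k < power.size(); ++k)
    {
        power[k] = std::norm(spectrum[k]);
    }
    return power;
}
```

A production implementation would more likely run in float32 with precomputed twiddle factors (compare the sinCos member of Impl) rather than calling std::polar in the inner loop.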

Public Functions

MelExtractor() noexcept#

A default-constructed extractor is empty; runners move-assign one from makeWhisperExtractor / makeParakeetExtractor in validateAndFillConfig before any extract call.

explicit MelExtractor(MelExtractorConfig cfg)#
~MelExtractor()#
MelExtractor(MelExtractor const&) = delete#
MelExtractor &operator=(MelExtractor const&) = delete#
MelExtractor(MelExtractor&&) noexcept#
MelExtractor &operator=(MelExtractor&&) noexcept#
bool extract(AudioPCM const &pcm, Tensor &out)#

Extract mel-spectrogram from pcm into out.

pcm.sampleRate must equal config.sampleRate. This is the caller's responsibility: pass the right targetSampleRate to loadAudioBytes.

Parameters:
  • pcm – Mono float32 PCM.

  • out – Populated with mel data on host memory (caller may copy to device).

Returns:

true on success, false on bad input / config mismatch.

inline MelExtractorConfig const &config() const noexcept#
struct Impl#

Public Members

std::vector<float> windowFn#

Window coefficients; length is winLength.

std::vector<float> melFilterStorage#

Used only when config.melFilter is null.

float const *melFilterPtr = {nullptr}#
int32_t nBins = {0}#
std::vector<float> preempBuf#

Scratch buffer for full-waveform pre-emphasis.

SinCosTable sinCos#

Twiddle factors, size = cfg.nFFT (built at init).

struct MelExtractorConfig#

Configuration for one mel-spectrogram extractor instance.

Public Members

std::string name#

Display name used in log messages.

int32_t sampleRate = {16000}#
int32_t nFFT = {400}#
int32_t hopLength = {160}#
int32_t winLength = {400}#

Window length. Typically equal to nFFT for Whisper; for Parakeet, a 400-sample window inside a 512-point FFT.

int32_t nMel = {128}#
float minFrequencyHz = {0.0f}#

Min/max frequency for the mel filter bank. Default 0..sr/2 matches HF Whisper.

float maxFrequencyHz = {-1.0f}#

A negative value selects sampleRate / 2 (the Nyquist frequency).
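A small sketch of how the frequency bounds might be resolved and mapped onto the mel axis. The helper names `resolveMaxFrequencyHz` and `hzToMelHtk` are hypothetical; the HTK formula itself is the standard 2595 * log10(1 + f / 700), which is what MelScale::kHtk is assumed to denote here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// A negative maxFrequencyHz is treated as "use the Nyquist frequency".
inline float resolveMaxFrequencyHz(float maxFrequencyHz, std::int32_t sampleRate)
{
    assert(sampleRate > 0);
    return maxFrequencyHz < 0.0f ? static_cast<float>(sampleRate) / 2.0f : maxFrequencyHz;
}

// Standard HTK Hz -> mel conversion (assumed to correspond to MelScale::kHtk).
inline double hzToMelHtk(double hz)
{
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}
```

With the defaults (sampleRate 16000, maxFrequencyHz -1), the upper edge of the filter bank would resolve to 8000 Hz.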

float preemphCoeff = {0.0f}#

Pre-emphasis filter y[t] = x[t] - coeff * x[t-1]. Disabled when preemphCoeff == 0. When preemphPostScale != 0 the filtered frame is also multiplied by it.
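The filter can be sketched as follows. `applyPreemphasis` is a hypothetical helper, not part of the documented API, and two details are assumptions: the first sample is passed through unchanged (treating x[-1] as zero), and the post-scale is skipped entirely when the filter is disabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[t] = x[t] - coeff * x[t-1], optionally followed by a constant scale.
// Assumption: x[-1] is treated as 0, so y[0] == x[0] before scaling.
inline void applyPreemphasis(float const* x, std::size_t n, float coeff, float postScale, std::vector<float>& out)
{
    assert(x != nullptr || n == 0);
    out.assign(n, 0.0f);
    for (std::size_t t = 0; t < n; ++t)
    {
        if (coeff == 0.0f)
        {
            out[t] = x[t]; // Filter disabled: copy through untouched.
            continue;
        }
        float const prev = (t > 0) ? x[t - 1] : 0.0f;
        float y = x[t] - coeff * prev;
        if (postScale != 0.0f)
        {
            y *= postScale;
        }
        out[t] = y;
    }
}
```

Writing into a caller-owned vector mirrors the idea of a reusable scratch buffer such as preempBuf, so repeated calls avoid reallocating once the buffer has grown to the waveform length.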

float preemphPostScale = {0.0f}#
WindowType windowType = {WindowType::kHannPeriodic}#
bool windowCentredInFft = {true}#

Where the window sits inside the nFFT-sized FFT input buffer when winLength < nFFT. HF Whisper / Parakeet (torch.stft-style) centre the window: source [start+pad, start+pad+winLen) -> buffer [pad, pad+winLen), pad = (nFFT-winLen)/2. Left-aligned mode (unfold + rfft(n=nFFT)) maps source [start, start+winLen) -> buffer [0, winLen) with trailing zeros. Ignored when winLen == nFFT.
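The two placements can be illustrated with a short sketch. `buildFftInput` is a hypothetical helper written only from the index mapping stated above; it assumes the signal has already been padded so that every read is in range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill an nFFT-sized buffer with one windowed frame.
// centred:      source [start+pad, start+pad+winLen) -> buffer [pad, pad+winLen)
// left-aligned: source [start, start+winLen)         -> buffer [0, winLen)
inline std::vector<float> buildFftInput(std::vector<float> const& signal, std::size_t start,
    std::vector<float> const& window, std::size_t nFFT, bool centred)
{
    std::size_t const winLen = window.size();
    assert(winLen <= nFFT);
    // pad is 0 when winLen == nFFT, so the flag has no effect in that case.
    std::size_t const pad = centred ? (nFFT - winLen) / 2 : 0;
    assert(start + pad + winLen <= signal.size());

    std::vector<float> buf(nFFT, 0.0f); // Everything outside the window stays zero.
    for (std::size_t i = 0; i < winLen; ++i)
    {
        buf[pad + i] = signal[start + pad + i] * window[i];
    }
    return buf;
}
```

For a 400-sample window in a 512-point FFT, pad would be 56, so the centred mode leaves 56 zeros on each side while the left-aligned mode leaves 112 trailing zeros.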

MelScale melScale = {MelScale::kHtk}#
MelNorm melNorm = {MelNorm::kSlaney}#
bool triangulariseInMelSpace = {false}#

When true, build triangle filters with their slopes linear in mel space rather than Hz. Used together with MelScale::kKaldi.

LogType logType = {LogType::kLog10}#
LogFloorMode logFloorMode = {LogFloorMode::kMax}#
float logFloor = {1e-10f}#

Set per feature extractor: Whisper uses 1e-10, Parakeet uses 2^-24.
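How the floor interacts with the log can be sketched as below. The enums are local stand-ins, not the library's: only kMax and kLog10 appear in this reference, so `kAdd` and `kLn` are assumptions, modelled on the common "clamp then log10" (Whisper-style) and "add guard then natural log" (NeMo-style) conventions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for LogFloorMode / LogType; kAdd and kLn are assumed values.
enum class SketchFloorMode { kMax, kAdd };
enum class SketchLogType { kLog10, kLn };

// Guard a power value against log(0), then take the configured logarithm.
inline float flooredLog(float power, float floor, SketchFloorMode mode, SketchLogType type)
{
    assert(floor > 0.0f);
    float const guarded = (mode == SketchFloorMode::kMax) ? std::max(power, floor) : power + floor;
    return (type == SketchLogType::kLog10) ? std::log10(guarded) : std::log(guarded);
}
```

In the clamp variant a silent bin maps to log10(1e-10) = -10, which sets the lowest value any later normalization step can see.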

MelLayout layout = {MelLayout::kMelTime}#
PostNormalize postNormalize = {PostNormalize::kWhisperClamp}#
TimePadding timePadding = {TimePadding::kNone}#
int32_t staticTimeLength = {0}#
FramePadding framePadding = {FramePadding::kLeftAlignedZero}#
bool dropLastStftFrame = {false}#

When true, drop the last STFT frame before mel filter / log / post-norm, matching HF Whisper / Parakeet’s stft[..., :-1] and the original Whisper reference. Without this the post-normalize statistics (whisper max-clamp, parakeet per-feature mean/std) drift from HF by O(1e-1) at frame boundaries even though the underlying mel-power values are byte-identical.
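A toy example of why the extra frame matters for the Whisper-style clamp. The sketch assumes PostNormalize::kWhisperClamp follows the HF Whisper formula (clamp to the global maximum minus 8, then (x + 4) / 4); `whisperClamp` is a hypothetical helper, and the input values are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// HF Whisper-style post-normalization over a whole log-mel buffer:
// v = (max(v, globalMax - 8) + 4) / 4.
inline void whisperClamp(std::vector<float>& logMel)
{
    assert(!logMel.empty());
    float const maxVal = *std::max_element(logMel.begin(), logMel.end());
    for (float& v : logMel)
    {
        v = (std::max(v, maxVal - 8.0f) + 4.0f) / 4.0f;
    }
}
```

Because the clamp threshold depends on the global maximum, a single loud trailing frame changes the output of every other frame: normalizing {-10, -9, 1} yields {-0.75, -0.75, 1.25}, while normalizing {-10, -9} alone yields {-1.5, -1.25}.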

float const *melFilter = {nullptr}#

Pointer to a precomputed mel filter bank of shape [nMel × (nFFT/2 + 1)] in row-major order. Generated offline by scripts/gen_mel_filter_bank.py and embedded as a static array. The pointed-to table must outlive the extractor (typically it is a static constexpr table).
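The row-major layout implies the indexing used in the sketch below, where row m of the table holds the weights of mel band m across all nFFT/2 + 1 frequency bins. `applyMelFilter` is a hypothetical helper showing a plain mat-vec for one frame, not the extractor's own routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply a row-major [nMel x nBins] filter bank by one power-spectrum frame.
inline std::vector<float> applyMelFilter(float const* melFilter, std::size_t nMel, std::size_t nBins,
    std::vector<float> const& power)
{
    assert(melFilter != nullptr);
    assert(power.size() == nBins);
    std::vector<float> mel(nMel, 0.0f);
    for (std::size_t m = 0; m < nMel; ++m)
    {
        float const* row = melFilter + m * nBins; // Start of mel band m.
        float acc = 0.0f;
        for (std::size_t k = 0; k < nBins; ++k)
        {
            acc += row[k] * power[k];
        }
        mel[m] = acc;
    }
    return mel;
}
```

Passing a pointer to a static table, as in `static constexpr float kTable[] = {...};`, satisfies the lifetime requirement without any copy.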