QuartzNet

QuartzNet is a version of the Jasper [ASR-MODELS1] model with separable convolutions and larger filters. It can achieve performance similar to Jasper with an order of magnitude fewer parameters. As with Jasper, models in the QuartzNet family are denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D separable convolution, batch normalization, ReLU, and dropout.

These models are trained on the Google Speech Commands dataset (V1, all 30 classes).

[Figure: QuartzNet model]
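To illustrate why separable convolutions reduce the parameter count so sharply, the following minimal sketch compares the weight count of a standard 1-D convolution with that of a depthwise-separable one. The channel and kernel sizes are assumed for illustration only and are not taken from any QuartzNet configuration; biases and batch-normalization parameters are ignored.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1-D convolution (bias ignored)."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise-separable 1-D convolution (bias ignored).

    A depthwise step applies one filter of length kernel_size per input
    channel; a pointwise (kernel size 1) step then mixes the channels.
    """
    depthwise = kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative sizes only (assumed, not from a QuartzNet config).
c_in, c_out, k = 256, 256, 33

standard = conv1d_params(c_in, c_out, k)
separable = separable_conv1d_params(c_in, c_out, k)
ratio = standard / separable

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {ratio:.1f}x")
```

The reduction grows with the kernel size, which is one reason a separable design can afford larger filters at a small parameter cost.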

Note

This checkpoint was trained on LibriSpeech [2] and the full “validated” part of English Mozilla Common Voice [1].

QuartzNet paper.

These QuartzNet models were trained for 200 epochs using mixed precision on 2 GPUs with a batch size of 128. On 2 Quadro GV100 GPUs, training takes approximately 1 hour.

Network                      Dataset               Results
QuartzNet3x1 (77k params)    Speech Commands V1    97.32% Test
QuartzNet3x2 (93k params)    Speech Commands V1    97.69% Test
QuartzNet3x1 (77k params)    Speech Commands V2    97.12% Test
QuartzNet3x2 (93k params)    Speech Commands V2    97.29% Test
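The network names in the results above follow the BxR convention described earlier. As a small sketch, the hypothetical helper below (not part of any library) splits such a name into its block count B and sub-block count R, assuming the name has the form `QuartzNet<B>x<R>` with an optional underscore after `QuartzNet`.

```python
import re


def parse_quartznet_name(name):
    """Return (B, R) for a name such as 'QuartzNet3x2'.

    Hypothetical helper for illustration: B is the number of blocks and
    R is the number of convolutional sub-blocks within each block.
    """
    match = re.fullmatch(r"QuartzNet_?(\d+)x(\d+)", name)
    if match is None:
        raise ValueError(f"not a QuartzNet BxR name: {name!r}")
    return int(match.group(1)), int(match.group(2))


for name in ("QuartzNet3x1", "QuartzNet3x2"):
    blocks, sub_blocks = parse_quartznet_name(name)
    # B blocks of R sub-blocks each give B * R sub-blocks in the main body.
    print(name, "B =", blocks, "R =", sub_blocks, "sub-blocks =", blocks * sub_blocks)
```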

References