SPEECH CODING SAMPLES FOR CROSS-MODULE RESIDUAL LEARNING (CMRL)
AND COLLABORATIVE QUANTIZATION (CQ)

Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We previously proposed a cross-module residual learning (CMRL) pipeline as a module carrier, in which each autoencoder module reconstructs the residual left by its preceding modules. By using linear predictive coding (LPC) as a pre-processor, CMRL showed speech quality comparable to that of state-of-the-art codecs at ~24 kbps. However, its performance is less competitive at lower bitrates.
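The following is a minimal sketch of the cross-module residual learning idea: a cascade of autoencoder modules in which each module codes the residual left by the modules before it. The layer sizes, number of modules, and frame length are illustrative assumptions, not the exact configuration used in CMRL.

import torch
import torch.nn as nn

class AutoencoderModule(nn.Module):
    """One residual-coding module: encodes its input and reconstructs it."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=9, padding=4), nn.Tanh(),
            nn.Conv1d(channels, channels, kernel_size=9, padding=4), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=9, padding=4), nn.Tanh(),
            nn.Conv1d(channels, 1, kernel_size=9, padding=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

class CMRLSketch(nn.Module):
    """Cascade of modules; module i codes the residual left by modules 1..i-1."""
    def __init__(self, num_modules: int = 2):
        super().__init__()
        self.coders = nn.ModuleList(AutoencoderModule() for _ in range(num_modules))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        reconstruction = torch.zeros_like(x)
        for coder in self.coders:
            out = coder(residual)                 # code what earlier modules missed
            reconstruction = reconstruction + out # final output is the sum of all modules
            residual = residual - out             # pass the remaining error downstream
        return reconstruction

# Usage: a batch of 512-sample frames, e.g. the residual signal after the
# LPC pre-processor has removed the spectral envelope (hypothetical shapes).
frames = torch.randn(8, 1, 512)
model = CMRLSketch(num_modules=2)
recon = model(frames)
loss = torch.mean((frames - recon) ** 2)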

We now propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network; rather, it bridges the computational capacity of advanced neural network models with traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. This helps CQ achieve much higher quality than its predecessor at ~9 kbps, with even lower model complexity.
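One way to make the LPC codebook learnable alongside the residual coder is a differentiable ("soft-to-hard") quantizer with trainable centers. The sketch below illustrates that idea; the codebook sizes, softmax sharpness, and tensor shapes are illustrative assumptions, not the exact settings of CQ.

import torch
import torch.nn as nn

class SoftQuantizer(nn.Module):
    """Maps each scalar to a learned codebook entry: soft assignment during
    training keeps the operation differentiable; hard assignment at test time
    yields the indices that would actually be transmitted."""
    def __init__(self, num_centers: int = 32, alpha: float = 100.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.linspace(-1.0, 1.0, num_centers))
        self.alpha = alpha  # softmax sharpness; larger -> closer to hard assignment

    def forward(self, x: torch.Tensor, hard: bool = False) -> torch.Tensor:
        dist = (x.unsqueeze(-1) - self.codebook) ** 2        # (..., num_centers)
        if hard:
            idx = dist.argmin(dim=-1)                        # indices to transmit
            return self.codebook[idx]
        weights = torch.softmax(-self.alpha * dist, dim=-1)  # soft assignment
        return (weights * self.codebook).sum(dim=-1)

# Usage: one quantizer for the LPC coefficients and one for the residual code,
# trained jointly so the residual coder can compensate for LPC quantization
# error (all shapes below are hypothetical stand-ins).
lpc_quant = SoftQuantizer(num_centers=16)
res_quant = SoftQuantizer(num_centers=32)
lpc_coeffs = torch.randn(8, 16) * 0.3    # per-frame LPC coefficients
residual_code = torch.randn(8, 64)       # neural code of the LPC residual
q_lpc = lpc_quant(lpc_coeffs)            # soft quantization during training
q_res = res_quant(residual_code)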

Bitrate   | Reference      | AMR-WB         | Opus           | LPC-CMRL (previous version) | CQ (newly proposed)
~9 kbps   | (audio sample) | (audio sample) | (audio sample) | (audio sample)              | (audio sample)
~9 kbps   | (audio sample) | (audio sample) | (audio sample) | (audio sample)              | (audio sample)
~24 kbps  | (audio sample) | (audio sample) | (audio sample) | (audio sample)              | (audio sample)
~24 kbps  | (audio sample) | (audio sample) | (audio sample) | (audio sample)              | (audio sample)