Data-driven approaches these days tend to get rid of the feature engineering step at the cost of model complexity. In speech coding, they are usually termed as "end-to-end" neural codecs which models the speech production process from scratch in the time domain.
However, model complexity matters as much as the performance as these codecs operate on low-power devices! I don't want to drain the battery of my smart agent after it decodes a verse of killing me softly...
One trick is to outsource the response of the vocal tract to linear predictive coding (LPC), a conventional but efficient DSP technique to estimate the spectral envelope of the speech signal. That said, the DNN is only used to quantize the LPC residual. This, of course, is not the end of the story as the residual is rather soft and noisy comparing to the raw waveform, and consequently very hard to be coded with a lightweight network.
Collaborative quantization addresses this issue by making the LPC analyzer a trainable module which can be optimized along with the other deep neural networks. Long story short, it finds a better pivot to allocate bits to digitalize LPC coefficients and quantize the corresponding LPC residuals.
The bitrate for the uncompressed reference signal is 256 kbps.
|Bitrate||Reference||AMR-WB||Opus||LPC-CMRL (previous version)||CQ (newly proposed)|