Sub-8-Bit Quantization for On-Device Automatic Speech Recognition

Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Grant P. Strimel, Martin Radfar, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, Ariya Rastrow
Amazon Alexa AI
Interspeech 2022 · IEEE SLT 2023
Abstract. For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is essential for balancing model accuracy and efficiency. We present two novel approaches: S8BQAT, inspired by Lloyd-Max compression theory, and the General Quantizer (GQ), a regularization-free method with self-adjustable centroids in a μ-Law constrained space. Without accuracy degradation, we compress both RNN-T and Conformer models to sub-8-bit precision, with some RNN-T layers quantized to 1 bit. Benchmarking on physical devices shows a 30.73% memory footprint reduction and a 31.75% user-perceived latency reduction compared to standard 8-bit QAT.

Motivation

On-device ASR systems face a fundamental challenge: delivering high accuracy while operating under strict memory and latency constraints. Traditional quantization methods fall short here in three ways: they rely on fixed quantization centroids, they depend on regularizers that require careful hyperparameter tuning, and they degrade accuracy once precision drops below 8 bits.

Our work addresses these limitations through two complementary approaches that draw inspiration from classical signal compression theory.

Impact & Results

Real-World Deployment

This work enabled deployment of more accurate ASR models on millions of Amazon Alexa devices worldwide, improving user experience while reducing computational costs and energy consumption.

Key Results

Memory Reduction: 30.73% (vs 8-bit QAT)
Latency Reduction: 31.75% (user-perceived)
Extreme Case: 1-bit precision (some RNN-T layers)

Validated on physical devices: These improvements were measured through actual device benchmarking on Amazon Alexa hardware, not simulation.

S8BQAT: Accuracy Improvement Through Compression

Key insight: by compressing to sub-8-bit, we free up memory to fit larger, more accurate models that are still faster than the 8-bit baseline (a back-of-the-envelope sketch follows the table below).

Configuration        | Model Size   | WER Change                   | Latency
Baseline (8-bit QAT) | Standard     | (reference)                  | (reference)
S8BQAT (sub-8-bit)   | Larger model | 4-16% relative WER reduction | 5% faster (despite larger size)
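
To make the headroom concrete, here is a back-of-the-envelope sketch in Python. The 50M-parameter figure is hypothetical, not the paper's actual model size; the point is only that a fixed weight-memory budget admits more parameters at lower precision.

```python
# Hypothetical memory-budget arithmetic: at 5 bits per weight, a model about
# 1.6x larger fits in the same budget as an 8-bit model of 50M parameters.

def weight_bytes_mb(num_params: int, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

budget_mb = weight_bytes_mb(50_000_000, 8)        # 8-bit baseline: 50 MB
params_at_5bit = int(budget_mb * 1e6 * 8 / 5)     # params that fit in 50 MB at 5 bits

print(f"8-bit budget: {budget_mb:.1f} MB for 50M params")
print(f"Same budget at 5 bits: {params_at_5bit / 1e6:.0f}M params")  # ~80M
```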

General Quantizer: Extreme Compression

Model Architecture          | Quantization Level               | Accuracy Impact
RNN-T (Encoder/Decoder)     | Sub-8-bit (some layers to 1-bit) | No degradation
Conformer (Attention-based) | Sub-8-bit                        | No degradation

Method 1: S8BQAT

Lloyd-Max Inspired Quantization with MRACos Regularizer

Key Innovation

We leverage Lloyd-Max compression theory—a foundational technique in lossy data compression—and introduce a Multi-Regional Absolute Cosine (MRACos) regularizer that aggregates weights toward optimal quantization centroids during training.
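
As an illustration of the centroid step, the sketch below derives quantization centroids from a weight tensor with Lloyd's algorithm (1-D k-means), the iterative procedure behind Lloyd-Max quantization. The paper's exact initialization and per-layer bit allocation may differ.

```python
import numpy as np

def lloyd_max_centroids(weights: np.ndarray, num_levels: int = 16, iters: int = 20) -> np.ndarray:
    """Lloyd's algorithm in 1-D: alternate nearest-centroid assignment and
    centroid update (mean of each cell) over the flattened weights."""
    w = weights.ravel()
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(num_levels):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return np.sort(centroids)

# Example: 16-level (4-bit) centroids for a Gaussian-like weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)
print(lloyd_max_centroids(W, num_levels=16))
```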

32-bit baseline → Lloyd-Max centroids → MRACos-regularized training → sub-8-bit model

The MRACos regularizer acts as a pseudo-compressor, pushing weights toward their nearest centroid without requiring complex hyperparameter tuning. This approach enables us to increase model capacity within the same memory budget, actually improving accuracy while maintaining efficiency.
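
A minimal sketch of the idea: the penalty below measures each weight's distance to its nearest centroid, which is the attraction behavior MRACos induces. The actual regularizer uses an absolute-cosine form with region-dependent weighting that this simplification omits.

```python
import numpy as np

def centroid_attraction_penalty(weights: np.ndarray, centroids: np.ndarray) -> float:
    """Mean absolute distance from each weight to its nearest centroid.
    Added to the task loss (scaled by a regularization strength), it acts as a
    pseudo-compressor that pulls weights toward the centroid grid during QAT."""
    dist = np.abs(weights.ravel()[:, None] - centroids[None, :]).min(axis=1)
    return float(dist.mean())

# Toy usage (hypothetical values; in training this term is added to the ASR loss):
centroids = np.linspace(-0.1, 0.1, 15)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(256, 256))
print(centroid_attraction_penalty(W, centroids))  # shrinks as weights cluster on the grid
```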

Method 2: General Quantizer (GQ)

Regularization-Free Approach with Self-Adjustable Centroids

Key Innovation

Unlike traditional methods with fixed centroids, GQ employs a "soft-to-hard" compression mechanism where quantization centroids are learned during training within a μ-Law constrained space. This eliminates the need for regularization while enabling more versatile quantization.
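
The sketch below illustrates only the soft-to-hard mechanism, using a fixed uniform centroid grid for clarity; in GQ the centroids themselves are trainable and constrained to a μ-Law-warped space, which is not shown here.

```python
import numpy as np

def soft_quantize(w: np.ndarray, centroids: np.ndarray, temperature: float) -> np.ndarray:
    """Soft-to-hard assignment: each weight becomes a softmax-weighted mixture
    of centroids. As temperature -> 0 the mixture collapses onto the nearest
    centroid, i.e. ordinary hard quantization."""
    logits = -np.abs(w[..., None] - centroids) / temperature   # similarity to each centroid
    logits -= logits.max(axis=-1, keepdims=True)               # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs * centroids).sum(axis=-1)

centroids = np.linspace(-0.1, 0.1, 15)     # hypothetical 15-level grid
w = np.array([0.033, -0.071, 0.004])
for t in (1e-1, 1e-2, 1e-4):               # annealing schedule: soft -> hard
    print(t, soft_quantize(w, centroids, t))
```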

Soft quantization (learnable centroids) → μ-Law space constraint → gradual hardening → hard quantization

The μ-Law constraint, borrowed from telephony and audio coding, ensures perceptually meaningful compression. This connection to audio coding is natural: just as audio codecs compress waveforms by reducing bit depth while preserving perceptual quality, our method compresses neural networks by reducing weight precision while preserving predictive accuracy.
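
For reference, here is the standard μ-Law companding pair (using the telephony value μ = 255; the paper's setting may differ), along with how a uniform grid in the compressed domain maps back to centroids that are dense near zero, where most weights concentrate.

```python
import numpy as np

MU = 255.0  # standard G.711 telephony value; assumed here for illustration

def mu_law_compress(x: np.ndarray, mu: float = MU) -> np.ndarray:
    """F(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y: np.ndarray, mu: float = MU) -> np.ndarray:
    """Inverse companding: F^-1(y) = sign(y) * ((1 + mu)^|y| - 1) / mu."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# Centroids placed uniformly in the compressed (mu-Law) domain map back to a
# non-uniform grid that is finer near zero.
uniform_in_mu_space = np.linspace(-1.0, 1.0, 9)
print(mu_law_expand(uniform_in_mu_space))
```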

Technical Insights

Connection to Signal Compression

Both approaches draw from classical signal processing: S8BQAT adapts Lloyd-Max optimal scalar quantization from lossy compression, while GQ borrows μ-Law companding from telephony and audio coding.

Hardware-Aware Design: Memory Bandwidth Bottleneck

Critical insight: Sub-8-bit quantization is applied to weights only. Modern inference is often memory-bandwidth bound, not compute-bound. The bottleneck is loading weights from system memory (DRAM) to on-chip memory (GPU/accelerator).

Weight Loading: The Bandwidth Bottleneck

During inference, weights must be transferred from slow system memory to fast on-chip memory. Sub-8-bit weight quantization directly reduces this bandwidth requirement:

System Memory (DRAM): slow, high capacity, ~100 GB/s bandwidth
    ↓ weight transfer (bandwidth limited) ↓
On-Chip Memory (GPU/accelerator): fast, limited capacity, ~1-10 TB/s bandwidth

8-bit weights: more bytes to transfer from DRAM, higher bandwidth usage.
Sub-8-bit weights: fewer bytes to transfer from DRAM, lower bandwidth usage.

Impact: Reducing weight precision from 8-bit to sub-8-bit decreases the amount of data transferred over the memory bus. Since many inference workloads are bandwidth-bound (GPU waiting for weights), this directly improves throughput and latency—critical for online services and real-time applications.
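
To get a rough sense of the arithmetic, the sketch below uses the ~100 GB/s DRAM bandwidth figure quoted above and a hypothetical 100M-parameter model (real devices overlap weight transfer with compute, so these are upper bounds, not measured latencies).

```python
# Transfer time for one full pass over the weights at a given precision and
# DRAM bandwidth (hypothetical model size; illustrates the bandwidth argument).

def transfer_ms(num_params: int, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    bytes_moved = num_params * bits_per_weight / 8
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

for bits in (8, 5, 4, 1):
    print(f"{bits}-bit weights: {transfer_ms(100_000_000, bits, 100):.2f} ms per weight pass")
```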

Publications

Citation

@inproceedings{zhen2022s8bqat,
  title={Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition},
  author={Zhen, Kai and Nguyen, Hieu Duy and Chinta, Raviteja and Susanj, Nathan and Mouchtaris, Athanasios and Afzal, Tariq and Rastrow, Ariya},
  booktitle={Interspeech},
  year={2022}
}

@inproceedings{zhen2022gq,
  title={Sub-8-bit quantization for on-device speech recognition: a regularization-free approach},
  author={Zhen, Kai and Radfar, Martin and Nguyen, Hieu Duy and Strimel, Grant P. and Susanj, Nathan and Mouchtaris, Athanasios},
  booktitle={IEEE SLT},
  year={2023}
}