Abstract. For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is essential for balancing model accuracy and efficiency. We present two novel approaches: S8BQAT, inspired by Lloyd-Max compression theory, and the General Quantizer (GQ), a regularization-free method with self-adjustable centroids in a μ-Law constrained space. Without accuracy degradation, we compress both RNN-T and Conformer models to sub-8-bit precision, with some RNN-T layers quantized down to 1 bit. Benchmarking on physical devices shows a 30.73% smaller memory footprint and 31.75% lower user-perceived latency compared to standard 8-bit QAT.
Motivation
On-device ASR systems face a fundamental challenge: delivering high accuracy while operating under strict memory and latency constraints. Traditional quantization methods suffer from three key limitations:
- Fixed centroids: Quantization levels must be predetermined and cannot adapt during training
- Heavy regularization: Complex training procedures with difficult-to-tune hyperparameters
- Accuracy degradation: Significant performance loss when pushing below 8-bit precision
Our work addresses these limitations through two complementary approaches that draw inspiration from classical signal compression theory.
Impact & Results
Real-World Deployment
This work enabled deployment of more accurate ASR models on millions of Amazon Alexa devices worldwide, improving user experience while reducing computational costs and energy consumption.
Key Results
- Memory reduction: 30.73% vs. 8-bit QAT
- User-perceived latency reduction: 31.75%
- Extreme case: 1-bit quantization for some RNN-T layers
Validated on physical devices: These improvements were measured through actual device benchmarking on Amazon Alexa hardware, not simulation.
S8BQAT: Accuracy Improvement Through Compression
Key insight: By compressing to sub-8-bit, we free up memory to fit larger, more accurate models—and they're still faster than the 8-bit baseline.
| Configuration | Model Size | WER Change | Latency |
| --- | --- | --- | --- |
| Baseline (8-bit QAT) | Standard | — | — |
| S8BQAT (sub-8-bit) | Larger model | 4-16% relative WER reduction | 5% faster (despite larger size) |
General Quantizer: Extreme Compression
| Model Architecture | Quantization Level | Accuracy Impact |
| --- | --- | --- |
| RNN-T (Encoder/Decoder) | Sub-8-bit (some layers to 1-bit) | No degradation |
| Conformer (Attention-based) | Sub-8-bit | No degradation |
Method 1: S8BQAT
Lloyd-Max Inspired Quantization with MRACos Regularizer
Key Innovation
We leverage Lloyd-Max compression theory—a foundational technique in lossy data compression—and introduce a Multi-Regional Absolute Cosine (MRACos) regularizer that aggregates weights toward optimal quantization centroids during training.
32-bit Baseline → Lloyd-Max Centroids → MRACos Training → Sub-8-bit
The MRACos regularizer acts as a pseudo-compressor, pushing weights toward their nearest centroid without requiring complex hyperparameter tuning. This approach enables us to increase model capacity within the same memory budget, actually improving accuracy while maintaining efficiency.
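To make this concrete, here is a minimal sketch in PyTorch: the centroids come from the standard 1-D Lloyd-Max iteration, and the penalty below is a simplified stand-in for MRACos (a plain distance-to-nearest-centroid term rather than the paper's multi-regional absolute-cosine form). The helper names and the `lambda_q` regularization weight are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the S8BQAT recipe (simplified): 1) derive sub-8-bit centroids from
# the weight distribution with the 1-D Lloyd-Max iteration, 2) penalize each
# weight's distance to its nearest centroid so training aggregates weights
# around the quantization grid.
import torch

def lloyd_max_centroids(w: torch.Tensor, num_levels: int = 16, iters: int = 20):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and conditional means."""
    w = w.detach().flatten()
    # Initialize centroids on evenly spaced quantiles of the weight distribution.
    centroids = torch.quantile(w, torch.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        assign = torch.argmin((w[:, None] - centroids[None, :]).abs(), dim=1)
        for k in range(num_levels):
            mask = assign == k
            if mask.any():
                centroids[k] = w[mask].mean()  # centroid condition: conditional mean
    return centroids

def centroid_attraction_penalty(w: torch.Tensor, centroids: torch.Tensor):
    """Simplified stand-in for MRACos: the penalty is zero exactly on the centroid grid."""
    dist = (w.flatten()[:, None] - centroids[None, :]).abs()
    return dist.min(dim=1).values.mean()

# Hypothetical usage inside a training step (lambda_q weights the regularizer):
# loss = task_loss + lambda_q * sum(
#     centroid_attraction_penalty(p, lloyd_max_centroids(p))
#     for p in model.parameters() if p.dim() > 1)
```

At the end of training, each weight is snapped to its nearest centroid to obtain the final sub-8-bit model; because the regularizer has already pulled weights onto the grid, this final hard quantization costs little accuracy.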
Method 2: General Quantizer (GQ)
Regularization-Free Approach with Self-Adjustable Centroids
Key Innovation
Unlike traditional methods with fixed centroids, GQ employs a "soft-to-hard" compression mechanism where quantization centroids are learned during training within a μ-Law constrained space. This eliminates the need for regularization while enabling more versatile quantization.
Soft Quantization (Learnable) → μ-Law Space Constraint → Gradual Hardening → Hard Quantization
The μ-Law constraint, borrowed from telephony and audio coding, concentrates quantization levels near zero, where most weight values cluster. The connection to audio coding is natural: just as audio codecs compress waveforms by reducing bit depth while preserving perceptual quality, our method compresses neural networks by reducing weight precision while preserving predictive accuracy.
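Below is a minimal sketch of this soft-to-hard mechanism, assuming PyTorch. The per-tensor scale, the centroid initialization, and the temperature schedule are illustrative assumptions rather than the exact GQ formulation from the paper.

```python
# Sketch of a soft-to-hard quantizer with learnable centroids in mu-Law space.
# Weights are companded with mu-Law, softly assigned to centroids through a
# temperature-controlled softmax, and the temperature is annealed during
# training so the soft assignment gradually hardens into true quantization.
import torch

def mu_law(x: torch.Tensor, mu: float = 255.0) -> torch.Tensor:
    """mu-Law companding (as in G.711), for inputs normalized to [-1, 1]."""
    return torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(mu))

def inverse_mu_law(y: torch.Tensor, mu: float = 255.0) -> torch.Tensor:
    return torch.sign(y) * ((1.0 + mu) ** y.abs() - 1.0) / mu

def soft_quantize(w: torch.Tensor, centroids: torch.Tensor, temperature: float):
    """Soft assignment in mu-Law space; a low temperature approaches hard rounding."""
    scale = w.abs().max().clamp(min=1e-8)
    y = mu_law(w / scale)                                  # compand to [-1, 1]
    logits = -(y.flatten()[:, None] - centroids[None, :]).abs() / temperature
    probs = torch.softmax(logits, dim=1)                   # soft centroid membership
    y_q = (probs * centroids[None, :]).sum(dim=1).view_as(w)
    return inverse_mu_law(y_q) * scale                     # expand back to weight space

# Learnable centroids, e.g. 2**3 = 8 levels for 3-bit weights; the temperature
# is annealed (say 1.0 -> 0.01) so training ends near hard quantization.
centroids = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 8))
```

As the temperature approaches zero, the softmax collapses onto the nearest centroid, so the trained network already sits (approximately) on the hard quantization grid and needs no separate regularization term.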
Technical Insights
Connection to Signal Compression
Both approaches draw from classical signal processing; the standard textbook formulas are given after this list:
- S8BQAT leverages Lloyd-Max quantization, a foundational technique in lossy compression that optimally places quantization levels to minimize mean squared error
- General Quantizer uses μ-Law companding, widely deployed in telephony (G.711) and audio codecs, which provides logarithmic quantization that matches human perception
- The underlying principle: model compression ≈ data compression—reduce bit depth while preserving what matters (perceptual quality for audio, predictive accuracy for models)
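For reference, the textbook forms of the two tools named above (general definitions, not notation from either paper):

```latex
% Lloyd-Max optimality for a scalar quantizer of the weight density p(w):
% decision boundaries sit midway between neighboring centroids, and each
% centroid is the conditional mean of its region (this minimizes the MSE).
\[
  b_k = \frac{c_{k-1} + c_k}{2},
  \qquad
  c_k = \frac{\int_{b_k}^{b_{k+1}} w \, p(w)\, \mathrm{d}w}
             {\int_{b_k}^{b_{k+1}} p(w)\, \mathrm{d}w}
\]

% mu-Law companding (G.711 uses mu = 255), for inputs normalized to |x| <= 1:
\[
  F(x) = \operatorname{sgn}(x)\,
         \frac{\ln\bigl(1 + \mu \lvert x \rvert\bigr)}{\ln(1 + \mu)}
\]
```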
Hardware-Aware Design: Memory Bandwidth Bottleneck
Critical insight: Sub-8-bit quantization is applied to weights only. Modern inference is often memory-bandwidth bound, not compute-bound. The bottleneck is loading weights from system memory (DRAM) to on-chip memory (GPU/accelerator).
Weight Loading: The Bandwidth Bottleneck
During inference, weights must be transferred from slow system memory to fast on-chip memory. Sub-8-bit weight quantization directly reduces this bandwidth requirement:
System memory (DRAM): slow, high capacity, ~100 GB/s bandwidth
↓ Weight transfer (bandwidth limited) ↓
On-chip memory (GPU/accelerator): fast, limited capacity, ~1-10 TB/s bandwidth
- 8-bit weights: more bytes to transfer from DRAM, higher bandwidth usage
- Sub-8-bit weights: fewer bytes to transfer from DRAM, lower bandwidth usage
Impact: Reducing weight precision from 8-bit to sub-8-bit decreases the amount of data transferred over the memory bus. Since many inference workloads are bandwidth-bound (the GPU waits for weights), this directly improves throughput and latency, which is critical for online services and real-time applications; a back-of-the-envelope example follows the list below.
- Memory bandwidth reduction: 30.73% less data to transfer from system memory
- Latency improvement: 31.75% faster inference due to reduced memory bottleneck
- Online service benefit: Higher throughput for serving multiple requests simultaneously
- Edge deployment: Lower memory footprint enables deployment on resource-constrained devices
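To make the bandwidth argument concrete, here is a back-of-the-envelope sketch; the 100M-parameter model size and the 100 GB/s DRAM bandwidth are hypothetical round numbers, not measurements from the papers.

```python
# Fewer bits per weight means fewer bytes streamed over the memory bus per
# inference pass, which dominates latency when the workload is bandwidth-bound.
def weight_transfer_ms(num_params: float, bits_per_weight: float,
                       dram_bandwidth_gbps: float = 100.0) -> float:
    """Time (ms) to stream all weights from DRAM once at the given bandwidth."""
    total_bytes = num_params * bits_per_weight / 8.0
    return total_bytes / (dram_bandwidth_gbps * 1e9) * 1e3

params = 100e6  # hypothetical 100M-parameter ASR model
for bits in (8, 5, 4):
    print(f"{bits}-bit weights: {weight_transfer_ms(params, bits):.2f} ms per pass")
# 8-bit: 1.00 ms, 5-bit: 0.62 ms, 4-bit: 0.50 ms -- a 30-40% cut in transfer
# time, consistent in spirit with the measured latency reduction above.
```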
Publications
Citations
@inproceedings{zhen2022s8bqat,
title={Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network
Accelerator with On-Device Speech Recognition},
author={Zhen, Kai and Nguyen, Hieu Duy and Chinta, Raviteja and
Susanj, Nathan and Mouchtaris, Athanasios and Afzal, Tariq and
Rastrow, Ariya},
booktitle={Interspeech},
year={2022}
}
@inproceedings{zhen2022gq,
title={Sub-8-bit quantization for on-device speech recognition:
a regularization-free approach},
author={Zhen, Kai and Radfar, Martin and Nguyen, Hieu Duy and
Strimel, Grant P. and Susanj, Nathan and Mouchtaris, Athanasios},
booktitle={IEEE SLT},
year={2023}
}