Sub-8-Bit Quantization for On-Device Automatic Speech Recognition

Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Grant P. Strimel, Martin Radfar, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, Ariya Rastrow
Amazon Alexa AI
Interspeech 2022 · IEEE SLT 2023
Abstract. For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is essential for balancing model accuracy and efficiency. We present two novel approaches: S8BQAT, inspired by Lloyd-Max compression theory, and the General Quantizer (GQ), a regularization-free method with self-adjustable centroids in a μ-Law constrained space. Without accuracy degradation, we compress both RNN-T and Conformer models to sub-8-bit precision, with some RNN-T layers quantized to 1 bit. Benchmarking on physical devices shows a 30.73% memory footprint reduction and a 31.75% user-perceived latency reduction compared to standard 8-bit QAT.

Motivation

On-device ASR systems face a fundamental challenge: delivering high accuracy while operating under strict memory and latency constraints. Traditional quantization methods fall short here in three ways: they rely on fixed quantization centroids, they depend on regularizers that require careful hyperparameter tuning, and they degrade accuracy once precision drops below 8 bits.

Our work addresses these limitations through two complementary approaches that draw inspiration from classical signal compression theory.

Impact & Results

Real-World Deployment

This work enabled deployment of more accurate ASR models on millions of Amazon Alexa devices worldwide, improving user experience while reducing computational costs and energy consumption.

Key Results

Memory Reduction: 30.73% (vs 8-bit QAT)
Latency Reduction: 31.75% (user-perceived)
Extreme Case: 1-bit precision (some RNN-T layers)

Validated on physical devices: These improvements were measured through actual device benchmarking on Amazon Alexa hardware, not simulation.

S8BQAT: Accuracy Improvement Through Compression

Key insight: by compressing to sub-8-bit, we free up memory to fit larger, more accurate models that are still faster than the 8-bit baseline (a back-of-the-envelope sketch follows the table below).

Configuration        | Model Size   | WER Change                   | Latency
Baseline (8-bit QAT) | Standard     | (reference)                  | (reference)
S8BQAT (sub-8-bit)   | Larger model | 4-16% relative WER reduction | 5% faster (despite larger size)
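
To make the headroom concrete, here is a back-of-the-envelope sketch in Python. The 50M-parameter figure is hypothetical, not the paper's actual model size; the point is only that a fixed weight-memory budget admits more parameters at lower precision.

```python
# Hypothetical memory-budget arithmetic: at 5 bits per weight, a model about
# 1.6x larger fits in the same budget as an 8-bit model of 50M parameters.

def weight_bytes_mb(num_params: int, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

budget_mb = weight_bytes_mb(50_000_000, 8)        # 8-bit baseline: 50 MB
params_at_5bit = int(budget_mb * 1e6 * 8 / 5)     # params that fit in 50 MB at 5 bits

print(f"8-bit budget: {budget_mb:.1f} MB for 50M params")
print(f"Same budget at 5 bits: {params_at_5bit / 1e6:.0f}M params")  # ~80M
```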

General Quantizer: Extreme Compression

Model Architecture          | Quantization Level               | Accuracy Impact
RNN-T (Encoder/Decoder)     | Sub-8-bit (some layers to 1-bit) | No degradation
Conformer (Attention-based) | Sub-8-bit                        | No degradation

Method 1: S8BQAT

Lloyd-Max Inspired Quantization with MRACos Regularizer

Key Innovation

We leverage Lloyd-Max compression theory—a foundational technique in lossy data compression—and introduce a Multi-Regional Absolute Cosine (MRACos) regularizer that aggregates weights toward optimal quantization centroids during training.
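
As an illustration of the centroid step, the sketch below derives quantization centroids from a weight tensor with Lloyd's algorithm (1-D k-means), the iterative procedure behind Lloyd-Max quantization. The paper's exact initialization and per-layer bit allocation may differ.

```python
import numpy as np

def lloyd_max_centroids(weights: np.ndarray, num_levels: int = 16, iters: int = 20) -> np.ndarray:
    """Lloyd's algorithm in 1-D: alternate nearest-centroid assignment and
    centroid update (mean of each cell) over the flattened weights."""
    w = weights.ravel()
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(num_levels):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return np.sort(centroids)

# Example: 16-level (4-bit) centroids for a Gaussian-like weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)
print(lloyd_max_centroids(W, num_levels=16))
```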

32-bit baseline → Lloyd-Max centroids → MRACos-regularized training → sub-8-bit model

The MRACos regularizer acts as a pseudo-compressor, pushing weights toward their nearest centroid without requiring complex hyperparameter tuning. This approach enables us to increase model capacity within the same memory budget, actually improving accuracy while maintaining efficiency.
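
A minimal sketch of the idea: the penalty below measures each weight's distance to its nearest centroid, which is the attraction behavior MRACos induces. The actual regularizer uses an absolute-cosine form with region-dependent weighting that this simplification omits.

```python
import numpy as np

def centroid_attraction_penalty(weights: np.ndarray, centroids: np.ndarray) -> float:
    """Mean absolute distance from each weight to its nearest centroid.
    Added to the task loss (scaled by a regularization strength), it acts as a
    pseudo-compressor that pulls weights toward the centroid grid during QAT."""
    dist = np.abs(weights.ravel()[:, None] - centroids[None, :]).min(axis=1)
    return float(dist.mean())

# Toy usage (hypothetical values; in training this term is added to the ASR loss):
centroids = np.linspace(-0.1, 0.1, 15)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(256, 256))
print(centroid_attraction_penalty(W, centroids))  # shrinks as weights cluster on the grid
```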

Method 2: General Quantizer (GQ)

Regularization-Free Approach with Self-Adjustable Centroids

Key Innovation

Unlike traditional methods with fixed centroids, GQ employs a "soft-to-hard" compression mechanism where quantization centroids are learned during training within a μ-Law constrained space. This eliminates the need for regularization while enabling more versatile quantization.
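
The sketch below illustrates only the soft-to-hard mechanism, using a fixed uniform centroid grid for clarity; in GQ the centroids themselves are trainable and constrained to a μ-Law-warped space, which is not shown here.

```python
import numpy as np

def soft_quantize(w: np.ndarray, centroids: np.ndarray, temperature: float) -> np.ndarray:
    """Soft-to-hard assignment: each weight becomes a softmax-weighted mixture
    of centroids. As temperature -> 0 the mixture collapses onto the nearest
    centroid, i.e. ordinary hard quantization."""
    logits = -np.abs(w[..., None] - centroids) / temperature   # similarity to each centroid
    logits -= logits.max(axis=-1, keepdims=True)               # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs * centroids).sum(axis=-1)

centroids = np.linspace(-0.1, 0.1, 15)     # hypothetical 15-level grid
w = np.array([0.033, -0.071, 0.004])
for t in (1e-1, 1e-2, 1e-4):               # annealing schedule: soft -> hard
    print(t, soft_quantize(w, centroids, t))
```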

Soft quantization (learnable centroids) → μ-Law space constraint → gradual hardening → hard quantization

The μ-Law constraint, borrowed from telephony and audio coding, ensures perceptually meaningful compression. This connection to audio coding is natural: just as audio codecs compress waveforms by reducing bit depth while preserving perceptual quality, our method compresses neural networks by reducing weight precision while preserving predictive accuracy.
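
For reference, here is the standard μ-Law companding pair (using the telephony value μ = 255; the paper's setting may differ), along with how a uniform grid in the compressed domain maps back to centroids that are dense near zero, where most weights concentrate.

```python
import numpy as np

MU = 255.0  # standard G.711 telephony value; assumed here for illustration

def mu_law_compress(x: np.ndarray, mu: float = MU) -> np.ndarray:
    """F(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y: np.ndarray, mu: float = MU) -> np.ndarray:
    """Inverse companding: F^-1(y) = sign(y) * ((1 + mu)^|y| - 1) / mu."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# Centroids placed uniformly in the compressed (mu-Law) domain map back to a
# non-uniform grid that is finer near zero.
uniform_in_mu_space = np.linspace(-1.0, 1.0, 9)
print(mu_law_expand(uniform_in_mu_space))
```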

Technical Insights

Connection to Signal Compression

Both approaches draw from classical signal processing: S8BQAT adapts Lloyd-Max optimal scalar quantization from lossy compression, while GQ borrows μ-Law companding from telephony and audio coding.

Hardware-Aware Design: Memory Bandwidth Bottleneck

Critical insight: Sub-8-bit quantization is applied to weights only. Modern inference is often memory-bandwidth bound, not compute-bound. The bottleneck is loading weights from system memory (DRAM) to on-chip memory (GPU/accelerator).

Weight Loading: The Bandwidth Bottleneck

During inference, weights must be transferred from slow system memory to fast on-chip memory. Sub-8-bit weight quantization directly reduces this bandwidth requirement:

System Memory (DRAM): slow, high capacity, ~100 GB/s bandwidth
    ↓ weight transfer (bandwidth limited) ↓
On-Chip Memory (GPU/accelerator): fast, limited capacity, ~1-10 TB/s bandwidth

8-bit weights: more bytes to transfer from DRAM, higher bandwidth usage.
Sub-8-bit weights: fewer bytes to transfer from DRAM, lower bandwidth usage.

Impact: Reducing weight precision from 8-bit to sub-8-bit decreases the amount of data transferred over the memory bus. Since many inference workloads are bandwidth-bound (GPU waiting for weights), this directly improves throughput and latency—critical for online services and real-time applications.
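
To get a rough sense of the arithmetic, the sketch below uses the ~100 GB/s DRAM bandwidth figure quoted above and a hypothetical 100M-parameter model (real devices overlap weight transfer with compute, so these are upper bounds, not measured latencies).

```python
# Transfer time for one full pass over the weights at a given precision and
# DRAM bandwidth (hypothetical model size; illustrates the bandwidth argument).

def transfer_ms(num_params: int, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    bytes_moved = num_params * bits_per_weight / 8
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

for bits in (8, 5, 4, 1):
    print(f"{bits}-bit weights: {transfer_ms(100_000_000, bits, 100):.2f} ms per weight pass")
```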

Publications

Citation

@inproceedings{zhen2022s8bqat,
  title={Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition},
  author={Zhen, Kai and Nguyen, Hieu Duy and Chinta, Raviteja and Susanj, Nathan and Mouchtaris, Athanasios and Afzal, Tariq and Rastrow, Ariya},
  booktitle={Interspeech},
  year={2022}
}

@inproceedings{zhen2022gq,
  title={Sub-8-bit quantization for on-device speech recognition: a regularization-free approach},
  author={Zhen, Kai and Radfar, Martin and Nguyen, Hieu Duy and Strimel, Grant P. and Susanj, Nathan and Mouchtaris, Athanasios},
  booktitle={IEEE SLT},
  year={2023}
}