Kai Zhen

Applied Scientist at Alexa Speech
Ph.D. in Computer Science and Cognitive Science at Indiana University Bloomington
Curriculum vitae

News
June 15, 2022: Check out our paper (to be presented at Interspeech'22) highlighting Alexa's recent efforts on Sub-8-Bit quantization for on-device ASR!
Apr 26, 2021: I received the Outstanding Research Award from the Cognitive Science program for my recent dissertation research.
Apr 19, 2021: Joining Amazon Alexa Speech as an applied scientist!
Apr 6, 2021: I successfully defended my Ph.D. dissertation, entitled “Neural Waveform Coding: Scalability, Efficiency and Psychoacoustic Calibration.”

Research
Since joining Amazon, I have been working on on-device ASR, where model efficiency is as important as accuracy. As a natural extension of my previous research, my colleagues and I have focused on model sparsification (pruning) and quantization.
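To make the two ideas concrete, here is a minimal NumPy sketch of magnitude-based pruning followed by symmetric uniform quantization. This is an illustrative toy, not the training recipe from our papers: the helper names, the 50% sparsity target, and the 5-bit setting (one sub-8-bit choice) are my assumptions for the example.

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude entries of w (illustrative helper)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

def quantize_uniform(w, num_bits):
    """Symmetric uniform quantization to num_bits, returned in float form."""
    qmax = 2 ** (num_bits - 1) - 1
    wmax = np.max(np.abs(w))
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.round(w / scale).clip(-qmax, qmax)  # integer grid of 2*qmax+1 levels
    return q * scale  # dequantized ("fake-quantized") weights

w = np.random.randn(64, 64).astype(np.float32)
w_sparse = prune_by_magnitude(w, sparsity=0.5)
w_quant = quantize_uniform(w_sparse, num_bits=5)
```

In quantization-aware training, a fake-quantization step like this sits in the forward pass so the network learns to tolerate the reduced precision; the sketch only shows the post-hoc arithmetic.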

Check out our recent publications below.

C-006 Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, and Ariya Rastrow, "Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition," In Proc. Annual Conference of the International Speech Communication Association (Interspeech), Incheon, Korea, September 18-22, 2022.
[pdf]

C-005 Kai Zhen, Hieu Duy Nguyen, Feng-Ju (Claire) Chang, Athanasios Mouchtaris, and Ariya Rastrow, "Sparsification via Compressed Sensing for Automatic Speech Recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toronto, ON, Canada, June 6-12, 2021.
[pdf]

During my Ph.D. studies, I conducted research on neural speech and audio waveform coding, supervised by Prof. Minje Kim. The goal is to compress the acoustic waveform into a very compact representation that can be reconstructed with little to no quality degradation. From the perspective of neural network quantization, this amounts to quantizing the activations of one specific (usually the bottleneck) layer.
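The bottleneck-quantization view can be sketched in a few lines. The papers use a softmax-annealed ("soft-to-hard") assignment during training; below is only the hard nearest-neighbor inference step, with an assumed 5-bit (32-entry) scalar codebook and a made-up bottleneck activation standing in for an encoder output.

```python
import numpy as np

def nearest_codebook_quantize(code, codebook):
    """Map each bottleneck activation to its nearest codebook centroid."""
    dists = np.abs(code[..., None] - codebook[None, ...])
    idx = np.argmin(dists, axis=-1)      # integer symbols to be entropy-coded
    return codebook[idx], idx

codebook = np.linspace(-1.0, 1.0, 32).astype(np.float32)  # 5-bit codebook
bottleneck = np.tanh(np.random.randn(256).astype(np.float32))
quantized, symbols = nearest_codebook_quantize(bottleneck, codebook)

# Upper bound on the frame's bitrate before entropy coding
bitrate_bits = symbols.size * np.log2(codebook.size)
```

The decoder only ever sees `quantized` (or the entropy-coded `symbols`), which is what makes the bottleneck the compressed representation.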

Of course, the data-driven paradigm has built a better ladder, but a better ladder may not always "get you to the moon". Often, the better solution comes from marrying the modern computational framework with conventional domain-specific knowledge. To that end, we proposed ways to incorporate residual coding, linear predictive coding, and psychoacoustics into an end-to-end neural waveform codec.
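The residual-coding idea generalizes the cascaded design in C-001: each stage codes whatever the previous stages left behind, and the decoder sums the stage outputs. The sketch below substitutes toy uniform quantizers for the neural coding modules, purely to show the control flow; the step sizes are arbitrary assumptions.

```python
import numpy as np

def residual_cascade(signal, coders):
    """Decoded output = sum of stage outputs; each stage codes the residual."""
    residual = signal.copy()
    decoded = np.zeros_like(signal)
    for code in coders:
        approx = code(residual)   # this stage's coarse approximation
        decoded += approx
        residual -= approx        # what is left for the next stage
    return decoded

# Toy "coders": uniform quantizers with progressively finer step sizes.
coders = [lambda r, s=s: np.round(r * s) / s for s in (2.0, 8.0, 32.0)]
x = np.tanh(np.random.randn(1024))
one_stage = residual_cascade(x, coders[:1])
three_stage = residual_cascade(x, coders)
```

Adding stages shrinks the reconstruction error monotonically, which is also what makes the scheme bitrate-scalable: a decoder can stop after any prefix of the cascade.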

Some of the related publications are

J-002 Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim, "Scalable and Efficient Neural Speech Coding: A Hybrid Design," IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 30 (2021): 12-25.
[pdf]

J-001 Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, and Minje Kim, "Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding," IEEE Signal Processing Letters (SPL), vol. 27, pp. 2159-2163, 2020, doi: 10.1109/LSP.2020.3039765. (Also presented at ICASSP 2022)
[demo] [pdf] [code]

C-004 Haici Yang, Kai Zhen, Seungkwon Beack, Minje Kim, "Source-Aware Neural Speech Coding for Noisy Speech Compression," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toronto, ON, Canada, June 6-12, 2021.
[pdf]

C-003 Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, and Minje Kim, "Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2020.
[demo] [pdf] [code]

C-001 Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, and Minje Kim, "Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding," In Proc. Annual Conference of the International Speech Communication Association (Interspeech), Graz, Austria, September 15-19, 2019.
[demo] [pdf]

P-005 Mi Suk Lee, Seung Kwon Beack, Jongmo Sung, Tae Jin Lee, Jin Soo Choi, Minje Kim, Kai Zhen, "Method and apparatus for processing audio signal," U.S. Patent Application US20210233547A1.

P-004 Minje Kim, Kai Zhen, Mi Suk Lee, Seung Kwon Beack, Jongmo Sung, Tae Jin Lee, Jin Soo Choi "Residual Coding Method of Linear Prediction Coding Coefficient Based on Collaborative Quantization, and Computing Device for Performing the Method," U.S. Patent Application No. 17/098,090.

P-002 Minje Kim, Kai Zhen, Seungkwon Beack, et al., "Audio Signal Encoding Method and Audio Signal Decoding Method, And Encoder And Decoder Performing the Same," U.S. Patent Application US20200135220A1.

Find a complete list of my publications here or on my Google Scholar profile.

Professional Activities
Conference Reviewer
  • ISCA Interspeech - 2022
  • EURASIP European Signal Processing Conference (EUSIPCO) - 2022
  • IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) - 2019 to 2022
  • IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • IEEE International Conference on Data Mining (ICDM) - 2020
  • Association for the Advancement of Artificial Intelligence (AAAI) - 2017, 2018

Journal Reviewer
  • IEEE MultiMedia
  • EURASIP Journal on Audio, Speech, and Music Processing

Fun Fact
    I used to have 14 rabbits -- well, 2 at first. The mother escaped once, and my neighbor helped me get her back. A few months later, my neighbor became an Olympic champion, or so I read in the newspaper. The causality is remarkably limited, though: I now have 2 cats; one escaped and I got him back, but nothing happened afterwards.



    © Copyright 2022, Kai Zhen