Publications

JOURNAL (INTERNATIONAL)

MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Xuankai Chang (Carnegie Mellon University), Pengcheng Guo (Northwestern Polytechnical University), Yuya Fujita, Takashi Maekaku, Shinji Watanabe (Carnegie Mellon University)

IEEE Signal Processing Letters (IEEE SPL)

August 26, 2024

Distant automatic speech recognition (DASR) is a crucial task in speech and audio processing. Recent work has highlighted the efficacy of pre-trained speech foundation models such as Whisper. These models, trained on hundreds of thousands of hours of speech data, show strong performance and generalization across various zero-shot scenarios. However, they handle only single-channel input, because extensive multi-channel speech data is difficult to accumulate; yet the spatial information carried by multi-channel input is important for the DASR task. This study extends the pre-trained Whisper model to accept multi-channel (MC) signals, yielding MC-Whisper. The proposed model adds a multi-channel speech processing branch as a sidecar, so that the foundation model's abilities can be fully exploited on multi-channel input. Experimental results on distant-microphone recordings from the AMI meeting corpus demonstrate substantial improvements with the proposed approach.
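The sidecar idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the frozen encoder is reduced to a single linear map, the sidecar branch is a simple learned channel-weighting plus projection, and all shapes and weight names (`W_enc`, `W_ch`, `W_proj`) are assumptions for illustration. The key point it shows is that the sidecar's output is added to the frozen single-channel path, so only the branch would need training while the foundation model stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: C microphones, T frames, F mel bins, D encoder dim.
C, T, F, D = 4, 100, 80, 16

def frozen_encoder(feats, W_enc):
    # Stand-in for the frozen Whisper encoder: a single linear layer.
    return feats @ W_enc  # (T, F) -> (T, D)

def sidecar_branch(mc_feats, W_ch, W_proj):
    # Hypothetical multi-channel sidecar: weight and fuse the channels,
    # then project the fused features into the encoder dimension.
    fused = np.tensordot(W_ch, mc_feats, axes=(0, 0))  # (C, T, F) -> (T, F)
    return fused @ W_proj  # (T, F) -> (T, D)

W_enc = rng.standard_normal((F, D)) * 0.1   # frozen weights (not trained)
W_ch = np.full(C, 1.0 / C)                  # sidecar init: plain channel average
W_proj = rng.standard_normal((F, D)) * 0.1  # sidecar projection (trainable)

mc_feats = rng.standard_normal((C, T, F))   # multi-channel log-mel features
ref = mc_feats[0]                           # reference channel for the main path

# Sidecar output is added to the frozen path's features.
enc_out = frozen_encoder(ref, W_enc) + sidecar_branch(mc_feats, W_ch, W_proj)
print(enc_out.shape)  # (100, 16)
```

Initializing the channel weights as a uniform average makes the sidecar start out close to a simple beamformer-style fusion, which the training could then refine.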

Paper : MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition (external link)