End-to-end Automatic Speech Recognition with Independent Vector Analysis Frontend - LY Corporation R&D


CONFERENCE (DOMESTIC) End-to-end Automatic Speech Recognition with Independent Vector Analysis Frontend

Robin Scheibler, Zhang Wangyou (Shanghai Jiao Tong University), Chang Xuankai (Shanghai Jiao Tong University), Shinji Watanabe (Carnegie Mellon University), Qian Yanmin (Shanghai Jiao Tong University)

Acoustical Society of Japan, 2022 Autumn Meeting (ASJ 2022 autumn)

September 14, 2022

We address end-to-end (E2E) multi-channel, multi-speaker automatic speech recognition (ASR). Neural beamformers (NBF) separate speakers effectively, but they require ground-truth waveforms for each mixture component. To avoid this requirement, we adopt joint E2E optimization of the frontend and the ASR system, as proposed in the MIMO-speech framework. Unlike a conventional NBF, the neural separation network in time-decorrelation iterative source steering (T-ISS) is single-input single-output (SISO), which allows the number of speakers to change dynamically. T-ISS is also robust to mismatch between the training and test data distributions. We investigate T-ISS as a MIMO-speech frontend, extend it to the overdetermined case (more microphones than speakers), and demonstrate its robustness and adaptability in experiments.

Speech Processing