Conference (International) End-to-End Multi-speaker ASR with Independent Vector Analysis

Robin Scheibler, Wangyou Zhang (Shanghai Jiao Tong University), Xuankai Chang (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University), Yanmin Qian (Shanghai Jiao Tong University)

The 2022 IEEE Spoken Language Technology Workshop (SLT 2022)


We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition. We propose a frontend for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm. It uses the fast and stable iterative source steering algorithm together with a neural source model. Unlike conventional neural beamforming, the number of speakers can be dynamically changed during or after training. The parameters of the ASR module and the neural source model are optimized jointly using the ASR loss itself. We demonstrate performance competitive with previous systems that use neural beamforming frontends, with only one-ninth of the trainable parameters. First, we explore the trade-offs when using various numbers of channels for training and testing. Second, we demonstrate that the proposed IVA frontend performs well on noisy data, even when trained on clean mixtures only. Third, we demonstrate recognition of mixtures of three and four speakers with a model trained on mixtures of two only.
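To make the iterative source steering (ISS) step mentioned in the abstract more concrete, here is a minimal NumPy sketch of ISS-style rank-1 updates for determined IVA. It is an illustration under stated assumptions, not the paper's frontend: it uses a fixed Laplace-type source model in place of the neural source model, performs separation only (no dereverberation), and assumes as many sources as channels. The function name `iss_iva`, its parameters, and the weight scaling are hypothetical choices made for this example.

```python
import numpy as np


def iss_iva(X, n_iter=20, eps=1e-8):
    """ISS-style updates for determined IVA (illustrative sketch).

    X: complex STFT of the mixture, shape (n_channels, n_freq, n_frames).
    Returns the separated signals Y with the same shape as X.
    """
    Y = X.copy()  # start from the identity demixing matrix
    n_src, n_freq, n_frames = Y.shape

    for _ in range(n_iter):
        # Source model (assumption): Laplace-type weights shared across
        # frequencies. A neural source model would replace this step.
        r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=1))   # (n_src, n_frames)
        phi = 1.0 / np.maximum(2.0 * r, eps)          # (n_src, n_frames)

        for k in range(n_src):
            yk = Y[k]  # current estimate of source k, (n_freq, n_frames)

            # Weighted correlation of every source with source k.
            num = np.sum(phi[:, None, :] * Y * np.conj(yk)[None], axis=2)
            den = np.sum(phi[:, None, :] * np.abs(yk)[None] ** 2, axis=2)

            # Steering coefficients, shape (n_src, n_freq).
            v = num / np.maximum(den, eps)
            # Source k itself is only rescaled to unit weighted power.
            v[k] = 1.0 - 1.0 / np.sqrt(np.maximum(den[k] / n_frames, eps))

            # Rank-1 update: subtract a scaled copy of source k from all sources.
            Y = Y - v[:, :, None] * yk[None]

    return Y
```

Each update touches the separated signals directly and involves no matrix inversion, which is one reason this family of updates is convenient to unroll inside a network trained end to end. Because nothing in the loop depends on a fixed number of sources, the same routine runs unchanged on two, three, or four channels.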

Paper: End-to-End Multi-speaker ASR with Independent Vector Analysis