Conference (international): Multi-channel separation of dynamic speech and sound events

Takuya Fujimura (Nagoya University), Robin Scheibler

The 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)


We propose a multi-channel separation method for moving sound sources. We build upon a recent beamformer for a moving speaker that uses attention-based tracking. This method uses an attention mechanism to compute time-varying spatial statistics, which enables tracking of the moving source. While this prior work aimed to extract a single target source, we simultaneously estimate multiple sources. Our main technical contribution is to introduce attention-based tracking into the iterative source steering algorithm for independent vector analysis (IVA), enabling joint estimation of multiple sources. We experimentally show that the proposed method greatly improves the separation performance for moving speakers, including an absolute reduction of 27.2% in word error rate compared to time-invariant IVA. In addition, we demonstrate that the proposed method is effective as a pre-processing step for sound event detection, showing an improvement in F1 score of up to 4.7% on real recordings.
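The core idea, attention over time frames to form time-varying spatial statistics, can be illustrated with a minimal sketch. The snippet below is a simplified, hypothetical stand-in (not the authors' implementation): it uses normalized inner products between frames in place of a learned query/key attention, and computes an attention-weighted spatial covariance per frame for a single frequency bin.

```python
import numpy as np

def attention_spatial_covariance(X, tau=1.0):
    """Hypothetical sketch: time-varying spatial covariances via
    attention over frames, for one frequency bin.

    X   : (T, M) complex STFT frames (T frames, M channels).
    tau : temperature of the softmax over frames (assumed parameter).
    Returns R : (T, M, M) attention-weighted covariance matrices.
    """
    # Per-frame outer products x_t x_t^H.
    outer = np.einsum("tm,tn->tmn", X, X.conj())
    # Frame similarity: normalized inner products, standing in for a
    # learned query/key attention mechanism.
    feat = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    sim = np.abs(feat @ feat.conj().T) / tau          # (T, T)
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # softmax over frames
    # For each target frame t, average the outer products with weights w[t].
    return np.einsum("ts,smn->tmn", w, outer)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
R = attention_spatial_covariance(X)
print(R.shape)  # (8, 4, 4)
```

Because the weights are non-negative and each covariance is a convex combination of rank-one Hermitian terms, every `R[t]` is Hermitian positive semi-definite, as required when such statistics are plugged into a beamformer or into IVA update rules.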

Paper: Multi-channel separation of dynamic speech and sound events (external site)