Publications

CONFERENCE (INTERNATIONAL) Integration of Semi-Blind Speech Source Separation and Voice Activity Detection for Flexible Spoken Dialogue

Masaya Wake (Kyoto University), Masahito Togami, Kazuyoshi Yoshii (Kyoto University), Tatsuya Kawahara (Kyoto University)

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2020 (APSIPA ASC 2020)

December 07, 2020

Conventionally, speech separation (SS) and voice activity detection (VAD) have been investigated separately with different criteria. In natural dialogue systems such as conversational robots, however, it is critical to accurately separate and detect user utterances even while the system is speaking. This study addresses the integration of semi-blind speech source separation (SS) and voice activity detection (VAD) using a single recurrent neural network, under the condition that the speech source and voice activity of the system are given. It investigates three integrated network architectures in which SS and VAD are performed either simultaneously or sequentially, with one task prioritized over the other. The proposed methods take as input a single-channel microphone observation spectrum, the system's speech source spectrum, and the system's voice activity, and output the user's speech source spectrum and voice activity. Each network adopts long short-term memory (LSTM) to take the temporal dependency of speech into account. An experimental evaluation on a dataset of recorded dialogues between a user and the android ERICA shows that the proposed method which conducts the two tasks sequentially, with SS first, achieves the best performance on both SS and VAD.
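The sketch below illustrates, in PyTorch, one way the SS-first sequential variant described above could be structured: an LSTM estimates a time-frequency mask for the user's speech from the microphone spectrum, the system's speech spectrum, and the system's voice activity, and a second LSTM predicts frame-level user voice activity from the separated spectrum. All layer sizes, the masking formulation, and the module and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an SS-first integrated SS+VAD network (assumed architecture).
import torch
import torch.nn as nn

class SemiBlindSSVAD(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # SS stage: mic spectrum + system speech spectrum + system VAD -> user speech mask
        self.ss_lstm = nn.LSTM(input_size=2 * n_freq + 1, hidden_size=hidden, batch_first=True)
        self.ss_out = nn.Linear(hidden, n_freq)          # time-frequency mask for the user source
        # VAD stage: separated user spectrum -> frame-level user voice activity
        self.vad_lstm = nn.LSTM(input_size=n_freq, hidden_size=hidden, batch_first=True)
        self.vad_out = nn.Linear(hidden, 1)

    def forward(self, mic_spec, sys_spec, sys_vad):
        # mic_spec, sys_spec: (batch, frames, n_freq) magnitude spectra
        # sys_vad:            (batch, frames, 1) known system voice activity
        x = torch.cat([mic_spec, sys_spec, sys_vad], dim=-1)
        h, _ = self.ss_lstm(x)
        mask = torch.sigmoid(self.ss_out(h))
        user_spec = mask * mic_spec                      # separated user speech spectrum
        h2, _ = self.vad_lstm(user_spec)
        user_vad = torch.sigmoid(self.vad_out(h2))       # user voice activity per frame
        return user_spec, user_vad
```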

Paper: Integration of Semi-Blind Speech Source Separation and Voice Activity Detection for Flexible Spoken Dialogue (external link)