JOURNAL (INTERNATIONAL) Self-conditioning via Intermediate Predictions for End-to-end Neural Speaker Diarization

Yusuke Fujita, Tetsuji Ogawa (Waseda University), Tetsunori Kobayashi (Waseda University)

IEEE Access

December 07, 2023

This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method belongs to the end-to-end neural diarization (EEND) family, a promising approach that treats speaker diarization as multi-label classification with a neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the model through the self-conditioning mechanism, which was originally applied to automatic speech recognition models. With self-conditioning, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To train the attractor-based EEND model efficiently, we propose an improved attractor computation module, the non-autoregressive attractor, which produces all speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments on the CALLHOME two-speaker dataset show that the proposed self-conditioning improves diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
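The self-conditioning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy stack of frame-wise layers in place of Transformer encoder layers, and the names `self_conditioned_forward`, `w_out`, and `w_back` are hypothetical. It only shows the general pattern: each layer emits intermediate speaker posteriors, and those posteriors are projected back into the hidden space to condition the following layers.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def self_conditioned_forward(x, layer_weights, w_out, w_back):
    """Toy forward pass with self-conditioning (illustrative assumption).

    x:             (T, D) frame-level features
    layer_weights: list of (D, D) matrices, one per toy layer
    w_out:         (D, S) shared classifier, hidden state -> S speaker logits
    w_back:        (S, D) projection of predictions back to the hidden space
    Returns a list of per-layer speaker posteriors, each of shape (T, S).
    """
    h = x
    predictions = []
    for i, w in enumerate(layer_weights):
        # Stand-in for an encoder layer with a residual connection.
        # Assumption: a real model would use self-attention here, so each
        # frame could attend to the whole sequence of fed-back predictions.
        h = np.tanh(h @ w) + h
        # Intermediate prediction: independent sigmoid per speaker
        # (multi-label classification, so overlapping speech is allowed).
        p = sigmoid(h @ w_out)
        predictions.append(p)
        # Feed the intermediate prediction back into the hidden state so
        # that later layers are conditioned on it (skipped after the last).
        if i < len(layer_weights) - 1:
            h = h + p @ w_back
    return predictions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, S = 5, 8, 2  # frames, hidden size, speakers (arbitrary toy values)
    x = rng.normal(size=(T, D))
    layers = [rng.normal(scale=0.1, size=(D, D)) for _ in range(3)]
    preds = self_conditioned_forward(
        x, layers, rng.normal(size=(D, S)), rng.normal(size=(S, D))
    )
    for depth, p in enumerate(preds, start=1):
        print(f"layer {depth}: posteriors shape {p.shape}")
```

In training, a loss would typically be applied to every element of the returned list, not only the last, so that the intermediate predictions are themselves meaningful. That detail is an assumption based on how self-conditioning is commonly described, not something stated in the abstract.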

Paper: Self-conditioning via Intermediate Predictions for End-to-end Neural Speaker Diarization