Publications

WORKSHOP (INTERNATIONAL) Sound Event Localization and Detection with pre-trained Audio Spectrogram Transformer and Multichannel Separation Network

Robin Scheibler, Tatsuya Komatsu, Yusuke Fujita, Michael Hentschel

Detection and Classification of Acoustic Scenes and Events (DCASE 2022)

November 3, 2022

We propose a sound event localization and detection system built around a CNN-Conformer base network. Our main contribution is to evaluate the use of pre-trained components in this system. First, a pre-trained multichannel separation network separates overlapping events. Second, a fine-tuned self-supervised audio spectrogram transformer provides a priori classification of sound events in both the mixture and the separated channels. We propose three architectures that combine these extra features with the base network. We first train on the STARSS22 dataset, extended by simulation using events from FSD50K and room impulse responses from previous challenges. To bridge the gap between the simulated data and STARSS22, we fine-tune the models on only the training part of the STARSS22 development dataset before the final evaluation. Experiments reveal that both the pre-trained separation and classification models improve the final performance, though the extent depends on the network architecture adopted.
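The abstract describes the fusion of pre-trained features into the CNN-Conformer trunk only at a high level. The PyTorch sketch below illustrates one plausible way such concatenation-based fusion could be wired up. Everything here is an assumption for illustration: the `FusionSELD` class, the 4-channel input, all layer sizes, the standard Transformer encoder standing in for Conformer blocks, and the ACCDOA-style output head are not taken from the paper.

```python
import torch
import torch.nn as nn

class FusionSELD(nn.Module):
    """Illustrative sketch: fuse pre-trained classifier embeddings into a
    CNN-Conformer-style SELD trunk by concatenation. Hypothetical sizes."""

    def __init__(self, n_mels=64, emb_dim=768, d_model=256, n_events=13):
        super().__init__()
        # CNN front-end over 4-channel log-mel spectrograms (e.g. FOA input)
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d((1, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d((1, 4)),
        )
        cnn_out = 64 * (n_mels // 16)
        # Project concatenated [CNN features ; pre-trained embedding] to model dim
        self.proj = nn.Linear(cnn_out + emb_dim, d_model)
        # Stand-in for the Conformer blocks: a plain Transformer encoder
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        # ACCDOA-style head: one (x, y, z) activity vector per event class
        self.head = nn.Linear(d_model, 3 * n_events)

    def forward(self, spec, ast_emb):
        # spec: (B, 4, T, n_mels); ast_emb: frame-aligned embeddings (B, T, emb_dim)
        h = self.cnn(spec)                    # (B, 64, T, n_mels/16)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (B, T, cnn_out)
        h = torch.cat([h, ast_emb], dim=-1)   # fuse pre-trained features
        h = self.encoder(self.proj(h))
        return self.head(h)                   # (B, T, 3 * n_events)

model = FusionSELD()
spec = torch.randn(2, 4, 100, 64)    # batch of 4-channel log-mel inputs
ast_emb = torch.randn(2, 100, 768)   # frame-aligned transformer embeddings
print(model(spec, ast_emb).shape)    # torch.Size([2, 100, 39])
```

Concatenating the frame-aligned embeddings before the encoder is only one possible fusion point; the three architectures compared in the paper differ precisely in where and how the extra features enter the base network.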

Paper: Sound Event Localization and Detection with pre-trained Audio Spectrogram Transformer and Multichannel Separation Network (external link)