Publications
CONFERENCE (INTERNATIONAL) On Sorting and Padding Multiple Targets for Sound Event Localization and Detection with Permutation Invariant and Location-based Training
Robin Scheibler, Tatsuya Komatsu, Yusuke Fujita, Michael Hentschel
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2022 (APSIPA ASC 2022)
November 07, 2022
We explore the performance of permutation invariant and location-based training (PIT and LBT, respectively) for sound event localization and detection (SELD). Because SELD is intrinsically a multi-output, multi-class, and multi-task problem, the design space of its loss functions is large and, as yet, rather unexplored. Our study revolves around the multiple activity-coupled direction of arrival target format, which cleverly combines direction and event probability into a single mean squared error loss. While PIT and its variant, auxiliary duplicating PIT (ADPIT), have featured prominently in recent DCASE challenges, LBT has not yet been applied to SELD. In this work, we investigate some modifications to PIT and ADPIT, as well as the application of LBT to SELD. First, the PIT loss is changed to have a variable number of tracks per event class, providing extra flexibility. Second, we propose auxiliary duplicating or silence PIT (ADPIT-S), where unused tracks are indifferently filled with either a duplicate event or nothing. Finally, we propose to use LBT with the events ordered by Cartesian or polar coordinates. We give two ways of padding the unused tracks: with zeros, or by repeating the last event. We conduct experiments using the STARSS22 dataset from the DCASE Challenge 2022. We find that ordering by Cartesian coordinates with repeat padding is best for LBT. When comparing all loss functions, we find, surprisingly, that PIT performs best. In addition, LBT turns out to be competitive with PIT and ADPIT. While ADPIT-S has slightly worse overall performance, it does better in terms of error rate and F-score metrics.
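The target construction for LBT and the PIT loss described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `lbt_targets` and `pit_mse`, the lexicographic (x, y, z) sort key, and the use of one 3-D vector per track are assumptions made for the example. Each active event is taken to be a Cartesian vector, and a silent track is the zero vector.

```python
import itertools

import numpy as np


def lbt_targets(events, n_tracks, padding="repeat"):
    """Build a fixed-size LBT target from a variable number of events.

    events: array-like of shape (n_events, 3), one Cartesian vector per event.
    padding: "repeat" copies the last sorted event into unused tracks,
             "zeros" leaves unused tracks silent.
    The lexicographic sort key (x, then y, then z) is an assumption.
    """
    events = np.asarray(events, dtype=float).reshape(-1, 3)
    target = np.zeros((n_tracks, 3))
    if len(events) == 0:
        return target  # no active event: every track is silent
    # np.lexsort treats the LAST key as the primary one, so x is passed last.
    order = np.lexsort((events[:, 2], events[:, 1], events[:, 0]))
    events = events[order][:n_tracks]
    target[: len(events)] = events
    if padding == "repeat":
        target[len(events):] = events[-1]
    return target


def pit_mse(pred, target):
    """Permutation invariant MSE: the smallest MSE over all track orderings."""
    n_tracks = len(pred)
    return min(
        float(np.mean((pred - target[list(perm)]) ** 2))
        for perm in itertools.permutations(range(n_tracks))
    )
```

With two events `[[0, 1, 0], [-1, 0, 0]]` and three tracks, `lbt_targets` places `[-1, 0, 0]` first and `[0, 1, 0]` second, and the third track is either a copy of `[0, 1, 0]` (repeat padding) or all zeros. LBT fixes the track assignment by sorting, so the loss is a plain MSE against this target, whereas `pit_mse` searches over all assignments and is therefore unaffected by the order in which the network emits its tracks.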
Paper: On Sorting and Padding Multiple Targets for Sound Event Localization and Detection with Permutation Invariant and Location-based Training (external link)