Publications
CONFERENCE (INTERNATIONAL)
CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
Hokuto Munakata, Takehiro Imamura (Nagoya University), Taichi Nishimura, Tatsuya Komatsu
2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
May 06, 2026
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has many potential applications, there is still no established benchmark built on real-world data. The initial study of AMR trained its models solely on synthetic datasets and evaluated them on an annotated dataset of fewer than 100 samples, which made the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1009, 213, and 640 audio recordings in the training, validation, and test splits, respectively, making it 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
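The abstract reports results in Recall1@0.7 without defining it. As a rough illustration, the sketch below follows the convention commonly used in moment retrieval: a query counts as a hit when its top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.7. The function names, the one-ground-truth-per-query assumption, and the sample values are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at_iou(top1_preds, ground_truths, threshold=0.7):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes exactly one ground-truth moment per query (an illustrative
    simplification, not necessarily the paper's protocol).
    """
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(top1_preds, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


# Made-up example: two queries, one top-1 prediction each.
preds = [(0.0, 10.0), (5.0, 9.0)]
gts = [(1.0, 10.0), (0.0, 10.0)]
print(recall1_at_iou(preds, gts))  # 50.0 (IoUs are 0.9 and 0.4)
```

Under this reading, a gain of 10.4 points means that roughly 10 more queries per 100 have a top-ranked segment that clears the 0.7 overlap threshold.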
Paper: CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
Software: https://h-munakata.github.io/CASTELLA-demo/
PDF: CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries