CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries - LY Corporation R&D

CONFERENCE (INTERNATIONAL)

Hokuto Munakata, Takehiro Imamura (Nagoya University), Taichi Nishimura, Tatsuya Komatsu

2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)

May 06, 2026

We introduce CASTELLA, a human-annotated audio benchmark for audio moment retrieval (AMR). Although AMR has many potential applications, there is still no established benchmark built on real-world data. The initial AMR study trained its models solely on synthetic datasets and evaluated them on an annotated dataset of fewer than 100 samples, which made the reported performance less reliable. To support reliable performance in real-world environments, we present CASTELLA, a large-scale, manually annotated AMR dataset. CASTELLA consists of 1009, 213, and 640 audio recordings in the training, validation, and test splits, respectively, making it 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. In our experiments, a model fine-tuned on CASTELLA after pre-training on synthetic data outperformed a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
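The abstract reports results in Recall1@0.7 without defining it. As a minimal sketch, assuming the definition commonly used in moment retrieval (the percentage of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.7), the metric could be computed as below. The function names, the single-ground-truth-per-query simplification, and the example values are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair. This representation is an assumption
# made for illustration; it is not the dataset's actual annotation format.
Moment = Tuple[float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top1_predictions: List[Moment],
    ground_truths: List[Moment],
    iou_threshold: float = 0.7,
) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes exactly one ground-truth moment per query (a simplification).
    """
    if len(top1_predictions) != len(ground_truths):
        raise ValueError("predictions and ground truths must align one-to-one")
    if not top1_predictions:
        return 0.0
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(top1_predictions)


# Made-up example: three queries, two of which clear the 0.7 threshold.
example_predictions = [(10.0, 20.0), (0.0, 5.0), (30.0, 50.0)]
example_ground_truths = [(11.0, 20.0), (3.0, 9.0), (30.0, 45.0)]
example_score = recall_at_1(example_predictions, example_ground_truths)
```

Under this reading, a gain of 10.4 points means that about 10.4% more of the test queries had their top-ranked moment localized within the 0.7 IoU threshold.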

Paper: CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

Software: https://h-munakata.github.io/CASTELLA-demo/

PDF: CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries