Publications
WORKSHOP (INTERNATIONAL)
CAVIARES: Corpus for Audio-Visual Expressive Voice Agent
Jinsheng Chen (The University of Tokyo), Yuki Saito (The University of Tokyo), Dong Yang (The University of Tokyo), Naoko Tanji (The University of Tokyo), Hironori Doi, Byeongseon Park, Yuma Shirahata, Kentaro Tachibana, Hiroshi Saruwatari (The University of Tokyo)
2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)
December 09, 2025
High-quality audio-visual corpora are essential for building voice agents capable of natural human-machine communication, but existing corpora commonly contain only a limited amount of data per speaker, making personalized modeling difficult. We present CAVIARES, a new audio-visual corpus comprising 9.5 hours of expressive speech recorded by a single professional Japanese female speaker. CAVIARES consists of two subsets, acted dialogue and expressive reading, providing a diverse range of speaking styles for speech-to-facial-motion modeling and multimodal learning tasks. In this paper, we describe the construction process of CAVIARES and the results of our corpus analysis. CAVIARES will be released for research purposes only.