Publications

CONFERENCE (INTERNATIONAL)

Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text Features

Tatsuya Komatsu, Hokuto Munakata, Yuchi Ishikawa

The 26th Annual Conference of the International Speech Communication Association (INTERSPEECH 2025)

August 17, 2025

We propose a novel approach to audio-text contrastive learning that leverages unlabeled audio by introducing audio-composed text features. First, we generate composed audio by additively combining labeled and unlabeled audio. To obtain a text feature aligned with this newly composed audio, we introduce an audio-to-text (a2t) module that transforms the features of unlabeled audio into the corresponding text feature. The newly generated text feature is then concatenated with the original text of the labeled audio and passed through a text encoder to produce the audio-composed text features. By pairing these features with the composed audio for contrastive learning, our approach effectively integrates information from both labeled and unlabeled data. In audio-text retrieval experiments on Clotho and AudioCaps, the proposed method achieves notable improvements in Recall@1, with relative gains of 9.3% and 13.6%, respectively, compared to models trained solely with labeled audio.
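The pipeline described in the abstract can be illustrated with a toy sketch. The snippet below is not the authors' implementation: the encoders are stand-in linear layers, the dimensions are arbitrary, and the symmetric InfoNCE loss is one common choice for audio-text contrastive training. It only shows the data flow: mix labeled and unlabeled audio, map the unlabeled audio's feature to a pseudo text token via the a2t module, concatenate that token with the labeled caption's tokens, and contrast the re-encoded text against the composed audio.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, TEXT_DIM, EMBED_DIM = 128, 128, 64  # illustrative sizes, not from the paper

audio_encoder = nn.Linear(AUDIO_DIM, EMBED_DIM)  # stand-in for a real audio encoder

class TextEncoder(nn.Module):
    """Toy text encoder: mean-pools a token sequence, then projects."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TEXT_DIM, EMBED_DIM)

    def forward(self, tokens):  # tokens: (batch, seq, TEXT_DIM)
        return self.proj(tokens.mean(dim=1))

text_encoder = TextEncoder()
a2t = nn.Linear(EMBED_DIM, TEXT_DIM)  # audio-to-text (a2t) module, here a single layer

def info_nce(a, t, tau=0.07):
    """Symmetric InfoNCE over in-batch negatives (one common contrastive loss)."""
    a, t = F.normalize(a, dim=-1), F.normalize(t, dim=-1)
    logits = a @ t.T / tau
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def training_step(labeled_audio, labeled_text_tokens, unlabeled_audio):
    # 1) Compose audio by additively mixing labeled and unlabeled audio.
    composed_audio = labeled_audio + unlabeled_audio

    # 2) Transform the unlabeled audio's feature into a pseudo text token via a2t.
    pseudo_token = a2t(audio_encoder(unlabeled_audio)).unsqueeze(1)  # (batch, 1, TEXT_DIM)

    # 3) Concatenate with the labeled caption's tokens and re-encode to obtain
    #    the audio-composed text feature.
    composed_tokens = torch.cat([labeled_text_tokens, pseudo_token], dim=1)
    composed_text_feat = text_encoder(composed_tokens)

    # 4) Contrast the composed audio against its audio-composed text feature.
    return info_nce(audio_encoder(composed_audio), composed_text_feat)

# Random tensors standing in for a real batch (8 clips, 5-token captions).
loss = training_step(torch.randn(8, AUDIO_DIM),
                     torch.randn(8, 5, TEXT_DIM),
                     torch.randn(8, AUDIO_DIM))
```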

Paper: Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text Features (external link)