Publications
CONFERENCE (INTERNATIONAL)
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu
2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025)
April 08, 2025
Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
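The core idea in the abstract, adding a multi-label object-prediction loss on top of the existing contrastive masked-autoencoder objective, can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation: the function names, the shared label vector for both modalities, and the weighting factor `lam` are all assumptions for exposition.

```python
import numpy as np

def sigmoid(x):
    # elementwise logistic function for multi-label prediction
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_bce(logits, targets):
    # binary cross-entropy averaged over object-label dimensions;
    # targets are 0/1 indicators from the automatic labeling pipeline
    p = sigmoid(logits)
    eps = 1e-7  # numerical guard for log(0)
    return -np.mean(targets * np.log(p + eps)
                    + (1.0 - targets) * np.log(1.0 - p + eps))

def total_loss(base_loss, audio_logits, visual_logits, labels, lam=1.0):
    # base_loss stands in for the contrastive + reconstruction objective;
    # the label-prediction loss (hypothetical weighting lam) is added
    # for both the audio and visual branches
    label_loss = multilabel_bce(audio_logits, labels) \
               + multilabel_bce(visual_logits, labels)
    return base_loss + lam * label_loss
```

When the predicted logits agree with the labels (large positive logits on present objects, large negative on absent ones), the added term is near zero, so the auxiliary loss only steers the encoder when its object predictions are wrong.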
Paper: DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information (external link)
PDF: DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information